Trix & Graphix

Tips for cleaning up an ASCII file

Suppose you have a table in an ASCII file such as this one:

element1x1 element1x2 element1x3

element2x2 element2x2 element2x3

element3x1, element3x2, element3x3

It's full of undesirable inconsistencies: tabs, commas instead of plain spaces as column delimiters, stray spaces, empty lines... How can you homogenize this file using simple Unix command-line tools?
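Tabs and stray spaces are hard to see in a listing, so it can help to build a small test file by hand. The following is only a sketch: the exact mix of tabs, commas, extra spaces and blank lines is an assumption made for illustration, and the name file.asc is simply the one used in the commands of this post.

```shell
# Build a small test file. The placement of tabs (\t), extra spaces,
# commas and blank lines here is made up for illustration.
printf 'element1x1\telement1x2   element1x3\n\n'  >  file.asc
printf '  element2x2 element2x2\telement2x3 \n\n' >> file.asc
printf 'element3x1, element3x2, element3x3\n'     >> file.asc

# Show the file unambiguously: sed's "l" command prints tabs as \t
# and marks the end of every line with $.
sed -n l file.asc
```

Running the commands of this post on such a file lets you check each step before trusting it with real data.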

Well, the first thing is to get rid of the tabs. This is easy with the tr tool, which translates some characters into others. To replace each tab with a space, just type

tr "\t" " " < file.asc

and this will write a copy of the file, with every tab replaced by a space, to the standard output. Similarly, to replace the "," symbols you can pipe the output of that command into a second tr:

tr "\t" " " < file.asc | tr "," " "
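As a side note, tr works on sets of characters, so both substitutions can probably be folded into a single call. A minimal sketch; the function name tabs_commas_to_spaces and the sample line are made up for illustration:

```shell
# Hypothetical helper: turn every tab and every comma into a space.
# tr maps the first set onto the second one character by character,
# so the second set contains two spaces.
tabs_commas_to_spaces() {
    tr '\t,' '  '
}

printf 'a\tb,c\n' | tabs_commas_to_spaces   # prints: a b c
```

With a real file you would call it as tabs_commas_to_spaces < file.asc.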

The next thing is to remove the empty lines and the undesired spaces. This can be done using awk:

tr "\t" " " < file.asc | tr "," " " | awk '{ if(NF>0) {for(i=1;i<=NF;i++) printf "%s ", $i; printf "\n"}}'
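For readers new to awk, the same program can be spread over several lines and commented. This is just the one-liner above laid out for reading; the wrapper name squeeze_fields and the sample input are made up for illustration.

```shell
# Hypothetical wrapper around the awk one-liner, laid out for readability.
squeeze_fields() {
    awk '{
        if (NF > 0) {                  # NF = number of fields; it is 0 on blank lines
            for (i = 1; i <= NF; i++)  # fields are split on runs of spaces and tabs
                printf "%s ", $i       # print each field followed by one space
            printf "\n"                # finish the output line
        }
    }'
}

# Leading and repeated spaces disappear and the blank line is dropped,
# but every output line still ends with one space.
printf '  a   b \n\n c  d\n' | squeeze_fields
```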

Finally, the awk command leaves an undesired space at the end of each line. It can be removed using sed:

tr "\t" " " < file.asc | tr "," " " | awk '{ if(NF>0) {for(i=1;i<=NF;i++) printf "%s ", $i; printf "\n"}}' | sed 's/ $//'

Well, that is all. The final output you get on the standard output after running this command is

element1x1 element1x2 element1x3
element2x2 element2x2 element2x3
element3x1 element3x2 element3x3
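The same result can also be obtained without the final sed step. One common awk idiom is to assign a field to itself, which makes awk rebuild the line with a single space between fields and nothing at the end. A sketch under that assumption, not the only way to do it; the function name clean_table and the sample input are made up for illustration:

```shell
# Alternative sketch: "$1 = $1" forces awk to rebuild the record with the
# output field separator (one space by default), so no trailing space is
# left behind. The bare NF pattern skips lines that have no fields.
clean_table() {
    tr "\t" " " | tr "," " " | awk 'NF { $1 = $1; print }'
}

printf 'a\tb,  c\n\n d, e,f \n' | clean_table
# prints:
# a b c
# d e f
```

With a real file you would call it as clean_table < file.asc.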

Yes, I know, I have used a lot of tools with few explanations. Maybe in a future post I will explain some of these tools in more detail...


Anonymous said...

What a great web log. I spend hours on the net reading blogs about tons of various subjects. I have to first of all give praise to whoever created your theme, and second of all to you for writing what I can only describe as a fabulous article. I honestly believe there is a skill to writing articles that only very few possess, and honestly you've got it. The combination of demonstrative and upper-class content is by all odds super rare given the astronomical number of blogs in cyberspace.

Ontureño said...

Wow, thanks a lot. I began this blog a long time ago. I didn't expect it to be useful to anyone but me. I just use it as a notepad to write down the tips and ideas I come across in my day-to-day work.

Nevertheless, comments like yours make me feel useful. Perhaps I should take the blog more seriously and post more often. The fact is that I have hundreds of ideas to post, but little time...

