Anyone who works in data analysis knows that any assumptions that you make about the formatting of the data that you receive are bound to be wrong. (Read: Assume the data came from a caveman, just to be safe.)
Handy Line
At a minimum, even if everything else is perfect (unlikely), the column names are probably not in the same order in every data set. So, rather than looking up the column number every time, I use the following line to store the number of the column of interest — in this case the “Chr” (chromosome) column — for later use throughout the script. It’s pretty basic, but super useful:
chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/\t/\n/g'| sed '/^\s*$/d' | awk '$0 ~/^Chr$/ {print NR}')
Here's what's happening:
sed "s/\r//g" $DATAFILE
– strip out any weird Windows carriage returnshead -n1
– look only at the first (header) linesed 's/\t/\n/g'
– replace all tab characters with newlinessed '/^\s*$/d'
– strip out any empty lines (perhaps there is an extra separarator between two of the columns)awk '$0 ~/^Chr$/ {print NR}'
– return the line number of the line that contains “Chr” exactly
Note: In the above example, I’ve assumed that the columns are tab-delimited, which is not always the case. In case your columns are space-delimited, #3 receives a slight alteration, so the line becomes:
chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/ /\n/g'| sed '/^\s*$/d'| awk '$0 ~/^Chr$/ {print NR}')
Real World Application
I enjoy making my scripts as adaptive as possible, and I think I’ve succeeded pretty well with this one here which transfers data from a Genome Studio “full data table” to PLiNK format. In fact, one of my colleagues, Wei-Min Chen, who doesn’t usually gush, called it “perfect”. I’ve written a whole long post discussing my solutions to the various challenges of the task, but, I’m not sure how many people will be interested… so I’ll keep it for another day.
Reblogged this on If Only Life Had A Command-Line and commented:
I’ve made a few updates to the post about storing column locations, based on some input from a colleague.
LikeLike