Addicted to Public Health

A couple years ago I started becoming obsessed with the opioid epidemic. I spend a non-trivial amount of my time thinking about it and if I am ever scrambling for a topic in a social situation, it ends up being pretty much the only thing I can think of. (Because I am So Smooth.) As a public health devotee, the epidemic hits all of my passions. To name just a few: Issue exacerbated by outdated stigma? Check. Multiple demographics impacted? Big time. Heartbreaking narratives? You will never stop crying. Socioeconomic confounders? And how.

Recently, alarm over the epidemic has reached such a fever pitch that various agencies have started hosting opioid-crisis-focused datathons/codathons/hackathons. (Three different new words for essentially the same concept does seem a little excessive, I agree.)

Being a data scientist is pretty much the coolest thing, and the world seems to have caught on to this fact. This means that there is more competition to do the really interesting work, but it also means that there is a critical mass of data scientists who will be interested in sciencing together for a marathon period (24 hours straight at least) on particular topics in our personal time.  I love attending them. They are a great opportunity to build less-exercised skills through a sudden flood of experience hours. So, you can just imagine how I feel about getting access to new opioid epidemic data as part of a hackathon. To spell it out: Teamwork + Opioid Epidemic + Data + Hackathon = G.O.A.T.

I have participated in two opioid hackathons in the past six months, and some of my coworkers and I are planning one of our own, sponsored by our company. One of my kick-ass data science colleagues, Catherine Ordun, submitted our results for these events in an abstract to the International Society for Disease Surveillance (ISDS) Conference in Orlando this year, and we are presenting tomorrow! I’ll be back to post more about the topic (unless I take another 3-year hiatus, obvi).



There’s Going to be an App for That!

This past year I got involved with Software Carpentry, a group which teaches basic research computing to scientists. I’ve helped Stephen Turner teach a few RNA-Seq workshops through UVA’s BioConnector, and last March I taught my first independent-from-Stephen workshop with two other UVA instructors, Alex Koeppel and Zhuo Fu.

Teaching basic shell programming!

Look, up here at this line!

We had such a good time working together and with Bart Ragon and VP Nagraj from the UVA Health Sciences Library that we decided to keep meeting in this awesome HSL room every week.

Yes, that’s TWO screens for coding.

Doesn’t it look like the bridge of the Enterprise? Make it so.

At first we decided to keep meeting in order to review and debug each other’s code, but then I had a brainwave. Perhaps we could work on an independent project together?

We all have a basic familiarity with programming in R, so that seemed like the language of choice. At first I had envisioned some interactive modifications to LocusZoom, something along the lines of an app where you can learn more about particular SNPs by clicking or hovering over their representations on a graph. However, upon sober reflection, I realized that I am the only one in the main group who works in genomics, and the amount of domain-specific background knowledge required would be extremely high. Additionally, fiddling with such a feature-rich program as LocusZoom might not make for a great starter project.

As part of my position at Public Health Sciences I work with both the Center for Public Health Genomics (CPHG) and the Institute of Law, Psychiatry and Public Policy (ILPPP). The specific project that I work on involves analyzing court data pertaining to mental health proceedings in the State of Virginia. It’s a very different domain from my other bioinformatics work, but it ends up being a perfect fit as a project to cut our teeth on app construction with R. The data tables are fairly straightforward, even if deeper understanding requires further domain knowledge. Also, there would be actual immediate public policy benefits to having interactive and layered representations of the data. (e.g. allowing lawmakers to see up-to-date graphs of commitment trends in their specific districts.)

So, from now on, the inter-departmental SWC project group (cool nickname pending), along with my colleague, Ashleigh Allen from the ILPPP, will be spending weekly meetings brainstorming/planning/building an app. We’re researching various R packages to help us toward our goal. Right now we are considering shiny.

I’ll keep both of my readers up to date on our project as it takes shape!

It’s 10 o’clock — Do you know where your columns are?

"All things change, and we change with them."


Anyone who works in data analysis knows that any assumptions that you make about the formatting of the data that you receive are bound to be wrong. (Read: Assume the data came from a caveman, just to be safe.)

Handy Line

At a minimum, even if everything else is perfect (unlikely), the column names are probably not in the same order in every data set. So, rather than looking up the column number every time, I use the following line to store the number of the column of interest — in this case the “Chr” (chromosome) column — for later use throughout the script. It’s pretty basic, but super useful:

chr=$( sed 's/\r//g' "$DATAFILE" | head -n1 | sed 's/\t/\n/g' | sed '/^\s*$/d' | awk '$0 ~ /^Chr$/ {print NR}' )

Here's what's happening:

  1. sed "s/\r//g" $DATAFILE – strip out any weird Windows carriage returns
  2. head -n1 – look only at the first (header) line
  3. sed 's/\t/\n/g' – replace all tab characters with newlines
  4. sed '/^\s*$/d' – strip out any empty lines (perhaps there is an extra separator between two of the columns)
  5. awk '$0 ~/^Chr$/ {print NR}' – return the line number of the line that contains “Chr” exactly
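To see the pieces compose, here is a quick sanity check against a tiny made-up data file (the file name, column names, and values are all hypothetical, and the \t/\n escapes in sed assume GNU sed):

```shell
# A hypothetical tab-delimited file with Windows line endings,
# standing in for a real export:
printf 'SNP\tChr\tPosition\r\nrs123\t1\t10583\r\n' > test_data.txt

DATAFILE=test_data.txt
chr=$( sed 's/\r//g' "$DATAFILE" | head -n1 | sed 's/\t/\n/g' | sed '/^\s*$/d' | awk '$0 ~ /^Chr$/ {print NR}' )
echo "$chr"    # the Chr column number: 2

# The stored number can then drive the rest of the script,
# e.g. pulling out just the chromosome column:
sed 's/\r//g' "$DATAFILE" | tail -n +2 | awk -F'\t' -v c="$chr" '{print $c}'    # prints: 1
```

Because the column number lives in a variable, the same extraction line keeps working even if a future data dump shuffles the column order.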

Note: In the above example, I’ve assumed that the columns are tab-delimited, which is not always the case. If your columns are space-delimited, step 3 gets a slight alteration, and the line becomes:

chr=$( sed 's/\r//g' "$DATAFILE" | head -n1 | sed 's/ /\n/g' | sed '/^\s*$/d' | awk '$0 ~ /^Chr$/ {print NR}' )
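As an aside (my own variation, not part of the original one-liner): when columns are separated by runs of spaces rather than a single space each, tr -s squeezes each run down to one newline, which makes the empty-line cleanup step unnecessary:

```shell
# Hypothetical header row padded with multiple spaces between columns:
printf 'SNP   Chr   Position\n' > header_test.txt

# tr -s turns each run of spaces into exactly one newline:
chr=$( head -n1 header_test.txt | tr -s ' ' '\n' | awk '$0 == "Chr" {print NR}' )
echo "$chr"    # → 2
```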

Real World Application

I enjoy making my scripts as adaptive as possible, and I think I’ve succeeded pretty well with this one, which transfers data from a GenomeStudio “full data table” to PLINK format. In fact, one of my colleagues, Wei-Min Chen, who doesn’t usually gush, called it “perfect”. I’ve written a whole long post discussing my solutions to the various challenges of the task, but I’m not sure how many people will be interested… so I’ll keep it for another day.

Lazy Shell Shortcuts

Because of the lonely way I learned shell programming (repurposing and adapting code from others, with extensive Stack Overflow research), everything I know has to do with script writing. So, recently, while preparing to teach others about the shell for the first time, I had a chance to discuss command-line magic with a more experienced coder, Stephen Turner, and he told me about two basic shortcuts that I did not know.

1. Quickly Substitute Strings to Adapt Previous Commands

What if I got ahold of a file containing a list of all the names of every student that ever attended Hogwarts (one per line) and I wrote a long command to discover how many students had “Neville” somewhere in their name?

$ cat hogwarts.txt | grep "Neville" | wc -l

And now I want to run pretty much the same command, but with a small change — perhaps I want to use the Gryffindor specific list instead. Like this:

$ cat gryffindor.txt | grep "Neville" | wc -l

For years, I have used the up arrow and then scrolled to the start of the command in order to change it. But, no longer.
Instead, I can use the ^ caret (on the 6 key) to substitute gryffindor for hogwarts in the previous command, like so:

$ ^hogwarts^gryffindor
cat gryffindor.txt | grep "Neville" | wc -l

And Boom, the shell prints out the new command and starts running it!

2. Use History as More Than Just a Record of Your Commands

Did you know that history will bring up a numbered list of your command history? Well, I did. I did not know that you can use those numbers to reissue commands. How cool is that? Answer: So cool. Here’s how it works. First call up your history:

1 ls
2 cd ~
3 plink --out ravenclaws --keep ravenclaw.txt --bfile hogwarts --freq
4 cowsay "Be Lazy!"
5 cat ravenclaw.log
6 head ravenclaw.freq

Take note of the number beside the command you wish to rerun, in this case, obviously, the cowsay command. Now type an exclamation mark (!) followed by that number to reissue it:

$ !4
cowsay "Be Lazy!"


Those of you who already knew such magic, I congratulate you. Any of you who didn’t, I hope these serve you well!