Addicted to Public Health

A couple years ago I started becoming obsessed with the opioid epidemic. I spend a non-trivial amount of my time thinking about it and if I am ever scrambling for a topic in a social situation, it ends up being pretty much the only thing I can think of. (Because I am So Smooth.) As a public health devotee, the epidemic hits all of my passions. To name just a few: Issue exacerbated by outdated stigma? Check. Multiple demographics impacted? Big time. Heartbreaking narratives? You will never stop crying. Socioeconomic confounders? And how.

Recently, alarm over the epidemic has reached such a fever pitch that various agencies have started hosting opioid-crisis-focused datathons/codathons/hackathons. (Three different new words for essentially the same concept does seem a little excessive, I agree.)

Being a data scientist is pretty much the coolest thing, and the world seems to have caught on to this fact. This means that there is more competition to do the really interesting work, but it also means that there is a critical mass of data scientists who will be interested in sciencing together for a marathon period (24 hours straight at least) on particular topics in our personal time.  I love attending them. They are a great opportunity to build less-exercised skills through a sudden flood of experience hours. So, you can just imagine how I feel about getting access to new opioid epidemic data as part of a hackathon. To spell it out: Teamwork + Opioid Epidemic + Data + Hackathon = G.O.A.T.

I have participated in 2 Opioid Hackathons in the past 6 months, and I and some of my coworkers are planning one of our own, sponsored by our company. One of my kick-ass data science colleagues, Catherine Ordun, submitted our results for these events in an abstract to the International Society for Disease Surveillance (ISDS) Conference in Orlando this year, and we are presenting tomorrow! I’ll be back to post more about the topic (unless I take another 3 year hiatus, obvi.)

Screen Shot 2018-01-30 at 10.31.03 PM

Advertisements

What is a SNP?

I recently came across this video which I made during Software Carpentry Instructor Training last year… and it’s not terrible, so I thought I’d share it!

In other news, the map application, which I mentioned during my last post, is awesome! Total success. I can’t share it here, because the data isn’t yet public, but, trust me, it’s really really great. If you want to try with some of your own data, you can checkout the project GitHub site here: shiny-court-grapher.

The main difficulty of the project, figuring out how to incorporate maps and how to use geocoded data, is mostly behind us. ggplot is wonderous!

There’s Going to be an App for That!

This past year I got involved with Software Carpentry, a group which teaches basic research computing to scientists. I’ve helped Stephen Turner  teach a few RNA-Seq workshops through UVA’s BioConnector, and last March I taught my first independent-from-Stephen workshop with two other UVA instructors, Alex Koeppel and Zhuo Fu.

Teaching basic shell programming!

Teaching basic shell programming!

Look, up here at this line!

Look, up here at this line!

We had such a good time working together and with Bart Ragon and VP Nagraj from the UVA Health Sciences Library that we decided to keep meeting in this awesome HSL room every week.

Yes, that’s TWO screens for coding.

Doesn’t it look like the bridge of the Enterprise? Make it so.

At first we decided to keep meeting in order to review and debug each other’s code, but then I had a brainwave. Perhaps we could work on an independent project together?

We all have a basic familiarity with programming in R, so that seemed like the language of choice. At first I had envisioned some interactive modifications to LocusZoom — something along the lines of an app where you can learn more about particular SNPs by clicking or hovering over their representations on a graph. However, upon sober reflection, I realized that I am the only one in the main group that works in genomics, and the required amount of domain-specific background knowledge would be extremely high. Additionally, fiddling with such a feature rich program like LocusZoom might not make for a great starter project.

As part of my position at Public Health Sciences I work with both the Center for Public Health Genomics (CPHG) and the Institute of Law, Psychiatry and Public Policy (ILPPP). The specific project that I work on involves analyzing court data pertaining to mental health proceedings in the State of Virginia. It’s a very different domain from my other bioinformatics work, but it ends up being a perfect fit as a project to cut our teeth on app construction with R. The data tables are fairly straightforward, even if deeper understanding requires further domain knowledge. Also, there would be actual immediate public policy benefits to having interactive and layered representations of the data. (e.g. allowing lawmakers to see up-to-date graphs of commitment trends in their specific districts.)

So, from now on, the inter-departmental SWC project group (cool nickname pending), along with my colleague, Ashleigh Allen from the ILPPP, will be spending weekly meetings brainstorming/planning/building an app. We’re researching various R packages to help us toward our goal. Right now we are considering shiny.

I’ll keep both of my readers up to date on our project as it takes shape!

I am the Puzzle Master

I love the show Brooklyn 99. It’s really funny, and last week there was a puzzle. I really love puzzles.

Here’s the puzzle as Captain Holt describes it:
“There are 12 men on an island. 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier. You must figure out which… The island has no scales, but there is a see-saw. The exciting catch? You can only use it three times.”

In fact, you can watch him describe the puzzle here:

If you want to work on it yourself. Read no further, because I’m totally going to spoil it.

Amy’s Solution

One approach to this problem is described by the character Amy, before getting cut off:

“First you weigh six versus six –”

But, that’s clearly not it, because you will have spent one of your precious uses determining that Diffy (our nickname for the “special” islander) exists, without ruling anyone out!

Three Groups of Four

Unfortunately, this solution fails, because ultimately, you would need to know whether the mystery man, hereafter called Diffy, was heavier or lighter in order to be able to identify him at the end.

Four Equal Groups

I thought I had a really promising approach when I split the islanders into four groups of three, since, with directional information (i.e. the knowledge of whether he is heavier or lighter), you can force Diffy to reveal himself from within a group of three with just one weighing. Unfortunately, a friend pointed out that in one case, where Diffy was in the final group of three, I wouldn’t have the directional information I would need to pick him out in one move… so back to the drawing board.

Lessons

The final solution takes into account two things I learned from previous attempts:

  1. In a group of four, I can identify Diffy in two weighings.
    1. First, I set two islanders from the group against two known non-Diffys. If the see-saw tilts, I know that Diffy is one of these two. If the see-saw remains even, I know that Diffy is one of the other two.
    2. Now, I select one of the remaining two possible-Diffys and set him against a known non-Diffy. If the scale tilts, I have found Diffy. If the board remains even, I know that Diffy is that last remaining islander.
    3. Alternatively, if the see-saw tilts in Step A, and you want to know whether DIffy is heavy or light, you could note the direction from Step A and put the two remaining possible-Diffys on the scale opposite one another. If the see-saw tilts in the same direction as Step A, then Diffy is the one still on the same side as he was during Step A. Otherwise, if the orientation of the see-saw changes, Diffy is on the other side.
  2. In a group of three I can identify Diffy in one weighing, as long as I have directional information. I will describe this in further detail under Use #3.

SOLUTION

All of the Islanders

Because of lesson #1, I can split off four islanders before checking the rest. If Diffy is in that group of four, the first weighing will come out even, and I can now identify him from among those four with my two remaining moves. If Diffy is not in that group of four, I now have four islanders whom I can rule out and also use to tare my see-saw.

So, for my first use of the see-saw, I weigh the eight remaining islanders against each other with four on each side.

Use #1

Teeter Totter Use #1

I’ve already outlined my plan if this first see-saw usage turns out even, so what’s next if it turns out odd? This is where the genius comes in.

I now have some “directional information.” I will henceforth call whichever direction the see-saw tilted in Use 1 “Direction 1” or “D1” for short. I know that if Diffy is heavy, he is on the part of the see-saw that went down, and if if Diffy is light, he’s on the part of the see-saw that went up. If I move Diffy, the see-saw will change orientation! It has no choice because Diffy, and only Diffy, causes the see-saw to tilt. Also, remember lesson #2, I have directional information and one move after the current one, so I can totally take out three possible Diffys before the next use of the see-saw. I’ll need to use one of the islanders I ruled out in Use 1 in order to keep three islanders on each side.

Use #2

See-Saw Use #2

If Use #2 gives us an even see-saw we can find Diffy in the three we removed, but if it doesn’t, we need to pay attention to the direction that the see-saw moves. Did it move the same way as before, Direction 1, or did it change orientation to Direction 2? Our next choice will be based on the answer! If it moved in Direction 1, then we know that Diffy is not one of the islanders who switched sides for Use #2. If the see-saw moved in Direction 2, then Diffy is one of the side-switchers. Either way, we have got him down to being one of three or two. Use #3 is a little hard to generalize since it is different for each possibility.

Use #3

I would like it known that wordpress deleted the entirety of my description of Use #3, so any failings in this attempt to reproduce it are completely wordpress’s fault.

In the case where I have a group of three possible-Diffy islanders, two of those islanders were on the same side during Use #1, when the see-saw moved into D1. If I put one of these islanders on each side of the see-saw and the see-saw again moves into D1, then we know that Diffy is the islander on the original side. If the see-saw moves into D2, then we know that Diffy is on the opposite side of the see-saw. If the see-saw remains even, we know that Diffy is the third member of the group.

All Mapped Out

Weight-Obsessed Island Puzzle Solution

In conclusion,

NATURE GENETICS

Big News!

I’ve been horribly neglecting this blog as well as ThisScienceLife, and that’s totally going to change. In the meantime, I wanted to tell all three of you that I’m an author on a paper in Nature Genetics, published today.

Yes, that is what I said.

I *SAID* ‘Nature Genetics’.

If you would like to read it or gaze adoringly at my name, it’s called Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers, and it’s right here.

Dance! Dance for me Tiny Russian!

This is how we are over here right now.

It’s 10 o’clock — Do you know where your columns are?

I’ve made a few updates to the post about storing column locations, based on some input from a colleague.

If Only Life Had A Command-Line

"All things change, and we change with them." “All things change, and we change with them.”

Anyone who works in data analysis knows that any assumptions that you make about the formatting of the data that you receive are bound to be wrong. (Read: Assume the data came from a caveman, just to be safe.)

Handy Line

At a minimum, even if everything else is perfect (unlikely), the column names are probably not in the same order in every data set. So, rather than looking up the column number every time, I use the following line to store the number of the column of interest — in this case the “Chr” (chromosome) column — for later use throughout the script. It’s pretty basic, but super useful:

Here’s what’s happening:

  1. sed "s/r//g" $DATAFILE – strip out any weird Windows carriage returns
  2. head -n1 – look only at the first (header) line
  3. sed 's/t/n/g' – replace all tab characters with…

View original post 155 more words

It’s 10 o’clock — Do you know where your columns are?

"All things change, and we change with them."

“All things change, and we change with them.”

 

Anyone who works in data analysis knows that any assumptions that you make about the formatting of the data that you receive are bound to be wrong. (Read: Assume the data came from a caveman, just to be safe.)

Handy Line

At a minimum, even if everything else is perfect (unlikely), the column names are probably not in the same order in every data set. So, rather than looking up the column number every time, I use the following line to store the number of the column of interest — in this case the “Chr” (chromosome) column — for later use throughout the script. It’s pretty basic, but super useful:

chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/\t/\n/g'| sed '/^\s*$/d' | awk '$0 ~/^Chr$/ {print NR}')

Here's what's happening:

  1. sed "s/\r//g" $DATAFILE – strip out any weird Windows carriage returns
  2. head -n1 – look only at the first (header) line
  3. sed 's/\t/\n/g' – replace all tab characters with newlines
  4. sed '/^\s*$/d' – strip out any empty lines (perhaps there is an extra separarator between two of the columns)
  5. awk '$0 ~/^Chr$/ {print NR}' – return the line number of the line that contains “Chr” exactly

Note: In the above example, I’ve assumed that the columns are tab-delimited, which is not always the case. In case your columns are space-delimited, #3 receives a slight alteration, so the line becomes:

chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/ /\n/g'| sed '/^\s*$/d'| awk '$0 ~/^Chr$/ {print NR}')

Real World Application

I enjoy making my scripts as adaptive as possible, and I think I’ve succeeded pretty well with this one here which transfers data from a Genome Studio “full data table” to PLiNK format. In fact, one of my colleagues, Wei-Min Chen, who doesn’t usually gush, called it “perfect”. I’ve written a whole long post discussing my solutions to the various challenges of the task, but, I’m not sure how many people will be interested… so I’ll keep it for another day.