Addicted to Public Health

A couple years ago I started becoming obsessed with the opioid epidemic. I spend a non-trivial amount of my time thinking about it and if I am ever scrambling for a topic in a social situation, it ends up being pretty much the only thing I can think of. (Because I am So Smooth.) As a public health devotee, the epidemic hits all of my passions. To name just a few: Issue exacerbated by outdated stigma? Check. Multiple demographics impacted? Big time. Heartbreaking narratives? You will never stop crying. Socioeconomic confounders? And how.

Recently, alarm over the epidemic has reached such a fever pitch that various agencies have started hosting opioid-crisis-focused datathons/codathons/hackathons. (Three different new words for essentially the same concept does seem a little excessive, I agree.)

Being a data scientist is pretty much the coolest thing, and the world seems to have caught on to this fact. This means that there is more competition to do the really interesting work, but it also means that there is a critical mass of data scientists who will be interested in sciencing together for a marathon period (24 hours straight at least) on particular topics in our personal time.  I love attending them. They are a great opportunity to build less-exercised skills through a sudden flood of experience hours. So, you can just imagine how I feel about getting access to new opioid epidemic data as part of a hackathon. To spell it out: Teamwork + Opioid Epidemic + Data + Hackathon = G.O.A.T.

I have participated in 2 Opioid Hackathons in the past 6 months, and I and some of my coworkers are planning one of our own, sponsored by our company. One of my kick-ass data science colleagues, Catherine Ordun, submitted our results for these events in an abstract to the International Society for Disease Surveillance (ISDS) Conference in Orlando this year, and we are presenting tomorrow! I’ll be back to post more about the topic (unless I take another 3 year hiatus, obvi.)

Screen Shot 2018-01-30 at 10.31.03 PM


I am the Puzzle Master

I love the show Brooklyn 99. It’s really funny, and last week there was a puzzle. I really love puzzles.

Here’s the puzzle as Captain Holt describes it:
“There are 12 men on an island. 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier. You must figure out which… The island has no scales, but there is a see-saw. The exciting catch? You can only use it three times.”

In fact, you can watch him describe the puzzle here:

If you want to work on it yourself. Read no further, because I’m totally going to spoil it.

Amy’s Solution

One approach to this problem is described by the character Amy, before getting cut off:

“First you weigh six versus six –”

But, that’s clearly not it, because you will have spent one of your precious uses determining that Diffy (our nickname for the “special” islander) exists, without ruling anyone out!

Three Groups of Four

Unfortunately, this solution fails, because ultimately, you would need to know whether the mystery man, hereafter called Diffy, was heavier or lighter in order to be able to identify him at the end.

Four Equal Groups

I thought I had a really promising approach when I split the islanders into four groups of three, since, with directional information (i.e. the knowledge of whether he is heavier or lighter), you can force Diffy to reveal himself from within a group of three with just one weighing. Unfortunately, a friend pointed out that in one case, where Diffy was in the final group of three, I wouldn’t have the directional information I would need to pick him out in one move… so back to the drawing board.


The final solution takes into account two things I learned from previous attempts:

  1. In a group of four, I can identify Diffy in two weighings.
    1. First, I set two islanders from the group against two known non-Diffys. If the see-saw tilts, I know that Diffy is one of these two. If the see-saw remains even, I know that Diffy is one of the other two.
    2. Now, I select one of the remaining two possible-Diffys and set him against a known non-Diffy. If the scale tilts, I have found Diffy. If the board remains even, I know that Diffy is that last remaining islander.
    3. Alternatively, if the see-saw tilts in Step A, and you want to know whether DIffy is heavy or light, you could note the direction from Step A and put the two remaining possible-Diffys on the scale opposite one another. If the see-saw tilts in the same direction as Step A, then Diffy is the one still on the same side as he was during Step A. Otherwise, if the orientation of the see-saw changes, Diffy is on the other side.
  2. In a group of three I can identify Diffy in one weighing, as long as I have directional information. I will describe this in further detail under Use #3.


All of the Islanders

Because of lesson #1, I can split off four islanders before checking the rest. If Diffy is in that group of four, the first weighing will come out even, and I can now identify him from among those four with my two remaining moves. If Diffy is not in that group of four, I now have four islanders whom I can rule out and also use to tare my see-saw.

So, for my first use of the see-saw, I weigh the eight remaining islanders against each other with four on each side.

Use #1

Teeter Totter Use #1

I’ve already outlined my plan if this first see-saw usage turns out even, so what’s next if it turns out odd? This is where the genius comes in.

I now have some “directional information.” I will henceforth call whichever direction the see-saw tilted in Use 1 “Direction 1” or “D1” for short. I know that if Diffy is heavy, he is on the part of the see-saw that went down, and if if Diffy is light, he’s on the part of the see-saw that went up. If I move Diffy, the see-saw will change orientation! It has no choice because Diffy, and only Diffy, causes the see-saw to tilt. Also, remember lesson #2, I have directional information and one move after the current one, so I can totally take out three possible Diffys before the next use of the see-saw. I’ll need to use one of the islanders I ruled out in Use 1 in order to keep three islanders on each side.

Use #2

See-Saw Use #2

If Use #2 gives us an even see-saw we can find Diffy in the three we removed, but if it doesn’t, we need to pay attention to the direction that the see-saw moves. Did it move the same way as before, Direction 1, or did it change orientation to Direction 2? Our next choice will be based on the answer! If it moved in Direction 1, then we know that Diffy is not one of the islanders who switched sides for Use #2. If the see-saw moved in Direction 2, then Diffy is one of the side-switchers. Either way, we have got him down to being one of three or two. Use #3 is a little hard to generalize since it is different for each possibility.

Use #3

I would like it known that wordpress deleted the entirety of my description of Use #3, so any failings in this attempt to reproduce it are completely wordpress’s fault.

In the case where I have a group of three possible-Diffy islanders, two of those islanders were on the same side during Use #1, when the see-saw moved into D1. If I put one of these islanders on each side of the see-saw and the see-saw again moves into D1, then we know that Diffy is the islander on the original side. If the see-saw moves into D2, then we know that Diffy is on the opposite side of the see-saw. If the see-saw remains even, we know that Diffy is the third member of the group.

All Mapped Out

Weight-Obsessed Island Puzzle Solution

In conclusion,

It’s 10 o’clock — Do you know where your columns are?

"All things change, and we change with them."

“All things change, and we change with them.”


Anyone who works in data analysis knows that any assumptions that you make about the formatting of the data that you receive are bound to be wrong. (Read: Assume the data came from a caveman, just to be safe.)

Handy Line

At a minimum, even if everything else is perfect (unlikely), the column names are probably not in the same order in every data set. So, rather than looking up the column number every time, I use the following line to store the number of the column of interest — in this case the “Chr” (chromosome) column — for later use throughout the script. It’s pretty basic, but super useful:

chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/\t/\n/g'| sed '/^\s*$/d' | awk '$0 ~/^Chr$/ {print NR}')

Here's what's happening:

  1. sed "s/\r//g" $DATAFILE – strip out any weird Windows carriage returns
  2. head -n1 – look only at the first (header) line
  3. sed 's/\t/\n/g' – replace all tab characters with newlines
  4. sed '/^\s*$/d' – strip out any empty lines (perhaps there is an extra separarator between two of the columns)
  5. awk '$0 ~/^Chr$/ {print NR}' – return the line number of the line that contains “Chr” exactly

Note: In the above example, I’ve assumed that the columns are tab-delimited, which is not always the case. In case your columns are space-delimited, #3 receives a slight alteration, so the line becomes:

chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/ /\n/g'| sed '/^\s*$/d'| awk '$0 ~/^Chr$/ {print NR}')

Real World Application

I enjoy making my scripts as adaptive as possible, and I think I’ve succeeded pretty well with this one here which transfers data from a Genome Studio “full data table” to PLiNK format. In fact, one of my colleagues, Wei-Min Chen, who doesn’t usually gush, called it “perfect”. I’ve written a whole long post discussing my solutions to the various challenges of the task, but, I’m not sure how many people will be interested… so I’ll keep it for another day.