What is a SNP?

I recently came across this video which I made during Software Carpentry Instructor Training last year… and it’s not terrible, so I thought I’d share it!

In other news, the map application, which I mentioned during my last post, is awesome! Total success. I can’t share it here, because the data isn’t yet public, but, trust me, it’s really really great. If you want to try with some of your own data, you can checkout the project GitHub site here: shiny-court-grapher.

The main difficulty of the project, figuring out how to incorporate maps and how to use geocoded data, is mostly behind us. ggplot is wonderous!

There’s Going to be an App for That!

This past year I got involved with Software Carpentry, a group which teaches basic research computing to scientists. I’ve helped Stephen Turner  teach a few RNA-Seq workshops through UVA’s BioConnector, and last March I taught my first independent-from-Stephen workshop with two other UVA instructors, Alex Koeppel and Zhuo Fu.

Teaching basic shell programming!

Teaching basic shell programming!

Look, up here at this line!

Look, up here at this line!

We had such a good time working together and with Bart Ragon and VP Nagraj from the UVA Health Sciences Library that we decided to keep meeting in this awesome HSL room every week.

Yes, that’s TWO screens for coding.

Doesn’t it look like the bridge of the Enterprise? Make it so.

At first we decided to keep meeting in order to review and debug each other’s code, but then I had a brainwave. Perhaps we could work on an independent project together?

We all have a basic familiarity with programming in R, so that seemed like the language of choice. At first I had envisioned some interactive modifications to LocusZoom — something along the lines of an app where you can learn more about particular SNPs by clicking or hovering over their representations on a graph. However, upon sober reflection, I realized that I am the only one in the main group that works in genomics, and the required amount of domain-specific background knowledge would be extremely high. Additionally, fiddling with such a feature rich program like LocusZoom might not make for a great starter project.

As part of my position at Public Health Sciences I work with both the Center for Public Health Genomics (CPHG) and the Institute of Law, Psychiatry and Public Policy (ILPPP). The specific project that I work on involves analyzing court data pertaining to mental health proceedings in the State of Virginia. It’s a very different domain from my other bioinformatics work, but it ends up being a perfect fit as a project to cut our teeth on app construction with R. The data tables are fairly straightforward, even if deeper understanding requires further domain knowledge. Also, there would be actual immediate public policy benefits to having interactive and layered representations of the data. (e.g. allowing lawmakers to see up-to-date graphs of commitment trends in their specific districts.)

So, from now on, the inter-departmental SWC project group (cool nickname pending), along with my colleague, Ashleigh Allen from the ILPPP, will be spending weekly meetings brainstorming/planning/building an app. We’re researching various R packages to help us toward our goal. Right now we are considering shiny.

I’ll keep both of my readers up to date on our project as it takes shape!

seesaw2

I am the Puzzle Master

I love the show Brooklyn 99. It’s really funny, and last week there was a puzzle. I really love puzzles.

Here’s the puzzle as Captain Holt describes it:
“There are 12 men on an island. 11 weigh exactly the same amount, but 1 of them is slightly lighter or heavier. You must figure out which… The island has no scales, but there is a see-saw. The exciting catch? You can only use it three times.”

In fact, you can watch him describe the puzzle here:

If you want to work on it yourself. Read no further, because I’m totally going to spoil it.

Amy’s Solution

One approach to this problem is described by the character Amy, before getting cut off:

“First you weigh six versus six –”

But, that’s clearly not it, because you will have spent one of your precious uses determining that Diffy (our nickname for the “special” islander) exists, without ruling anyone out!

Three Groups of Four

Unfortunately, this solution fails, because ultimately, you would need to know whether the mystery man, hereafter called Diffy, was heavier or lighter in order to be able to identify him at the end.

Four Equal Groups

I thought I had a really promising approach when I split the islanders into four groups of three, since, with directional information (i.e. the knowledge of whether he is heavier or lighter), you can force Diffy to reveal himself from within a group of three with just one weighing. Unfortunately, a friend pointed out that in one case, where Diffy was in the final group of three, I wouldn’t have the directional information I would need to pick him out in one move… so back to the drawing board.

Lessons

The final solution takes into account two things I learned from previous attempts:

  1. In a group of four, I can identify Diffy in two weighings.
    1. First, I set two islanders from the group against two known non-Diffys. If the see-saw tilts, I know that Diffy is one of these two. If the see-saw remains even, I know that Diffy is one of the other two.
    2. Now, I select one of the remaining two possible-Diffys and set him against a known non-Diffy. If the scale tilts, I have found Diffy. If the board remains even, I know that Diffy is that last remaining islander.
    3. Alternatively, if the see-saw tilts in Step A, and you want to know whether DIffy is heavy or light, you could note the direction from Step A and put the two remaining possible-Diffys on the scale opposite one another. If the see-saw tilts in the same direction as Step A, then Diffy is the one still on the same side as he was during Step A. Otherwise, if the orientation of the see-saw changes, Diffy is on the other side.
  2. In a group of three I can identify Diffy in one weighing, as long as I have directional information. I will describe this in further detail under Use #3.

SOLUTION

All of the Islanders

Because of lesson #1, I can split off four islanders before checking the rest. If Diffy is in that group of four, the first weighing will come out even, and I can now identify him from among those four with my two remaining moves. If Diffy is not in that group of four, I now have four islanders whom I can rule out and also use to tare my see-saw.

So, for my first use of the see-saw, I weigh the eight remaining islanders against each other with four on each side.

Use #1

Teeter Totter Use #1

I’ve already outlined my plan if this first see-saw usage turns out even, so what’s next if it turns out odd? This is where the genius comes in.

I now have some “directional information.” I will henceforth call whichever direction the see-saw tilted in Use 1 “Direction 1” or “D1” for short. I know that if Diffy is heavy, he is on the part of the see-saw that went down, and if if Diffy is light, he’s on the part of the see-saw that went up. If I move Diffy, the see-saw will change orientation! It has no choice because Diffy, and only Diffy, causes the see-saw to tilt. Also, remember lesson #2, I have directional information and one move after the current one, so I can totally take out three possible Diffys before the next use of the see-saw. I’ll need to use one of the islanders I ruled out in Use 1 in order to keep three islanders on each side.

Use #2

See-Saw Use #2

If Use #2 gives us an even see-saw we can find Diffy in the three we removed, but if it doesn’t, we need to pay attention to the direction that the see-saw moves. Did it move the same way as before, Direction 1, or did it change orientation to Direction 2? Our next choice will be based on the answer! If it moved in Direction 1, then we know that Diffy is not one of the islanders who switched sides for Use #2. If the see-saw moved in Direction 2, then Diffy is one of the side-switchers. Either way, we have got him down to being one of three or two. Use #3 is a little hard to generalize since it is different for each possibility.

Use #3

I would like it known that wordpress deleted the entirety of my description of Use #3, so any failings in this attempt to reproduce it are completely wordpress’s fault.

In the case where I have a group of three possible-Diffy islanders, two of those islanders were on the same side during Use #1, when the see-saw moved into D1. If I put one of these islanders on each side of the see-saw and the see-saw again moves into D1, then we know that Diffy is the islander on the original side. If the see-saw moves into D2, then we know that Diffy is on the opposite side of the see-saw. If the see-saw remains even, we know that Diffy is the third member of the group.

All Mapped Out

Weight-Obsessed Island Puzzle Solution

In conclusion,

NATURE GENETICS

Big News!

I’ve been horribly neglecting this blog as well as ThisScienceLife, and that’s totally going to change. In the meantime, I wanted to tell all three of you that I’m an author on a paper in Nature Genetics, published today.

Yes, that is what I said.

I *SAID* ‘Nature Genetics’.

If you would like to read it or gaze adoringly at my name, it’s called Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers, and it’s right here.

Dance! Dance for me Tiny Russian!

This is how we are over here right now.

Featured Image -- 89

It’s 10 o’clock — Do you know where your columns are?

I’ve made a few updates to the post about storing column locations, based on some input from a colleague.

If Only Life Had A Command-Line

"All things change, and we change with them." “All things change, and we change with them.”

Anyone who works in data analysis knows that any assumptions that you make about the formatting of the data that you receive are bound to be wrong. (Read: Assume the data came from a caveman, just to be safe.)

Handy Line

At a minimum, even if everything else is perfect (unlikely), the column names are probably not in the same order in every data set. So, rather than looking up the column number every time, I use the following line to store the number of the column of interest — in this case the “Chr” (chromosome) column — for later use throughout the script. It’s pretty basic, but super useful:

Here’s what’s happening:

  1. sed "s/r//g" $DATAFILE – strip out any weird Windows carriage returns
  2. head -n1 – look only at the first (header) line
  3. sed 's/t/n/g' – replace all tab characters with…

View original post 155 more words

"All things change, and we change with them."

It’s 10 o’clock — Do you know where your columns are?

"All things change, and we change with them."

“All things change, and we change with them.”

 

Anyone who works in data analysis knows that any assumptions that you make about the formatting of the data that you receive are bound to be wrong. (Read: Assume the data came from a caveman, just to be safe.)

Handy Line

At a minimum, even if everything else is perfect (unlikely), the column names are probably not in the same order in every data set. So, rather than looking up the column number every time, I use the following line to store the number of the column of interest — in this case the “Chr” (chromosome) column — for later use throughout the script. It’s pretty basic, but super useful:

chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/\t/\n/g'| sed '/^\s*$/d' | awk '$0 ~/^Chr$/ {print NR}')

Here's what's happening:

  1. sed "s/\r//g" $DATAFILE – strip out any weird Windows carriage returns
  2. head -n1 – look only at the first (header) line
  3. sed 's/\t/\n/g' – replace all tab characters with newlines
  4. sed '/^\s*$/d' – strip out any empty lines (perhaps there is an extra separarator between two of the columns)
  5. awk '$0 ~/^Chr$/ {print NR}' – return the line number of the line that contains “Chr” exactly

Note: In the above example, I’ve assumed that the columns are tab-delimited, which is not always the case. In case your columns are space-delimited, #3 receives a slight alteration, so the line becomes:

chr=$( sed "s/\r//g" $DATAFILE | head -n1| sed 's/ /\n/g'| sed '/^\s*$/d'| awk '$0 ~/^Chr$/ {print NR}')

Real World Application

I enjoy making my scripts as adaptive as possible, and I think I’ve succeeded pretty well with this one here which transfers data from a Genome Studio “full data table” to PLiNK format. In fact, one of my colleagues, Wei-Min Chen, who doesn’t usually gush, called it “perfect”. I’ve written a whole long post discussing my solutions to the various challenges of the task, but, I’m not sure how many people will be interested… so I’ll keep it for another day.

cowsay_belazy

Lazy Shell Shortcuts

Because of the lonely way I learned shell programming (re-purposing and adapting code from others with extensive stack-overflow research), everything I know has to do with script writing. So, recently, while preparing for my first time teaching others about the shell, I had a chance to discuss command-line magic with a more experienced coder, Stephen Turner, and he told me about two basic shortcuts that I did not know.

1. Quickly Subsitute Strings to Adapt Previous Commands

What if I got ahold of a file containing a list of all the names of every student that ever attended Hogwarts (one per line) and I wrote a long command to discover how many students had “Neville” somewhere in their name?

$cat hogwarts.txt | grep "Neville"| wc -l

And now I want to run pretty much the same command, but with a small change — perhaps I want to use the Gryffindor specific list instead. Like this:

$cat gryffindor.txt | grep "Neville"| wc -l

For years, I have used the up arrow and then scrolled to the start of the command in order to change it. But, no longer.
Instead, I can use the ^ caret (on the 6 key) to substitute gryffindor for hogwarts in the previous command, like so:

$^hogwarts^gryffindor
cat gryffindor.txt | grep "Neville"| wc -l

And Boom, the shell prints out the new command and starts running it!

2. Use History as More Than Just a Record of Your Commands

Did you know that history will bring up a numbered list of your command history? Well, I did. I did not know that you can use those numbers to reissue commands. How cool is that? Answer: So cool. Here’s how it works. First call up your history:

$history
1 ls
2 cd ~
3 plink --out ravenclaws --keep ravenclaw.txt --bfile hogwarts --freq
4 cowsay "Be Lazy!"
5 cat ravenclaw.log
6 head ravenclaw.freq

Take note of the number beside the command you wish to rerun, in this case, obviously, the cowsay command. Now, use an ! exclamation mark and that number to reissue that command:

$!4
cowsay "Be Lazy!"

cowsay_belazy

Those of you that already knew such magic, I congratulate you. Any of you that didn’t, I hope these serve you well!