The first post in this series begins with deconstructing two influential papers with respect to statistics, bioinformatics and organizational behavior. Why? Because these papers provide an awesome learning opportunity to avoid many of the ills of academic bioscience, as observed by Yours Truly.
In these posts I’ll be expounding on how to AVOID MESSING UP when analyzing data in ways so bad that your career becomes deeply compromised, your PI hates you, your university wishes it had never heard of you, and future funding becomes questionable at best. So, below is the first installment. Comments welcome!
The bad JUJU of Batch Effects
Recently, Gilad and Mizrahi-Man (“G&M” from here on) published a fascinating re-analysis [1] of the data underlying the “surprising” findings from Lin et al. [2], researchers in the Snyder Lab at Stanford University. The “surprise” was the finding that gene expression across species appeared to be more heavily influenced by the species than by the source tissue. As stated in the Lin paper:
“Overall, our results indicate that for the human–mouse comparison, tissues appear more similar to one another within the same species than to the comparable organs of other species when examining a more complete set of tissue types.”
Now, G&M used the word “surprising” because several earlier studies had reported the opposite finding: gene expression tends to cluster by tissue, not species. Lin et al., in contrast, claim that the influence of the species is greater than that of the tissue, which is weird. Given that the Lin paper was published under the aegis of the massive ENCODE project (~$400M [3]), such a finding was sure to be influential and receive considerable attention.
And attention it got: G&M decided to dig into Lin et al.’s data to rule out alternative explanations for the finding. It is one of the marvels of the age that one can now increasingly obtain raw data and reverse engineer a paper’s analysis, though this still happens much too rarely (more on this topic later).
What G&M found was a whole goat rodeo of problems, many of a highly sophomoric nature, ranging from basic statistical errors to poor programming. Consequently, and at a minimum, the original findings of Lin et al. must now be substantially discounted until further validation.
Working from the original FASTQ sequencing files generated by the RNA-Seq process, G&M were able to reconstruct the sequencing study design from sequence identifiers present in those files (thank God for digital biology)*. This allowed them to determine when the samples were sequenced, on which machine, and in which lanes, yielding the sample allocation design depicted in Figure 1. This is reproduced from the G&M paper and shows that samples were sequenced in multiple batches, BUT that only one batch (out of five) included samples from BOTH species.
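To make that reconstruction step concrete, here is a minimal sketch of how one might pull the instrument, run, flow cell, and lane out of FASTQ read headers and then check which flow cells contain only one species. It assumes the Casava 1.8+ style Illumina header layout (`@instrument:run:flowcell:lane:tile:x:y ...`); older files use a different layout, and I don't know which one the Lin et al. files used. The headers and sample names below are made up for illustration and are not taken from either paper.

```python
from collections import defaultdict

def parse_illumina_header(header):
    """Extract instrument, run, flow cell and lane from a Casava 1.8+ style header."""
    fields = header.lstrip("@").split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unrecognized header layout: {header!r}")
    instrument, run, flowcell, lane = fields[:4]
    return {"instrument": instrument, "run": int(run),
            "flowcell": flowcell, "lane": int(lane)}

# Hypothetical first-read headers, one per sample (all identifiers invented).
first_reads = {
    ("human", "heart"): "@MACHINE_A:101:FC0001:1:1101:1000:2000 1:N:0:ATCACG",
    ("human", "liver"): "@MACHINE_A:101:FC0001:2:1101:1000:2000 1:N:0:CGATGT",
    ("mouse", "heart"): "@MACHINE_B:55:FC0002:1:1101:1000:2000 1:N:0:ATCACG",
    ("mouse", "liver"): "@MACHINE_B:55:FC0002:2:1101:1000:2000 1:N:0:CGATGT",
    ("human", "kidney"): "@MACHINE_A:102:FC0003:1:1101:1000:2000 1:N:0:ATCACG",
    ("mouse", "kidney"): "@MACHINE_A:102:FC0003:1:1101:1000:2000 1:N:0:CGATGT",
}

def species_per_flowcell(reads):
    """Map each flow cell to the set of species sequenced on it."""
    by_flowcell = defaultdict(set)
    for (species, _tissue), header in reads.items():
        flowcell = parse_illumina_header(header)["flowcell"]
        by_flowcell[flowcell].add(species)
    return dict(by_flowcell)

design = species_per_flowcell(first_reads)
# A flow cell holding a single species cannot separate "species" from "batch".
confounded = sorted(fc for fc, species in design.items() if len(species) == 1)

for flowcell in sorted(design):
    print(flowcell, sorted(design[flowcell]))
print("single-species flow cells:", confounded)
```

Any flow cell that shows up in the single-species list is a place where a species difference and a batch difference are indistinguishable.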
At this point, you should start getting nervous, because this design fails to control for one of the most basic sources of confounding influence, namely, batch effects. In statistics, a confounder is an extraneous variable in a statistical model (e.g., a linear regression) that correlates with both the dependent variable (here, gene expression) and the independent variable (here, the samples from different tissues and species).
Figure 1: Original sequencing plan as inferred by Gilad and Mizrahi-Man, 2015 [1]. Only one flow cell is used to sequence a combination of human and mouse samples. Such a design creates well-known, and highly confounding, batch effects. I’ve added the red rectangle to highlight the different instruments.
A more rigorous definition is provided in a seminal review by Leek et al. entitled “Tackling the widespread and critical impact of batch effects in high-throughput data” [4] (sound relevant?). Published five years ago, and with 420 citations as of this writing according to Google Scholar (read: VERY INFLUENTIAL), it states that “Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study.” In other words, part of the observed variation in a signal is not due to the natural phenomenon under investigation, but rather to how the experiment was designed and run.
Now, batch effects can creep in even with mindful experiment design, but in the Lin paper, there was no attempt to minimize biases resulting from the non-random grouping of samples. A correct design would have involved randomly assigning samples to lanes and sequencers to minimize biases that can result from how a sequencer or lane performs from one day to the next. Such biases are not restricted to sequencers per se, but include anything that pertains to generating the data (the operator, temperature in the room, day of the week, etc.). The general principle here: RANDOMIZATION IS YOUR FRIEND. Always remember that.
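What might that look like in practice? Below is a minimal sketch of a stratified randomization: samples are shuffled within each species and then dealt round-robin across lanes, so every lane ends up with a mix of species. The function name, sample names, lane count, and seed are all hypothetical, and this is one simple scheme among many, not the design G&M or Leek et al. prescribe. It guarantees mixed lanes only when each species has at least as many samples as there are lanes.

```python
import random

def randomize_to_lanes(samples, n_lanes, seed=0):
    """Shuffle samples within each species, then deal them round-robin to lanes.

    `samples` is a list of (sample_id, species) tuples. The seed is fixed so
    the allocation can be written down and reproduced later.
    """
    rng = random.Random(seed)
    by_species = {}
    for sample_id, species in samples:
        by_species.setdefault(species, []).append(sample_id)

    lanes = [[] for _ in range(n_lanes)]
    slot = 0
    for species in sorted(by_species):
        ids = by_species[species]
        rng.shuffle(ids)  # random order within the species
        for sample_id in ids:
            lanes[slot % n_lanes].append((sample_id, species))
            slot += 1
    return lanes

# Hypothetical study: 8 human and 8 mouse samples spread over 4 lanes.
samples = [(f"{sp}_{i}", sp) for sp in ("human", "mouse") for i in range(8)]
lanes = randomize_to_lanes(samples, n_lanes=4, seed=42)
for number, lane in enumerate(lanes, start=1):
    print(f"lane {number}:", lane)
```

I went with stratification rather than one plain shuffle because, with only a handful of samples, a plain shuffle can still hand you a single-species lane by bad luck. Whatever scheme you use, write the seed and the resulting allocation down where the whole team can find them.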
The funny thing is, batch effects are not a new problem, and they aren’t specific to high throughput instruments, as Leek et al. point out:
“Although batch effects are difficult or impossible to detect in low-dimensional assays, high throughput technologies provide enough data to detect and even remove them. However, if not properly dealt with, these effects can have a particularly strong and pervasive impact. Specific examples have been documented in published studies in which the biological variables were extremely correlated with technical variables, which subsequently led to serious concerns about the validity of the biological conclusions”. (emphasis added)
I include the bit about published experiments failing to correct for batch effects because such a failure definitely seems to be at play in the Lin paper. An interesting side bar here is that if your data becomes publicly available (as is increasingly the norm), AND there is enough of it (e.g., you’re using a high throughput instrument), any random researcher will have a good chance of determining whether you have messed up. Scary, huh? Let that provide motivation for doing it right, ’cause your sins may be plastered for all to see. More on that, below.
So, back to the re-analysis: G&M corrected for the confounded sequencing batches by using the ‘ComBat’ function from the sva package [5], part of Bioconductor (a suite of R packages for biomedical data analysis). This function was used to condition the data for inspection using Principal Component Analysis (PCA). As further explained by G&M:
“To visualize the data, we used the function ‘prcomp’ (with the ‘scale’ and ‘center’ options set to TRUE) to perform principal component analysis (PCA) of the transposed log-transformed matrix of ‘clean’ values (after removal of invariant columns, i.e. genes), and the ggplot2 package to generate scatter plots of the PCA results. None of the first five principal components (accounting together for 56% of the variability in the data) support the clustering of the gene expression data by species”. (emphasis added)
In short, a re-analysis that accounted for the original sample allocation showed that more than half the variance observed was not due to the species, thus dramatically weakening the purported “species over tissue” effect. It’s worth noting that PCA, a widely used technique, was used by Lin et al. themselves in their paper.
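To get a feel for why this matters, here is a small pure-Python sketch of the PCA step on simulated data. It is not G&M's code and it does not implement ComBat; it only mimics the `prcomp` recipe quoted above (drop invariant genes, center, scale, project) using power iteration for the first component. The simulated data are an assumption of mine: log-scale expression with a tissue effect and a batch effect that is perfectly confounded with species, and NO true species effect at all. All function names and effect sizes are hypothetical.

```python
import math
import random

def standardize_columns(matrix):
    """Drop invariant columns (genes), then center and scale the rest."""
    n = len(matrix)
    kept = []
    for col in zip(*matrix):
        if max(col) == min(col):
            continue  # invariant gene: no variance to scale by
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        kept.append([(x - mean) / sd for x in col])
    return [list(row) for row in zip(*kept)]  # back to samples-by-genes

def first_pc_scores(matrix, n_iter=200, seed=0):
    """Return PC1 scores (sign is arbitrary) and the fraction of variance explained."""
    z = standardize_columns(matrix)
    n = len(z)
    # Sample-by-sample Gram matrix; its top eigenvector gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(z[i], z[j])) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):  # power iteration
        v = [sum(gram[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    eigenvalue = sum(u[i] * sum(gram[i][j] * u[j] for j in range(n))
                     for i in range(n))
    total = sum(gram[i][i] for i in range(n))
    return [math.sqrt(eigenvalue) * x for x in u], eigenvalue / total

def simulate(n_genes=200, seed=1):
    """Simulated log expression: tissue effect + batch effect, no species effect."""
    rng = random.Random(seed)
    labels = [(sp, ti) for sp in ("human", "mouse")
              for ti in ("heart", "liver") for _ in range(2)]
    batch = [rng.gauss(0, 3) for _ in range(n_genes)]   # assumed strong batch effect
    tissue = [rng.gauss(0, 1) for _ in range(n_genes)]  # assumed weaker tissue effect
    matrix = []
    for sp, ti in labels:
        in_batch = 1.0 if sp == "human" else 0.0  # batch == species (confounded)
        is_heart = 1.0 if ti == "heart" else 0.0
        row = [8.0 + batch[g] * in_batch + tissue[g] * is_heart + rng.gauss(0, 0.3)
               for g in range(n_genes)]
        row.append(5.0)  # one invariant gene, to exercise the filter
        matrix.append(row)
    return labels, matrix

labels, matrix = simulate()
scores, fraction = first_pc_scores(matrix)
for (species, tissue), score in zip(labels, scores):
    print(f"{species:5s} {tissue:5s} PC1 = {score:8.2f}")
print(f"PC1 explains {fraction:.0%} of the variance")
```

In this toy setup the first component splits the samples cleanly by "species", even though the only thing that differs between the two groups is the batch they were run in. That is the trap: in a confounded design, PCA cannot tell you which of the two you are looking at.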
Thus, it sure looks like Lin et al. fell into the trap described by Leek et al. Now, you have to wonder about this, ’cause there is nothing subtle going on. Batch effects have been analyzed, published, cited, and discussed for years, and it’s not like Stanford is devoid of world-class statisticians deeply involved in bioresearch.
So … what’s going on? Why the gross failure to randomize the samples to be sequenced? My guess, based on my many years of interacting with academic lab folks, is a massive communication and management failure, partly fed by a lack of visceral understanding of the statistics of how such experiments should be designed. Here’s my take:
- I’m guessing whoever sent the samples to be sequenced (or actually sequenced them) did not know or appreciate the purpose of the experiment (not every sequencing job requires randomization). In short, they didn’t run the samples according to a design that reflected the statistical requirements of the experiment’s objective. This is a particularly nasty trap when there are many hands between generating the samples and getting them sequenced. Remember, there are 13 authors on the Lin paper, so keeping everyone aligned on the goal is not easy. Why the disconnect? Read on…
- It is likely that no written analytical plan was available for consultation by the gaggle of collaborators. Instead, they were probably operating from a mass of e-mails accumulated over months or years. Furthermore, as people were added to the project, the likelihood that they were forwarded key e-mails was probably pretty low. This kind of sophomoric communication clashes with the rigor required to get a reliable answer out of experiments involving high throughput instruments. I have frequently observed this clash in academia, manifested by no/poor usage of document repositories, no project wiki, or lack of source code control systems (rare!). For example, as part of a data submission to a public database, I had to wrestle with getting batch numbers for key reagents used in generating the data produced by a core center from a major university. I failed miserably. Why? It was likely a mixture of lack of appreciation of the importance of reagent batches (sound like a potential batch effect to you?), perceived cost of changing the process, and, most likely, good old laziness. Note that this center’s database wasn’t even set up to track the protocol used to generate a given piece of data, much less reagent batches. In that particular case, it took the center’s director three years to realize that this was a problem, by his own admission. In short, if protocols weren’t being tracked, the odds clearly weren’t good for reagent batches.
- There was an obvious lack of critical supervision all along the way, only some of which I’ve described here (more later): Devising the experiment, implementing it, writing the paper, and reviewing its results (yes, the reviewers are part of the problem). Why this lack? Likely because the importance of randomization wasn’t viscerally appreciated by the folks involved, in the way that the pH of a buffer is likely to be instantly appreciated by a laboratory researcher.
How to not mess up
So how does one avoid such a catastrophic failure? Mostly, by adopting an attitude of paying attention and by being constantly critical. Beyond that, here are four things you can do to minimize the problem and increase the quality of your research product:
- Communication: For Pete’s sake, implement some kind of project wiki and keep it up to date by ferociously insisting that project members update it. This means constantly telling folks not to put important content in e-mail. Another way is to update it real-time during meetings. Unfortunately, this is a most difficult battle (e-mail is like kudzu in this respect), which is why you should get buy-in for this approach at the start of the project.
- Come up with a statistical analysis plan. Post it on the project wiki. Mail it around, refer to it in teleconferences. Doesn’t need to be long (in fact, it shouldn’t). See this template from Pfizer for inspiration. Note emphasis on objectives and confounders.
- Insist on the details when presented with results. For lab results involving instruments such as sequencers, this means “what type of sequencer?”, “how many runs?”, “how long did it take to generate?”, etc.
- Insist on a detailed protocol, meaning, something you can take into the lab. Such “real” protocols should always be provided with a manuscript submission (or at least be available on a site somewhere).
Next blog: statistics with single samples!
*: Note that no information regarding the assignment of samples for sequencing was specified in the Lin paper (including its supplementary protocol description), such that a third party reading this paper had no choice but to take it on faith that obvious errors weren’t made in allocating samples. One has to presume that the journal’s reviewers never saw these details either. One further hopes that perhaps one of them may have asked for this information, but I’m not holding my breath. Yes, I know, reviewing papers is a thankless job, though the counter to that is that one does get to see results before others do, so it’s not a negligible benefit. And I won’t mention the potential to selfishly shoot down a competitor’s paper, because that never happens, of course.
2. Lin S, Lin Y, Nery JR, et al.: Comparison of the transcriptional landscapes between human and mouse tissues. Proc Natl Acad Sci U S A.
3. Maher B (2012): ENCODE: The Human Encyclopaedia, 489: 46–48 (06 September 2012).