Four things you might not know (but should) about false discovery rate control

Useful advice on controlling the False Discovery Rate (FDR)!

Spikes & Waves

Massive increases in the amount of data scientists are able to acquire and analyze over the past two decades have driven the development of new statistical tools that can better deal with the challenges of “big data.” One such set of tools comprises methods for controlling the “false discovery rate” (FDR) in a set of statistical tests. The FDR is simply the expected proportion of statistically significant test results that are really false positives.  As you may recall from your introductory statistics course, when you perform multiple statistical tests, the probability of obtaining at least one false positive result rapidly increases.  For example, if one were to perform a single test using an alpha level of 5% and there truly is no effect of the factor being tested, there is only a 5% chance of a false positive result.  However, if one were to perform 10 tests using an alpha level of 5% for each…
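To make the arithmetic in the excerpt concrete, here is a minimal Python sketch (mine, not from the original post) of the family-wise false-positive probability for the 10-test example, plus the classic Benjamini–Hochberg step-up procedure for controlling the FDR:

```python
# Probability of at least one false positive among m independent tests,
# each run at alpha = 0.05, when no true effects exist:
alpha, m = 0.05, 10
fwer = 1 - (1 - alpha) ** m
print(f"P(>=1 false positive in {m} tests) = {fwer:.3f}")  # ~0.401, not 0.05

def benjamini_hochberg(pvals, q=0.05):
    """Return indices of hypotheses rejected at FDR level q
    (Benjamini-Hochberg step-up procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0  # largest rank r with p_(r) <= (r / n) * q
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / n * q:
            k = rank
    return sorted(order[:k])

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2, 0.9]))  # [0, 1]
```

Note how quickly the family-wise error balloons: with 10 tests at alpha = 5%, the chance of at least one false positive is already about 40%.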



CLINVITAE: A nice integration of clinically-observed genetic variants

If you’ve ever delved into the universe of genetic variants and their potential clinical effects, you have most likely quickly realized that there are lots of databases that store this type of data.

However, and as usual, the contents of these databases are all over the place with respect to how they present their data and the terminology they use. This makes the process of obtaining a single, integrated view of all variants observed in humans a difficult enterprise, even when restricting oneself to public sources.

The good folks at Invitae have gone a fairly long way toward addressing these issues by providing CLINVITAE, a search engine that operates on a large number of variant sources (see list below). There’s no way to know the fraction of the known variant space CLINVITAE covers, but it’s clearly a lot.

The search engine is very friendly (basically modeled on Google’s), making it quite suitable for one-off queries. Furthermore, they also allow downloading of the entire database. As a computational biologist, I can tell you that this is a very nice service indeed, and I salute the company for doing so!

Invitae's public database of human genetic variants

Sources of human genetic variants stored in CLINVITAE:


Statistics with single samples!

In the second post of the series on HOW NOT TO SCREW UP BIOCOMPUTATIONAL RESEARCH, I continue with this great example of epic failure, Lin et al. (2015)1. Below I address another rather amazing, yet completely avoidable deficiency of the paper: relying on single samples.

Recall that Lin et al. are trying to make the argument that the expression of genes in, e.g., brain is more influenced by the species (mouse or human) than by the tissue. In other words, they claim that human brain–mouse brain differences in gene expression are greater than, e.g., the within-species differences between mouse brain and mouse liver, or between human brain and human liver.

numbers matter

The first hint of something very wrong with this paper is that … there is no table detailing the numbers of samples. I would expect, as is very common, to see a supplementary table that says something like “mouse brain: 5 samples; human brain: 6 samples” or some such. But nary a table is to be found.

The only place where the number of samples per tissue and species might be INFERRED is under the “Noncoding Transcript Analysis” heading in the Supplementary Information section. I say “inferred”, because of course the section seems to pertain only to noncoding transcripts, so what about the ones that do code? One is left to … infer … that the same samples are probably being used throughout.

In short, nowhere in the paper is there a definitive statement about the number of samples per tissue. For a given species, they seem to be reporting results from more or less one sample per tissue. Yup, one. Think of this: that is saying that the universe of variance associated with the expression of genes in organ X across the organisms that form the species is properly represented in the organ of a single member of that species. Does anyone recall the point from basic stats that the mean of a distribution need not coincide with any actual data point? How about the wisdom of drawing conclusions from N=1 samples?

Since the paper is making assertions at the level of entire species, and therefore needs to cover a broad spectrum of factors such as the age and sex of the sample donors, and since the paper is heavily statistically motivated, you would think that detailing the properties of the samples would be crucial to provide a description of whatever was attempted toward capturing that variance.

This is even more important for the human samples, since they were obtained from deceased individuals at (presumably) wildly different ages, a very different situation from the mouse samples (all sacrificed at 10 weeks). Yet there is no table of sex, age distribution, and cause of death for the human samples. Imagine that.

batch effects rear their heads again

Now, let’s think about whether any of this might cause a batch effect. Hmm, all of the mice (which are clones, btw) are at 10 weeks of age, whereas the humans can be expected to be all over the place, although we don’t know, ’cause, you know, that’s just not important. Furthermore, let’s see, did they control for sex? Well, no. They don’t tell us anything about sex. I guess sex isn’t something known to have a huge impact on the organism in all sorts of ways, right?

So let’s summarize:

  1. single sample per organ and species
  2. no control for age
  3. no control for sex

By “control”, I mean an attempt to defuse a potential source of batch effects. For example, sex-balanced sampling from a set of samples with a range of ages.

Some might argue that controlling for the above biases would have involved a lot more sequencing, to the point of rendering the experiment economically unfeasible. Putting aside that this argument implies that it is OK to publish very weak results, that is not necessarily the case. There are designs that would have involved exactly the same amount of sequencing. For example, for a given organ and species, they could have extracted samples from a range of donors, measured ribosomal RNA concentrations from each, mixed those samples in a 1:1 molar equivalent manner, and sequenced a single library made from those mixes. It’s not ideal in various ways, but it wouldn’t have involved more sequencing and it would have provided a truer reflection of the variance of gene expression for that organ.
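The pooling design sketched above boils down to simple arithmetic: given each donor extract’s measured RNA concentration, compute the volume to draw so every donor contributes equally to the pool. A hypothetical helper (the function name, units, and the equal-mass simplification of “1:1 molar equivalent” are mine, not from the post):

```python
def equal_pool_volumes(concentrations_ng_per_ul, mass_per_sample_ng=100.0):
    """Volume (uL) to draw from each donor extract so that every donor
    contributes the same amount of RNA to the pooled library.
    Illustrative sketch: equal mass stands in for equal molar amounts,
    which is reasonable when the RNA populations are comparable."""
    return [mass_per_sample_ng / c for c in concentrations_ng_per_ul]

# Three donor extracts measured at different concentrations (ng/uL):
vols = equal_pool_volumes([50.0, 100.0, 25.0], mass_per_sample_ng=100.0)
print(vols)  # [2.0, 1.0, 4.0] uL -> each donor contributes 100 ng
```

The point of the design is that the pooled library averages over donors, so the single sequencing run reflects between-donor variance instead of one individual’s idiosyncrasies, at no extra sequencing cost.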

Incidentally, the converse scenario (sequencing a wide range of individual samples rather than a pool) provides the benefit of many more samples, which would have made it possible to control for sequencing batch effects. With more than just 13 samples, they could also have addressed a comment made by Lin et al. that “No study design given the current constraints of multiplexing and lane organization can account for both primer index and lane effect simultaneously” (see the comment section of Gilad and Mizrahi-Man’s re-analysis paper).

so how did this get published?

One might wonder as to how one gets something like this published. In other words, how could the reviewers let this pass?

Cutting to the chase, the only plausible explanation as to how this was published is that the authors were given a pass by some (all?) of the reviewers and by the editor (whose function it is to make sure the reviewers are doing their job). It is likely that such a pass might not have been forthcoming for less prominent researchers from lower profile universities than Stanford.

Folks, let’s remember that such massive failures are not consequence-free. This “research” was paid for mostly with our tax money. Taxpayers are entitled to expectations as to how these funds are being spent.

Beyond the dollars, let’s also bear in mind that whatever scientific benefit might have ensued if funding had gone elsewhere is now lost, since funding is a zero-sum game: dollars spent on X cannot be spent on Y.

And that’s not even addressing the human cost that results from the introduction of such noise into the literature. We are trying to understand life and cure disease here, and that task is not helped when finite resources are being consumed in this way.

How to not mess up

So how does one help prevent the problem? Here are three things you can do to minimize the problem and increase the quality of research products, whether yours or others’:

  1. as a scientist: when reading a paper, start with the Methods section first. It should contain a wealth of details and specifications. If it doesn’t, something is potentially fishy. Good papers have a good Methods section — simple as that.
  2. as a reviewer: insist on basic details in the Methods and Results sections. Basic questions should always be addressed (how many, what type, source). This is especially true for reagents. That includes insisting on catalog numbers for the latter, something which journals appear to increasingly request.
  3. as a lab director: follow the advice in (2). Make sure you ask, receive, and understand the details. It’s all details!


1. Lin S, Lin Y, Nery JR, et al.: Comparison of the transcriptional landscapes between human and mouse tissues. Proc Natl Acad Sci U S A. 2014; 111(48): 17224–17229.


Another take on batch problems in Lin et al. (2015)

In a recent blog post, I addressed the rather catastrophic (and totally avoidable) batch problems encountered by Lin et al. (2015) in their paper investigating whether the source organ has a greater influence on gene expression than the species.

Rafael Irizarry has written a nice piece on this specific topic (also focusing on batch problems in Lin et al.) as part of his very helpful simplystats blog. It touches on many of the same messages as my write-up, but is more statistical in nature, and includes a nice pointer to an online introductory linear modeling course from Harvard. Well worth reading, especially if you need a refresher on the math.

Also, do check out Rafael’s book, Bioinformatics and Computational Biology Solutions Using R and Bioconductor. This was a favorite at Stanford (I know this in several different ways!). A bit old (2005), but still very useful.


Hello batch effect!

The first post in this series begins with deconstructing two influential papers with respect to statistics, bioinformatics and organizational behavior. Why? Because these papers provide an awesome learning opportunity to avoid many of the ills of academic bioscience, as observed by Yours Truly.

In these posts I’ll be expounding on how to AVOID MESSING UP when analyzing data in ways so bad that your career becomes deeply compromised, your PI hates you, your university wishes it had never heard of you, and future funding becomes questionable at best. So, below is the first installment. Comments welcome!

The bad JUJU of Batch Effects

Recently, Gilad and Mizrahi-Man (“G&M” from hereon) published a fascinating re-analysis1 of the data underlying the “surprising” findings from Lin et al.2, researchers in the Snyder Lab at Stanford University. The “surprise” involved the discovery that the expression of genes across species appeared to be more heavily influenced by the species than by the source tissue. As stated in the Lin paper:

“Overall, our results indicate that for the human–mouse comparison, tissues appear more similar to one another within the same species than to the comparable organs of other species when examining a more complete set of tissue types.”

Now, G&M used the word “surprising” because several studies had reported the converse finding: gene expression tends to cluster by tissue, not species. Lin et al., in other words, claim that the influence of the species is greater than that of the tissue, which is odd. Given that the Lin paper was published under the aegis of the massive ENCODE project (~$400M3), such a finding was sure to be influential and receive considerable attention.

And attention it got: G&M decided to dig into Lin et al.’s data to try to rule out alternative, competing explanations for the finding. It is one of the marvels of the age that one is increasingly able to obtain raw data and reverse engineer a paper’s analysis, though this still happens much too rarely (more on this topic later).

What G&M found was a whole goat rodeo of problems, many of a highly sophomoric nature, ranging from basic statistical errors to poor programming. Consequently, and at a minimum, the original findings of Lin et al. must now be substantially discounted until further validation.

Working from the original FASTQ sequencing files generated by the RNA-Seq process, G&M were able to reconstruct the sequencing study design from sequence identifiers present in those files (thank God for digital biology)*. This allowed them to determine when the samples were sequenced, on which machine, and in which lanes, yielding the sample allocation design depicted in Figure 1. This is reproduced from the G&M paper and shows that samples were sequenced in multiple batches, BUT that only one batch (out of five) included samples from BOTH species.

At this point, you should start getting nervous, because this design fails to control for one of the most basic sources of confounding influence, namely, batch effects. In statistics, a confounder is an extraneous variable in a statistical model (e.g., a linear regression) that correlates with both the dependent variable (here, gene expression) and the independent variable (here, the samples from different tissues and species).

Figure 1: Original sequencing plan as inferred by Gilad and Mizrahi-Man, 20151. Only one flow cell was used to sequence a combination of human and mouse samples. Such a design creates well-known, and highly confounding, batch effects. I’ve added the red rectangle to highlight the different instruments.

Lin et al. sequencing design

A more rigorous definition is provided in a seminal review by Leek et al. entitled “Tackling the widespread and critical impact of batch effects in high-throughput data”4 (sound relevant?). Published five years ago, and with 420 citations as of this writing according to Google Scholar (read: VERY INFLUENTIAL), it states that “Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study.” In other words, part of the observed variation in a signal is not due to the natural phenomenon under investigation, but rather to the experiment design.
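A toy simulation makes the danger concrete. In the sketch below (entirely illustrative, not the paper's data), both "species" have identical biology; only the sequencing batch differs, yet a naive comparison of group means reports a large "species effect":

```python
import random

random.seed(1)

def simulate_expression(n, batch_shift, noise=1.0):
    """Toy measurements: the true biological mean (10.0) is identical for
    every group; only the batch offset differs."""
    return [10.0 + batch_shift + random.gauss(0, noise) for _ in range(n)]

# Species A sequenced entirely in batch 1 (+2.0 machine offset),
# species B entirely in batch 2 (no offset) -- batch confounded with species:
a = simulate_expression(50, batch_shift=2.0)
b = simulate_expression(50, batch_shift=0.0)

mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)
# The observed "species difference" is really just the batch difference:
print(mean_a - mean_b)  # close to 2.0 despite identical biology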

Now, batch effects can creep in even with mindful experiment design, but in the Lin paper, there was no attempt to minimize biases resulting from the non-random grouping of samples. A correct design would have involved randomly assigning samples to lanes and sequencers, to minimize biases that can result from how a sequencer or lane performs from one day to the next. Such biases are not restricted to sequencers per se, but include anything that pertains to generating the data (the operator, the temperature in the room, the day of the week, etc.). The general principle here: RANDOMIZATION IS YOUR FRIEND. Always remember that.
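Randomizing the sample-to-lane assignment is a one-liner's worth of effort. A minimal sketch (function name, lane counts, and sample labels are mine; real designs would additionally balance group sizes across lanes):

```python
import random

def randomize_to_lanes(samples, n_flowcells, lanes_per_flowcell=8, seed=None):
    """Randomly assign each sample to a (flowcell, lane) slot, so that
    species/tissue groups are not confounded with sequencing batch."""
    slots = [(fc, lane) for fc in range(n_flowcells)
                        for lane in range(lanes_per_flowcell)]
    rng = random.Random(seed)          # fixed seed -> reproducible design
    rng.shuffle(slots)
    order = samples[:]
    rng.shuffle(order)
    # Cycle through the shuffled slots; several samples may share a lane.
    return {s: slots[i % len(slots)] for i, s in enumerate(order)}

samples = [f"{sp}_{tissue}" for sp in ("human", "mouse")
           for tissue in ("brain", "liver", "heart")]
assignment = randomize_to_lanes(samples, n_flowcells=2, seed=42)
print(assignment)
```

Recording the seed alongside the design also gives you exactly the audit trail that was missing from the Lin paper.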

The funny thing is, batch effects are not a new problem, and they aren’t specific to high throughput instruments, as Leek et al. point out:

“Although batch effects are difficult or impossible to detect in low-dimensional assays, high throughput technologies provide enough data to detect and even remove them. However, if not properly dealt with, these effects can have a particularly strong and pervasive impact. Specific examples have been documented in published studies in which the biological variables were extremely correlated with technical variables, which subsequently led to serious concerns about the validity of the biological conclusions”. (emphasis added)

I include the bit about published experiments failing to correct for batch effects because such a failure definitely seems to be at play in the Lin paper. An interesting sidebar here is that if your data becomes publicly available (as is increasingly the norm), AND there is enough of it (e.g., you’re using a high-throughput instrument), any random researcher will have a good chance of determining whether you have messed up. Scary, huh? Let that provide motivation for doing it right, ’cause your sins may be plastered for all to see. More on that, below.

So, back to the re-analysis: G&M corrected for the sequencing sample confounding by using the ‘ComBat’ function from the sva package5 (BioConductor, of which sva is part, is a suite of R packages for biomedical data analysis). This function was used to condition the data for inspection using Principal Component Analysis (PCA). As further explained by G&M:

“To visualize the data, we used the function ‘prcomp’ (with the ‘scale’ and ‘center’ options set to TRUE) to perform principal component analysis (PCA) of the transposed log-transformed matrix of ‘clean’ values (after removal of invariant columns, i.e. genes), and the ggplot2 package to generate scatter plots of the PCA results. None of the first five principal components (accounting together for 56% of the variability in the data) support the clustering of the gene expression data by species”. (emphasis added)

In short, a re-analysis that accounted for the original sample allocation showed that the principal components capturing more than half of the observed variance lend no support to clustering by species, thus dramatically weakening the purported “species over tissue” effect. It’s worth noting that PCA, a widely used technique, was used by Lin et al. themselves in their paper.
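G&M's recipe (center, scale, run PCA, inspect variance per component) can be mimicked outside R. Below is a minimal Python/NumPy sketch of centered-and-scaled PCA via SVD on simulated data; the function name `prcomp_like` and the toy matrix are mine, and this is an illustration of the technique, not their actual analysis:

```python
import numpy as np

def prcomp_like(X):
    """Centered, scaled PCA via SVD, mirroring R's
    prcomp(X, center=TRUE, scale.=TRUE). Rows = samples, columns = genes.
    Invariant columns must be removed first (as G&M did), since scaling
    divides by each column's standard deviation."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # ddof=1 matches R's sd()
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                       # sample coordinates on the PCs
    explained = s**2 / np.sum(s**2)      # fraction of variance per PC
    return scores, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(13, 50))            # e.g., 13 samples x 50 genes
scores, explained = prcomp_like(X)
print(explained[:5].sum())               # variance share of the first 5 PCs
```

Plotting the first two columns of `scores`, colored by species and by batch, is exactly the kind of picture that exposes whether samples cluster by biology or by sequencing run.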

Thus, it sure looks like Lin et al. fell into the trap described by Leek et al. Now, you have to wonder about this, ’cause there is nothing subtle going on. Batch effects have been analyzed, published, cited, and discussed for years, and it’s not like Stanford is devoid of world-class statisticians deeply involved in bioresearch.

So … what’s going on? Why the gross failure to randomize the samples to be sequenced? My guess, based on my many years of interacting with academic lab folks, is a massive communication and management failure, partly fed by a lack of visceral understanding of the statistics of how such experiments should be designed. Here’s my take:

  1. I’m guessing whoever sent the samples to be sequenced (or actually sequenced them) did not know or appreciate the purpose of the experiment (not every sequencing job requires randomization). In short, they didn’t run the samples according to a design that reflected the statistical requirements of the experiment’s objective. The latter is a particularly nasty trap when there are many hands between generating the samples and getting them sequenced. Remember, there are 13 authors on the Lin paper, so keeping everyone aligned on the goal is not easy. Why the disconnect? Read on…
  2. It is likely that no written analytical plan was available for consultation by the gaggle of collaborators. Instead, they were probably operating from a mass of e-mails accumulated over months or years. Furthermore, as people were added to the project, the likelihood that they were forwarded the key e-mails was probably pretty low. This kind of sophomoric communication clashes with the rigor required to get a reliable answer out of experiments involving high-throughput instruments. I have frequently observed this clash in academia, manifested by no or poor usage of document repositories, no project wiki, or the absence of source code control systems (rare!). For example, as part of a data submission to a public database, I had to wrestle with getting batch numbers for key reagents used in generating the data produced by a core center at a major university. I failed miserably. Why? It was likely a mixture of a lack of appreciation of the importance of reagent batches (sound like a potential batch effect to you?), the perceived cost of changing the process, and, most likely, good old laziness. Note that this center’s database wasn’t even set up to track the protocol used to generate a given piece of data, much less reagent batches. In that particular case, it took the center’s director three years to realize that this was a problem, by his own admission. In short, if protocols weren’t being tracked, the odds clearly weren’t good for reagent batches.
  3. There was an obvious lack of critical supervision all along the way, only some of which I’ve described here (more later): Devising the experiment, implementing it, writing the paper, and reviewing its results (yes, the reviewers are part of the problem). Why this lack? Likely because the importance of randomization wasn’t viscerally appreciated by the folks involved, in the way that the pH of a buffer is likely to be instantly appreciated by a laboratory researcher.

How to not mess up

So how does one avoid such a catastrophic failure? Mostly, by adopting an attitude of paying attention and by being constantly critical. Beyond that, here are four things you can do to minimize the problem and increase the quality of your research product:

  1. Communication: For Pete’s sake, implement some kind of project wiki and keep it up to date by ferociously insisting that project members update it. This means constantly telling folks not to put important content in e-mail. Another way is to update it real-time during meetings. Unfortunately, this is a most difficult battle (e-mail is like kudzu in this respect), which is why you should get buy-in for this approach at the start of the project.
  2. Come up with a statistical analysis plan. Post it on the project wiki. Mail it around, refer to it in teleconferences. Doesn’t need to be long (in fact, it shouldn’t). See this template from Pfizer for inspiration. Note emphasis on objectives and confounders.
  3. Insist on the details when presented with results. For lab results involving instruments such as sequencers, this means “what type of sequencer?”, “how many runs?”, “how long did it take to generate?”, etc.
  4. Insist on a detailed protocol, meaning something you can take into the lab. Such “real” protocols should always be provided with a manuscript submission (or at least be available on a site somewhere).

Next blog: statistics with single samples!

*: Note that no information regarding the assignment of samples for sequencing was specified in the Lin paper (including its supplementary protocol description), so a third party reading the paper had no choice but to take it on faith that obvious errors weren’t made in allocating samples. One has to presume that the journal’s reviewers never saw these details either. One further hopes that perhaps one of them asked for this information, but I’m not holding my breath. Yes, I know, reviewing papers is a thankless job, though the counter to that is that one does get to see results before others do, so it’s not a negligible benefit. And I won’t mention the potential to selfishly shoot down a competitor’s paper, because that never happens, of course.

