Sunday, 1 July 2012

Dead fish, and multiple-test correction

In which a salmon is resurrected, but not enough to really be significant.  Why finding 20 positive results when your P-value threshold suggests you should only see 10 isn't necessarily anything to be excited about.  And an introduction to Bonferroni and Benjamini-Hochberg multiple test correction.

This fish isn't dead.  It's just resting.


A while ago, some mischievous psychologists decided to skewer what they perhaps considered to be some overinterpretation of functional magnetic resonance imaging (fMRI) results, in a nice poster.  Not too long after that, a manuscript appeared in the Journal of Serendipitous and Unexpected Results. That's not exactly a mainstream journal, but apparently there can be some friction from a community when you demonstrate that, using widely-applied statistical methods, it is possible to detect brain activity in a dead fish but, if you use the appropriate statistical methods, you don't.  Who'd have thought it!

In a nutshell, the issue is that fMRI makes very large numbers of individual measurements (voxels, or volume elements).  The subsequent analysis tests each voxel to see whether the activity in that voxel, under a stimulus to the individual, is statistically distinct from some baseline measurement (i.e. the null hypothesis is that the activity is no different).  The test statistic might be, for example, a Z-score.

The main point is that very large numbers of tests are being made simultaneously (in the tens to hundreds of thousands), and that they are assumed to be independent.  In a naïve analysis, a statistical threshold is applied to each measurement to ask whether a difference that large would be expected by chance, with a particular probability (say, P<0.001), and any result with an individual probability less than this value is considered significant: a 'positive'.

It is worth pointing out that by no means are fMRI analyses typically this naïve...

In modern biology, we often make this many measurements, and this many statistical tests, at once.  When we use microarrays, RNA-Seq, or run some kinds of proteomic and metabolomic analyses, we are making measurements analogous to fMRI voxels.  Even in genomics, when we run analyses or classifiers over a very large number of genome, protein, gene or RNA sequences, we are often making these kinds of multiple tests.

The parable of the Lazarus Salmon speaks to us all...

The problem of too many measurements.


The problem arises when you have 'too many' measurements.  Under the null hypothesis, a result with an individual P-value of P<0.001 would be expected to come up approximately once in every 1000 trials.  But when you have many thousands of trials, as here, you would expect to find several results passing this threshold by chance alone.  For 40,000 trials, you would expect around 40 such results.  These truly random events aren't associated with the effects of the applied condition in the experiment, so would rightly be called false positives.  That is to say, even with no actual difference between the baseline state and the modified state in this experiment, we would expect to see 40 or so apparently positive results, purely due to random variation.
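As a quick sanity check of that arithmetic, here is a minimal sketch (my own illustrative snippet, not part of the original analysis) that simulates 40,000 null measurements and counts how many clear a two-tailed P<0.001 threshold:

# simulate 40,000 null 'voxels' and count chance 'positives' at P < 0.001 (two-tailed)
z <- rnorm(40000)             # Z-scores under the null hypothesis
pvals <- 2 * pnorm(-abs(z))   # two-tailed P-values
sum(pvals < 0.001)            # expected to be around 40000 * 0.001 = 40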

Here, using R, I have run 1000 'experiments' where all the data is sampled from a Normal distribution with mean equal to zero, and standard deviation equal to unity.  Taking 10000 datapoints in each experiment, I counted the number of datapoints with P<0.001 (note that this is two-tailed, so the actual value in either direction is P<0.0005).

In 'biological' terms, this is equivalent to making measurements on a 'wild-type' or 'control' sample only. We're measuring 10000 things (genes, proteins, etc.) simultaneously.  The expected value of every individual measurement is zero, and the standard deviation is 1.  


> runs = 1000
> false_positives = numeric(runs)
> for (run in 1:runs) {
+   false_positives[run] <- sum(abs(rnorm(10000)) > abs(qnorm(0.0005)))
+   }
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   10.00   10.01   12.00   22.00


We can see that, on average, we find 10 false positives: datapoints that appear to be significantly different from our assumed value of zero, at a P-value of 0.001.  That is exactly what we'd expect for this much random data (10000 × 0.001 = 10).  For any individual experiment, though, that number was anywhere between 1 and 22.

If we were running a single (perhaps expensive) experiment with 10000 measurements, and found 22 of those measurements that appeared to be significantly different from the expected value when the P-value threshold suggested we should only see 10, we might get excited, as 'obviously' 12 of the results are not false positives!  We spent a lot of money, perhaps, so we need these results to be meaningful.  But these simulations suggest we should be a little more cautious in our enthusiasm: 22 apparent positives is within the range we just saw from pure noise.  In particular, we need to correct for the effect of making so many simultaneous independent tests: we need to use multiple-test correction.

The Bonferroni Correction.


This idea is hardly new, and an early, intuitive approach to solving this problem carries the name of its inventor: the Bonferroni correction.  The principle is simple: if we would consider an individual test significant at probability P<p, then we just divide that threshold by the number of tests n we're carrying out, and require P<p/n instead.  For example, with p=0.001 and n=10000 tests, the corrected threshold becomes P<0.0000001.  It's simple, but it's also rather conservative, as we'll see.

Using the same R code as before, but applying the Bonferroni correction, we find essentially no false positives:


> p = 0.0005
> n = 10000
> runs = 1000
> false_positives = numeric(runs)
> for (run in 1:runs) {
+   false_positives[run] <- sum(abs(rnorm(n)) > abs(qnorm(p/n)))
+   }
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.001   0.000   1.000

Job done, seemingly.  But what happens when we add some 'real' data?  We do this by introducing a small number of values - here, 20 - from a Normal distribution with a different mean (equal to 3), but the same variance.

In 'biological' terms, this is equivalent to there being 20 measurements (genes, proteins, etc.) drawn from a distribution with an expected mean value that is three standard deviations away from the control value, with the same variability as the control.  This is, intuitively, a fairly significant difference.  But we have to remember that both the control and modified data have errors associated with them, so the measured value may not correspond to the mean.

> p = 0.0005
> bg_samples = 9980
> positive_samples = 20
> runs = 1000
> true_positives = numeric(runs)
> false_positives = numeric(runs)
> for (run in 1:runs) {
+   true_positives[run] <- sum(abs(rnorm(positive_samples, 3)) > abs(qnorm(p)))
+   false_positives[run] <- sum(abs(rnorm(bg_samples)) > abs(qnorm(p)))
+   }
> summary(true_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   6.000   8.000   7.774   9.000  15.000 
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   10.00   10.04   12.00   21.00 

With no correction, we see the same number of false positives as before (around 10), and recover around 40% of the true positives (around 8 of the 20, on average) from the distribution with mean equal to 3.  As with my previous post, in the presence of an appreciable false positive rate and a large number of negative examples, the base rate of positive examples matters: here, roughly 8 true against 10 false positives means that fewer than half of our total positive results are true positives.

Using the Bonferroni correction, we can eliminate all our false positives.  But we lose almost all of our true positives, into the bargain:

> n = 10000
> for (run in 1:runs) {
+   true_positives[run] <- sum(abs(rnorm(positive_samples, 3)) > abs(qnorm(p/n)))
+   false_positives[run] <- sum(abs(rnorm(bg_samples)) > abs(qnorm(p/n)))
+   }
> summary(true_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.194   0.000   3.000 
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.001   0.000   1.000

That's not much use, so clearly we need something more sophisticated, which is where Benjamini-Hochberg comes in.

Benjamini and Hochberg, statistical crime fighters


The Benjamini-Hochberg method for reducing the number of false positives (paper here) is also named after its inventors.  It's a fairly intuitive approach as well, but arguably not quite as straightforward as Bonferroni's method.  We also need to introduce a new concept: the false discovery rate.

In this context, the false discovery rate is, simply, the proportion of your positive results that you are willing to tolerate being false positives.  If, out of 100 positive results, you want 95 to be 'true positives', you would set the false discovery rate to 5%.  Denoting this rate by α, we'd say that α=0.05.
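To make that concrete, here is a minimal sketch (my own, not from the original post) of the quantity being controlled, computed from per-run counts of true and false positives like the true_positives and false_positives vectors used throughout these simulations:

# realised false discovery proportion for each run (zero when nothing is called positive)
calls <- true_positives + false_positives
fdp <- ifelse(calls > 0, false_positives / calls, 0)
mean(fdp)   # the BH procedure aims to keep this average at or below alpha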

The basic idea behind the Benjamini-Hochberg (BH) method is that the 'most significant' (lowest P-value) datapoints are more likely to be true positives but, as the datapoints become 'less significant' (larger P-value), they are more likely to be false positives.  We can knock up this sort of data in R:

> pos <- c(numeric(positive_samples)+1, numeric(bg_samples)) > 0
> pvals <- 1-c(pnorm(rnorm(positive_samples, 3)), pnorm(rnorm(bg_samples)))

> data <- data.frame(true_pos=pos, pvalues=pvals)
> summary(data)
  true_pos          pvalues         
 Mode :logical   Min.   :3.267e-06  
 FALSE:9980      1st Qu.:2.529e-01  
 TRUE :20        Median :5.023e-01  
 NA's :0         Mean   :5.024e-01  
                 3rd Qu.:7.504e-01  
                 Max.   :9.999e-01  
> summary(data[data$true_pos,])
 true_pos          pvalues         
 Mode:logical   Min.   :3.267e-06  
 TRUE:20        1st Qu.:5.516e-04  
 NA's:0         Median :1.506e-03  
                Mean   :1.331e-02  
                3rd Qu.:1.561e-02  
                Max.   :6.711e-02  
> summary(data[!data$true_pos,])
  true_pos          pvalues         
 Mode :logical   Min.   :2.984e-05  
 FALSE:9980      1st Qu.:2.545e-01  
 NA's :0         Median :5.030e-01  
                 Mean   :5.034e-01  
                 3rd Qu.:7.510e-01  
                 Max.   :9.999e-01
> sorted.data <- data[with(data, order(pvalues)),]

> sorted.data[1:20,]
     true_pos      pvalues
14       TRUE 3.267055e-06
2858    FALSE 2.983976e-05
15       TRUE 9.258110e-05
8        TRUE 1.255841e-04
5426    FALSE 2.368875e-04
4613    FALSE 2.987703e-04
20       TRUE 3.321229e-04
357     FALSE 4.231101e-04
3        TRUE 5.249242e-04
3861    FALSE 5.316144e-04
10       TRUE 5.605230e-04
2884    FALSE 6.157423e-04
5821    FALSE 6.676419e-04
5        TRUE 6.909282e-04
17       TRUE 8.245977e-04
1743    FALSE 8.743989e-04
5214    FALSE 9.743956e-04
8934    FALSE 9.872405e-04
1485    FALSE 1.115055e-03
6597    FALSE 1.339363e-03




We can see that the 'true positives' all have fairly small P-values, but that they overlap with the most extreme values from the background distribution (the false positives).  This suggests a strategy: rank the datapoints from lowest to highest P-value, and choose a cut-off beyond which the expected proportion of false positives becomes unacceptable.

In the BH method, this threshold is calculated as iα/n where, as before, n is the number of tests, α is the desired false discovery rate, and i is the rank of the datapoint when P-values are ordered from lowest to highest.  A couple of examples should make this clearer...

After ranking - ordering datapoints from lowest to highest P-value - we compare each point's P-value with its own threshold.  The first point in the list has i=1, so it is tested against α/n; the second point is tested against 2α/n; the third against 3α/n, and so on.  We then find the largest rank i whose P-value falls below its threshold iα/n, and call everything up to and including that rank a 'positive' (note that this is not quite the same as stopping at the first failure, since a later point may still pass its larger threshold).  This procedure controls the false discovery rate - the expected proportion of positives that are false positives - at or below α.
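As a standalone sketch of that step-up rule (my own helper function, not part of the original post), assuming a vector of raw P-values called pvals:

# BH step-up rule: mark as positive everything up to the largest rank that passes its threshold
bh_positives <- function(pvals, alpha = 0.05) {
  n <- length(pvals)
  ord <- order(pvals)                      # indices of P-values, smallest first
  thresh <- (1:n) * alpha / n              # rank-based thresholds i*alpha/n
  passing <- which(pvals[ord] <= thresh)   # ranks that pass their own threshold
  k <- if (length(passing) > 0) max(passing) else 0
  positive <- logical(n)
  positive[ord[seq_len(k)]] <- TRUE        # everything up to the largest passing rank
  positive
}

Calling sum(bh_positives(pvals, 0.05)) on the simulated data above returns the number of datapoints accepted at a 5% false discovery rate.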

Implementing this in R for 1000 experiments (the loop below takes a small shortcut, flagging each datapoint whose P-value falls below its own rank-based threshold, which is slightly more conservative than the full step-up rule sketched above), we can estimate the number of true and false positives:

> alpha = 0.05
> p = 0.0005
> bg_samples = 9980
> positive_samples = 20
> n = bg_samples + positive_samples
> runs = 1000
> true_positives = numeric(runs)
> false_positives = numeric(runs)
> for (run in 1:runs) {
+   pos <- c(numeric(positive_samples)+1, numeric(bg_samples)) > 0
+   pvals <- 1-c(pnorm(rnorm(positive_samples, 3)), pnorm(rnorm(bg_samples)))
+   data <- data.frame(true_pos=pos, pvalues=pvals)
+   data$prank <- rank(data$pvalues)
+   data$thresh <- data$prank*alpha/n
+   data$positive <- data$pvalues < data$thresh
+   true_positives[run] <- nrow(data[data$positive & data$true_pos,])
+   false_positives[run] <- nrow(data[data$positive & !data$true_pos,])
+   }
> summary(true_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.544   4.000  10.000 
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.183   0.000   3.000 

So, compared with the Bonferroni correction, we have increased the mean number of true positives to around 2.5 per experiment, while keeping the number of false positives close to zero.  Increasing α to 0.25 and then to 0.5 shows that we can improve on the Bonferroni correction by progressively identifying more true positives, but at the expense of admitting a larger proportion of false positives:

> alpha = 0.25
[...]
> summary(true_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   4.000   6.000   6.039   8.000  15.000 
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    2.00    2.55    4.00   18.00 
> alpha = 0.5
[...]
> summary(true_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    7.00    9.00    9.07   11.00   18.00 
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    5.00    9.00   10.67   15.00   46.00 



Our overall recovery of positive examples isn't always very good, but we can see that we are, at least, controlling the proportion of false positives.

Our examples so far have had a pretty weak signal: only 20 of 10000 measurements are actually drawn from a different distribution to the background.  What happens if we increase the number of positives to 1000, with 9000 negative examples?

> bg_samples = 9000

> positive_samples = 1000
> alpha = 0.05
[...]
> summary(true_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  542.0   592.0   606.0   606.4   620.0   670.0 
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.00   25.00   29.00   28.61   32.00   46.00

We reliably recover around 60% of the true positives, while admitting only around 30 false positives per experiment - a false discovery proportion of roughly 29/635, or about 4.5%, in line with the α=0.05 we asked for.  Contrast this with the uncorrected results:

> for (run in 1:runs) {
+   true_positives[run] <- sum(abs(rnorm(positive_samples, 3)) > abs(qnorm(p)))
+   false_positives[run] <- sum(abs(rnorm(bg_samples)) > abs(qnorm(p)))
+ }
> summary(true_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  339.0   377.0   387.0   386.3   397.0   437.0 
> summary(false_positives)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   7.000   9.000   8.975  11.000  18.000 

The uncorrected analysis gives fewer false positives here, certainly, but far more false negatives: it misses over 200 true positives that the BH procedure recovers.  In this example, at least, using BH multiple-test correction has improved the sensitivity of the analysis while keeping the false discovery rate close to the level we chose.


More methods...


The Bonferroni and BH methods for multiple test correction are pretty simple to apply, with the trade-off that they make assumptions that lead to rather conservative selection of positive examples - Bonferroni more than BH.  These assumptions might not apply to your data.  

For instance, both Bonferroni and BH (in this form) assume that all your tests are independent.  This is clearly not always true for biological data.  The expression level of one gene may depend on the expression levels of other genes that are also being tested in that experiment, for example.  In that case, the Benjamini-Yekutieli procedure (sometimes written Benjamini-Hochberg-Yekutieli) can be used: a straightforward modification of the BH method that remains valid when the tests are correlated.
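For reference, base R's p.adjust function implements several of these corrections directly - including Bonferroni, BH and the Benjamini-Yekutieli ('BY') variant - so there is no need to hand-roll the thresholds.  A minimal sketch, reusing the style of simulated P-values from earlier:

# adjust the raw P-values, then count positives at a 5% threshold / false discovery rate
pvals <- 1 - c(pnorm(rnorm(20, 3)), pnorm(rnorm(9980)))   # 20 'real' signals among 10000 tests
sum(p.adjust(pvals, method = "bonferroni") < 0.05)        # family-wise error control; very conservative
sum(p.adjust(pvals, method = "BH") < 0.05)                # BH false discovery rate control
sum(p.adjust(pvals, method = "BY") < 0.05)                # valid under dependence between tests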

In the fMRI example above (and for several other kinds of biological data) there may be spatial structure to the data, so that datapoints that are 'near' each other in the sample are correlated.  Spatial data may be improved by smoothing, to give a higher signal-to-noise ratio, and this introduces dependence into the data, as information from neighbouring datapoints is used to modify the value at each smoothed point.  The smoothed data can then be used to identify resels (resolution elements), and the expected number of resels exceeding a particular confidence level can be estimated using Gaussian random field theory.

There are several other methods out there for multiple test correction, and one of them might be more appropriate to your data than any of those listed above.  There are also other techniques that can be used to help give more confidence in your results.

Conclusions


It's entirely possible that, using any of these methods, we could be throwing away good positive results.  Certainly, if the signal is weak (as in most of our examples above), and we want to keep false positives down, we might well end up losing the vast majority of true positive datapoints.  It's a valid concern, but rarely a good enough reason to ignore the problem and fail to correct for multiple testing.  As with all experiments, there's a trade-off (often financial) between focusing on the most 'certain' data, and being as inclusive as possible.

The key point in all this is that, if we use large-scale experiments as attempts to get at truth in nature, then the non-intuitive nature of large datasets can easily get the better of us, and we need to use appropriate statistical methods to avoid being misled by the data.  If we don't use these techniques, we could end up, metaphorically, chasing dead fish.




