In which an oft-overlooked bit of genome-mining statistics is considered, and your enjoyment of a holiday could depend heavily on other people's hygiene.

Last week I had the pleasure of giving a presentation about mining pathogen genomes for effector proteins, and helping to train a group of young plant pathology researchers from around the world, as part of an EMBO training course (more about the course here and here). One of the points I wanted to get across in my presentation was that bioinformatic prediction methods - specifically functional classifiers - are rarely absolute, black-and-white, yes/no indicators. I also wanted to show that the uncertainty involved in those predictions can be quantified, and that an appreciation of some simple statistics can help avoid frustration with those tools, and help guide and improve experiments that use those predictions.

### The Base Rate Fallacy

One of the statistical issues I brought up was the effect of the base rate fallacy on the interpretation of functional predictions. We can frame this intuitively in the form of a medical test:

Imagine that you've gone to your family doctor for a routine checkup, and as part of that examination, the doctor performs a test for Disease X. This disease is unsightly and embarrassing, and appears only when you wear bathing clothes, and so is likely to ruin your beach holiday.

The test has pretty good diagnostic capability. It has a 5% false negative rate (FNR). That is, in a line-up of 100 people who have Disease X, we expect the test to give a correct positive result for 95 of them. It also has a 1% false positive rate (FPR). That is, in a similar line-up of 100 people who don't have the disease, it wrongly gives a positive result for only one of them.

Your test result comes back positive. Unlucky. Time to buy a kaftan for your holiday.

But all is not lost. You might yet be able to keep those Primark Kaftan vouchers for another day. What you don't know yet is the probability that, given that your test result is positive, you actually have Disease X. And for this you need to know the base rate: what the prevalence of this disease is in the population as a whole.

You can think of it like this: if nearly all people have Disease X, then most people taking the test will have the disease, and the 'dominating' statistic for the test's performance is the false negative rate. However, if hardly any of the population have Disease X, most people being tested will not have the disease, and the 'dominating' statistic is the false positive rate.

Most importantly, if the proportion of the population with the disease is small enough, then positive results will overwhelmingly come from people without the disease.

That might seem counter-intuitive, but we can illustrate it with some numbers.

Say we have a population of one million people, and the majority of the population - 80% - have Disease X. Then we can assume that 80% of people taking the test have the disease.

- If 1000 people take the test, we expect 800 of them to have the disease.
- 5% of these - 800 x 0.05 = 40 - will receive a false negative result; 760 will receive a correct positive result.
- 200 people taking the test will not have the disease. But 1% of these people - 200 x 0.01 = 2 - will receive a positive result.

On the other hand if only a small minority, say 1% of the population, have Disease X, we can assume that only 1% of people taking the test have the disease.

- If 1000 people take the test, we expect 10 of them to have the disease.
- 95% of these people - 10 x 0.95 = 9.5 (essentially all) - will receive a correct positive result.
- But there are 990 people without the disease taking the test, and 1% of these people - 990 x 0.01 = 9.9 - will wrongly receive a positive result.

So, with such a small base rate of the disease in the population, we expect approximately 20 positive test results out of 1000 tests, and about half of these will be false positives. In particular, given your positive result, the probability that you have Disease X is around 10/20 = 0.5. That's practically a coin toss.
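The arithmetic for the two scenarios above can be sketched in a few lines of Python. The function name `expected_counts` and its parameter names are illustrative choices only; the 5% FNR and 1% FPR defaults are the figures quoted for the Disease X test.

```python
def expected_counts(n_tested, base_rate, fnr=0.05, fpr=0.01):
    """Expected test outcomes for n_tested people at a given disease base rate."""
    with_disease = n_tested * base_rate
    without_disease = n_tested - with_disease
    true_pos = with_disease * (1 - fnr)   # have the disease, correctly flagged
    false_neg = with_disease * fnr        # have the disease, missed by the test
    false_pos = without_disease * fpr     # no disease, but flagged anyway
    # Probability that a positive result reflects real disease
    p_disease_given_pos = true_pos / (true_pos + false_pos)
    return true_pos, false_neg, false_pos, p_disease_given_pos


for rate in (0.80, 0.01):
    tp, fn, fp, p = expected_counts(1000, rate)
    print(f"base rate {rate:.0%}: TP={tp:.1f} FN={fn:.1f} FP={fp:.1f} "
          f"P(disease | positive)={p:.3f}")
```

At an 80% base rate nearly every positive result is genuine (760 true positives against 2 false positives), whereas at 1% the true and false positives are almost evenly matched.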

Now, if we tested the whole population, with a base rate of 1% for the disease, we would expect 9500 true positive results, and 9900 false positives. Again, pretty much a 50:50 chance. If we don't change the base rate, it doesn't matter how large or small a sample we take.
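One way to see why the sample size drops out is to write the calculation down using Bayes' theorem. As a sketch, with symbols introduced here purely for illustration ($p$ for the base rate, $N$ for the number of people tested, and FNR and FPR for the test's two error rates):

$$
P(\text{disease} \mid \text{positive})
= \frac{N\,p\,(1-\mathrm{FNR})}{N\,p\,(1-\mathrm{FNR}) + N\,(1-p)\,\mathrm{FPR}}
= \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}}
$$

The $N$ cancels from numerator and denominator, leaving only the base rate and the test's error rates. With $p = 0.01$, $\mathrm{FNR} = 0.05$ and $\mathrm{FPR} = 0.01$ this gives $0.0095 / (0.0095 + 0.0099) \approx 0.49$.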

This kind of consideration has immense impact on medical (and genome) screening applications, and is a source of much controversy in emotive areas. If a test for a rare disease essentially gives you a coin toss probability for having Disease X, would you risk a life-threatening or expensive treatment to avoid it, in the absence of other evidence? More to the point in some countries, would your health service or insurer pay for it on those grounds?

The same statistical considerations apply to any screening procedure that involves such a classifier, including the less controversial case of screening genomes for effectors. If, instead of a medical test for Disease X, we're looking for a sequence motif that might be diagnostic for an effector (a type III secretion motif or RxLR, for example), and we know the predictive performance of our classifier, we still need to know the base rate of effector occurrence to interpret our positive results.

In most cases we don't know the true effector count for an organism, since they're predicted as effector candidates using the very classifiers we're talking about, which is somewhat circular. But in some well-studied systems (e.g. *Pseudomonas syringae*) we've got a very good idea, with supporting experimental evidence in all - or nearly all - cases. Regardless of the extent of support, the base rate of occurrence of any given effector class is quite low: on the order of 1%. This has implications for our interpretation of classifier results.

### An example from the literature

Without wanting to pick on any method in particular, a useful case is the paper from Arnold *et al.* in which they evaluate their type III secretion signal classifier for EffectiveT3. They describe their procedure well, and apply it to over 700 bacterial and archaeal genomes, attempting to classify every predicted gene product in each of those genomes. Their exemplary reporting of the results of the wholesale screening of so many complete genomes with their method is what makes them a good example here.

Their method is reported to have a 71% true positive rate, and a 15% false positive rate. Assuming a 3% base rate of effector occurrence (which is perhaps a little high...), the probability that any individual positive prediction corresponds to a type III effector turns out to be 0.13: roughly 1 in 8. That seems low, but this is borne out in their supplementary data (Table S11), in which they make large numbers of positive predictions for many bacteria that lack a type III secretion system altogether, and so should not possess type III effector sequences at all. We would expect every positive result in these organisms to be a false positive.
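The 0.13 figure can be reproduced with a direct application of Bayes' theorem. This is a minimal sketch using only the rates quoted above; the function name `positive_predictive_value` is an illustrative choice, and the 3% base rate is the assumption stated in the text rather than a measured value.

```python
def positive_predictive_value(tpr, fpr, base_rate):
    """Bayes' theorem: P(true effector | positive prediction)."""
    true_pos = tpr * base_rate          # effectors that are flagged
    false_pos = fpr * (1 - base_rate)   # non-effectors that are flagged
    return true_pos / (true_pos + false_pos)


# Figures quoted above: 71% TPR, 15% FPR, assumed 3% base rate
ppv = positive_predictive_value(tpr=0.71, fpr=0.15, base_rate=0.03)
print(f"P(effector | positive) = {ppv:.2f}")  # about 0.13

# A genome with no type III secretion system has a base rate of zero,
# so every positive prediction there is a false positive
print(positive_predictive_value(tpr=0.71, fpr=0.15, base_rate=0.0))  # 0.0
```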

The message from this example is that, in a screening search that makes a prediction for every gene product in the genome, the sheer volume of false positive sequences swamps out the (otherwise quite accurate) positive signal for real effectors. The authors themselves note that:

The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate[...] and thus raised questions about their nature. Manual inspection of positive predictions in Gram-positive bacteria revealed many cases of wrongly annotated gene starts (having N-terminal elongations and thus contain fractions of the intergenic space) or questionable genes without any homologs in other genomes (ORFans).

Though their analysis goes on (reasonably) to speculate about the role of misannotation, a more prosaic explanation could be the impact of the base rate fallacy when screening a large set of candidates with an expected low occurrence of positives.

### What can we do about it?

It's typically rather difficult to produce classifiers with much better performance than Arnold *et al.*'s method, and the base rate problem exists for all of us when screening genomes, so how can we get around this problem?

One way is to reduce the number of known negative examples we present to the classifier, so reducing the potential for false positives. Effector proteins are unlikely to take part in bacterial core metabolism, so that allows us to remove around 1000-1500 genes with predicted function from the analysis. That could be up to half of a small bacterial genome, raising an expected effector base rate from 3% to 6% almost immediately.
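The effect of this kind of negative filtering on the base rate can be sketched as follows. The genome here is hypothetical (3000 genes, 90 of them effectors, chosen to match the 3% figure), the function name is illustrative, and the sketch assumes the filter removes no true effectors.

```python
def base_rate_after_filtering(n_genes, n_effectors, n_removed):
    """Base rate once n_removed presumed non-effectors are excluded."""
    if n_removed > n_genes - n_effectors:
        raise ValueError("cannot remove more non-effectors than exist")
    # Effector count is unchanged; only the pool of candidates shrinks
    return n_effectors / (n_genes - n_removed)


# Hypothetical small genome: 3000 genes, 90 of them effectors (3%)
n_genes, n_effectors = 3000, 90
for n_removed in (0, 1000, 1500):
    rate = base_rate_after_filtering(n_genes, n_effectors, n_removed)
    print(f"remove {n_removed:4d} genes -> base rate {rate:.1%}")
```

Removing 1500 genes from this hypothetical genome doubles the base rate from 3% to 6%. If the filter did discard some real effectors, the gain would be correspondingly smaller.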

We might not expect effectors to contain transmembrane domains, or to be transporters, or to have other (fairly reliably) annotated biochemical functions that would seem to contradict a role that involves being translocated into the host plant. By excluding these sequences, we can increase the expected base rate further.

We can also apply positive filters. For RxLRs we may exclude from consideration any sequence that does not have a predicted signal peptide (indeed, that was a typical filter in the *P. infestans* RxLR predictions).

In effect, we want to do everything we can to increase the expected base rate of effectors in the set of screened candidates, so that the real positive signal for an effector is not swamped by noise from an overwhelming number of negative examples.
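It can help to turn the question around and ask how high the base rate has to be before a positive prediction is trustworthy. Rearranging Bayes' theorem for the base rate gives the sketch below; the function names and the target probabilities are illustrative choices, and the 71% TPR and 15% FPR are the Arnold *et al.* figures quoted earlier.

```python
def positive_predictive_value(tpr, fpr, base_rate):
    """P(true effector | positive prediction), by Bayes' theorem."""
    true_pos = tpr * base_rate
    false_pos = fpr * (1 - base_rate)
    return true_pos / (true_pos + false_pos)


def base_rate_needed(target_ppv, tpr, fpr):
    """Base rate at which a positive prediction is correct with probability target_ppv.

    Obtained by solving the expression in positive_predictive_value for base_rate.
    """
    return target_ppv * fpr / (tpr * (1 - target_ppv) + target_ppv * fpr)


tpr, fpr = 0.71, 0.15  # rates quoted for the classifier discussed above
for target in (0.5, 0.9):
    needed = base_rate_needed(target, tpr, fpr)
    # Sanity check: plugging the answer back in recovers the target
    assert abs(positive_predictive_value(tpr, fpr, needed) - target) < 1e-9
    print(f"P(effector | positive) = {target:.1f} needs base rate {needed:.1%}")
```

With these rates, better-than-even odds on a positive prediction require the filtered candidate set to contain roughly 17% effectors, which gives a sense of how aggressive the filtering needs to be.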

If we raised the base rate for effectors to 25% with these approaches, we would still expect a classifier that performs as well as the Arnold *et al.* method to give a considerable number of false positives, but far fewer than before, and the probability that a positive result corresponds to a 'real' effector would rise to over 0.6.

### In conclusion

The lesson here is that the odds are not in your favour when you screen the large gene complements of whole genomes with a classifier for sequences that are expected to be a very small minority of the genome. Even if your classification method is quite strong, the sheer number of sequences that do not belong to your sequence class can easily swamp detection of true positive candidates, and corresponding care has to be taken when conducting whole genome screens.
