Sunday 10 November 2013

ANI are you okay? Are you okay ANI?

In which I describe Average Nucleotide Identity (ANI), which we can use to pigeonhole bacteria into conceptual boxes labelled as 'species'. And I share more code.


Defining bacterial species

Bacterial species status (or any species classification) can be a tricky thing to pin down. The kind of definition that is easy to use and intuitively familiar to most people for animals and plants (essentially, if the members of two populations have sex will there be offspring? If so, they're the same species) is a reproductive definition, but bacteria don't really have sex and reproduce in a way that is familiar to most people. As a result, we need a different kind of definition if we're going to talk about bacterial species at all.
Rua do Instituto Bacteriologico, Lisbon

Happily, there are many other suggestions for how we might define species, though they don't all necessarily agree with each other. Since genetic material, i.e. DNA, is typically more similar between related individuals - those that share a more recent common ancestor - we can use DNA similarity as a measure of 'phylogenetic species'. DNA-DNA hybridisation (DDH) has been used for about half a century as a practical way to measure this similarity. But this is a physical experiment, and only an approximate measure, whereas new DNA sequencing technologies enable us to read the entire genomes of bacteria quickly, and therefore calculate the similarity of two individuals' DNA with much greater accuracy, and often quite cheaply, too. So it is now practical to use computational comparisons of bacterial genome sequences to determine whether they are the same species.

Over the years, some rules of thumb have evolved for the amount of genetic similarity that implies that two bacteria belong to the same species. Though, to be honest, I much prefer the pan-genome concept as an organising principle. For DNA-DNA hybridisation this is often quoted as "70% DNA similarity" (e.g. here, and here). But if we're using bacterial genome sequences, we can use a precise level of DNA similarity, the Average Nucleotide Identity (ANI). A couple of years ago, it was determined that the overall DNA sequence similarity between two genome sequences at the boundaries between previously accepted species is about 95-96%.

But there's more than one way to measure sequence similarity. As usual.

Defining ANI

In their paper "DNA-DNA hybridization values and their relationship to whole-genome sequence similarities." (doi:10.1099/ijs.0.64483-0), Goris et al. introduced ANI as a substitute for DDH when determining prokaryotic species boundaries. But they used a method of determining ANI that struck me as odd. In their implementation, they slice one of the genomes (the query genome) into consecutive 1020nt long fragments. This was to keep some kind of - to me, unnecessary - accordance with the practical application of DDH, which also fragments the query genome into similar lengths (but not in the same way!). Since their method discards fragments with less than 30% sequence identity over 70% of their length, this makes it possible that (small) sections of the genome that match exactly may not contribute to the ANI score. Also, repeat regions in the query genome may count twice towards this estimate. Both of these concerns mean that we're not really measuring the overall level of similarity, therefore a measure of sequence similarity that counts all uniquely matching regions between two genome sequences would seem a little more reasonable.
Figure 1 from Goris et al. describing the observed relationship between DDH and ANIb

The 2009 paper "Shifting the genomic gold standard for the prokaryotic species definition" (doi:10.1073/pnas.0906412106) by Richter and Rosselló-Móra adapts the ANI method using the MUMmer alignment software to do exactly that. They call this version of the algorithm ANIm, distinguishing it from the original Goris et al. method, which they call ANIb (so named because it uses the BLAST package for alignment). They concluded that the two methods were broadly equivalent, especially when sequence similarity was very high. However, ANIm was more stringent in the selection of genome sequence regions for comparison, leading to some differences in classification.
Figure 1 from Richter and Rosselló-Móra (2009), showing agreement between ANIb and ANIm methods
ANIb and ANIm methods were seen to be broadly equivalent, but both approaches still require sequence alignment. This means that a decision, however clear-cut and automatable that might be, has to be made concerning what parts of the genomes are equivalent, and suitable for comparison. It might be better therefore to use an alignment-free method that considers all the genome sequence information, without bias.
Figure 2 from Richter and Rosselló-Móra (2009) showing concordance between DDH and ANIm
Richter and Rosselló-Móra propose such a method in their 2009 paper, based on tetranucleotide frequencies (TETRA). These are useful when attempting to determine whether two sequences may have come from the same organism in a more general sense, and are often employed to help identify contaminant sequence in genome assemblies, and it's a natural extension of that to use it formally for species identification. But the relationship between ANI and TETRA is not straightforward (not least because their technique actually measures correlation between statistical deviations of tetranucleotide frequency, rather than the frequencies themselves), and the results are fuzzy and still need to be interpreted in conjunction with the overall sequence similarity. In particular, a very high TETRA score may be accompanied with an ANIm that falls below the species boundary cutoff.
Figure 3 from Richter and Rosselló-Móra (2009) showing the nonlinear relationship between ANIm and TETRA
I would interpret this as TETRA reflecting a bulk property of the genome, based on overall composition, rather than genetic information preserved through historical evolutionary processes, which is what ANIm and ANIb are detecting.

Calculating ANI

Richter and Rosselló-Móra provide the JSpecies software, which is written in Java, GUI-based, and provides some summary graphics, in addition to ANIb, TETRA, and ANIm output. But I wanted something I could bung into a pipeline, and which would produce the kind of heatmap output that I wanted. So I wrote a script (calculate_ani.py) in Python, which I've made available at GitHub.

To use the script calculate_ani.py, you pass it a directory containing the sequences you want to compare (one FASTA file per isolate), and it returns a plain text tab-separated table of the ANIb/ANIm/TETRA scores - as requested. Optionally, the script will also return a heatmap and dendrogram of the results. It requires Biopython, and local installations of BLAST+ (for ANIb) and MUMmer (for ANIm); for graphical output R and Rpy2 are also required. It uses Python's multiprocessing module to parallelise the BLAST and/or MUMmer searches for input organisms, and speed things up. Output, including intermediate files (for reproducibility), are written to a single output directory. The verbose output is intended to be verbose enough to be useful for reproducible research, and a logfile for output can be specified on the command-line.

For demonstration purposes, we can download the current set of Chlorobium genomes. Setting up a filelist in chlorobium.txt:

These correspond to C.chlorochromatii (NC_007514), C.limicola (NC_010803), C.luteolum (NC_007512), C.phaeobacteroides (NC_010831 and NC_008639), C.phaeovibrioides (NC_009337), and C.tepidum (NC_002932).we can download the set of genomes for this genus with wget:

and then apply the calculate_ani.py script:
On my laptop, for this dataset, ANIb takes about 26s, while ANIm and TETRA each take around 36s of userspace time. Since we've got two members of C.phaeobacteroides, we'd expect these to have ANIb and ANIm scores over 0.95, and a high TETRA score. What do we get?

ANIb results for Chlorobium isolates: scores below 0.95 suggest that all are distinct species.

TETRA results for Chlorobium isolates: the isolates recorded as C.phaeobacteroides and C.limicola show the strongest TETRA scores, but without ANIb/ANIm support, we can't say they're the same species.

ANIm results for Chlorobium isolates: scores below 0.95 suggest that all are distinct species.

Oh dear. Not only do the scores seem to indicate that none of these isolates belong to the same species, we don't even find that the two C.phaeobacteroides isolates group together by anything other than TETRA. I know nothing about Chlorobium phylogenetics, but I'm not hugely surprised, as misnaming of bacterial species in GenBank is known elsewhere (e.g. three of four GenBank records in doi:10.1111/j.1365-3059.2012.02678.x).

It would be nice to have an example that confirms the species relationships assigned in GenBank, so let's try Anaplasma (again, I know nothing about these organisms). Here's the file list:
so there are three A.marginale (NC_022760, NC_012026, NC_004842), and four A.phagocytophilum (NC_021881, NC_021897, NC_007797, NC_021880), with one A.centrale (NC_013532). So we'd expect three groupings, by species.

This looks a bit better than before:

ANIb results for Anaplasma isolates. Despite the A.phagocytophilum/A.marginale+A.centrale split, the scores seem to support A.marginale as a species, but not much else.

ANIm results for Anaplasma isolates. These results support the species classifications as recorded in GenBank.

TETRA results for Anaplasma isolates. Again, the A.marginale+A.centrale/A.phagocytophilum split is supported, but no separation between A.marginale and A.centrale is seen.
we're getting ANIm results that support the GenBank records, though ANIb seems a bit more equivocal - possibly because it runs the risk of discarding some useful genome information, as it's implemented. Again, our results suggest that the TETRA method taken in isolation is likely to lead to false positives, as a high TETRA correlation score can result between isolates that don't have a strong ANI score.

Knock yourselves out

Currently, the script does not offer as many options for modifying the underlying search algorithm parameters as does JSpecies, but the method is fairly stable to those choices, and it would be possible to add these options if needed. Additions and improvements to the code are always welcome. It's all at GitHub, so please fork and edit away, and please send a pull request with your changes, if you have a bash at it.

11 comments:

  1. Hello,
    I am trying to run this script but I am facing some problems and getting an error. Could you provide some help with this?

    Error in hclustfun(distfun(x)) :
    NA/NaN/Inf in foreign function call (arg 11)
    Traceback (most recent call last):
    File "./calculate_ani.py", line 1124, in
    robjects.r(rstr)
    File "/Library/Python/2.7/site-packages/rpy2-2.3.9-py2.7-macosx-10.8-intel.egg/rpy2/robjects/__init__.py", line 240, in __call__
    res = self.eval(p)
    File "/Library/Python/2.7/site-packages/rpy2-2.3.9-py2.7-macosx-10.8-intel.egg/rpy2/robjects/functions.py", line 86, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
    File "/Library/Python/2.7/site-packages/rpy2-2.3.9-py2.7-macosx-10.8-intel.egg/rpy2/robjects/functions.py", line 35, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
    rpy2.rinterface.RRuntimeError: Error in hclustfun(distfun(x)) :
    NA/NaN/Inf in foreign function call (arg 11)

    ReplyDelete
  2. Hi GL,

    Thanks for trying the script. It looks like the issue is with one of the functions in the rp2 module but, from the information you give, I can't tell exactly what is causing the problem.

    You seem to be on OSX 10.8, using Python 2.7, with rpy2 v2.2.3.9 - that much I can tell. But in order to investigate further I will need some example data - as small as possible - that replicates the error, and to know how you ran the script (i.e. what the command-line was).

    It would be useful if you could please report the issue (with the information above) via the GitHub Issues page at https://github.com/widdowquinn/scripts/issues - this is a more convenient route, keeps everyone's issues together in a single location over the history of a project, and doesn't require me to moderate comments before they're public ;)

    Thanks,

    L.

    ReplyDelete
  3. Looking at your error again, my first suspicion is that you've got at least one pair of species returning zero matches, and this is causing a NaN/Inf in R's clustering process, that hclust() doesn't like. Might that be what's going on?

    ReplyDelete
  4. Hi,

    Nice script, I wish I could make it work here :-) I installed everything needed on an Ubuntu VM. Then I ran:

    python calculate_ani.py -o ~/host/Ecoli/ANI/output -i ~/host/Ecoli/ANI -v -f -l ~/host/Ecoli/ANI -g --format png -m ANIm --nucmer_exe /usr/bin

    And got:

    No handlers could be found for logger "calculate_ani.py"

    I double checked that the executables are in /usr/bin. Same error with nucmer/ANIm.

    Actually I got other different errors too, but I can't even reproduce them now. Yesterday, the script seemed to work, but I got empty results (nothing in the results file). This was before I realized that the script used BLAST+ (right?), so I installed it today.

    Could you please help me go through this error?

    ReplyDelete
  5. Hi Dudu,

    Please could you report this issue at https://github.com/widdowquinn/scripts/issues?

    At the moment, I don't have enough information to tell what might be causing your problem. When you report at https://github.com/widdowquinn/scripts/issues could you please provide a full traceback of the errors you see and, ideally, a minimal dataset that reproduces the error?

    Many thanks,

    L.

    ReplyDelete
  6. Hi there!

    Did you ever had the chance to have a look at the easy-to-use Genome-to-Genome Distance Calculator (GGDC; http://ggdc.dsmz.de) which has many advantages over the ANI approach? For example, the most important feature for any genome sequence-based replacement of conventional DDH is that these alternatives (ANI, GGDC, etc.) must have high correlations with the empirical DDH results because procaryotic species delineation is ultimately based on these results for decades now. If not, procaryotic species delineation would quickly become inconsistent. That said, the GGDC has the higest correlations to wet-lab DDH without mimicking its common pitfalls.

    Cheers!

    ReplyDelete
    Replies
    1. Hi Phil,

      Sorry for the delay in replying. Thanks for your post - I wasn't previously aware of GGDC, but it's interesting for a few reasons.

      I don't think that any sequence-based method must agree with DDH. DDH is a bulk measure of hybridisation/affinity as a physical process, not a direct measure of sequence identity, and not a direct measure of evolutionary history. DDH does correlate pretty well with what we infer (from sequence data) of evolutionary history, but as a measurement it is not very precise, and it is still sometimes inconsistent within specific evolutionary groupings. I therefore do not think it serves well as a gold standard for computational approaches using sequence data. I think that assessing the performance of methods against DDH establishes only consistency with DDH, not correctness of the method or the inferred evolutionary organisation of genomes.

      Computational methods that use genome sequence data, and proceed from reasonable assumptions (as ANIm does - more so than ANIb) are capable of much greater resolution of differences between genomes. Especially in conjunction with more detailed analyses of the ordering of similar regions, they offer a far more powerful means of organising genome/organism evolution than DDH.

      As for the definition of bacterial 'species', I'm a bit of a sceptic. I don't think it makes a huge amount of sense as a concept but, as we need to have definitions for organisation and legislation/policy reasons, I'm happy to keep them as an organising principle. That does not presuppose that the DDH-delimited boundaries are correct. Where a sequence-based delineation is more appropriate, I think that a conflicting DDH boundary can be discarded.

      As WGS continues to fall in cost and preparation/analysis time, the bulk of classification is going to be carried out computationally, and there are many possible ways of doing this. The useful methods are good methods; the good methods that are fast, stable, and robust are the best of those. GGDC's and ANIb's reliance on BLAST (whose algorithm - and results - changed between BLAST and BLAST+) makes me less confident in their long-term robustness and stability than I am in ANIm.

      L.

      Delete
  7. Hi,

    Nice script, It's very usefull. I hope that you can fix my issue here. I'm asking if there is a way to extend value shown in the figure and add more color to it? Indeed, I have in my data values above 0.9 but some are below but I can't see them, and all those values turn to be blue.

    Thanks for the help,

    Fety,

    ReplyDelete
    Replies
    1. Hi Fety,

      It's certainly possible to use more/different colours, but you'd have to change the script directly. In the case of that script, using the verbose (-v) option should give you R code that you can modify directly to re-draw the output using the ANI data files you already generated.

      It's worth noting that the script you're using is out of date now, and has been/will be replaced entirely by pyani (https://github.com/widdowquinn/pyani). This module currently allows for direct modification of colour schemes in the pyani_config.py file. If you're calling the module from your own code, you can modify it there, as well. I don't see an easy way to provide complete customisation of colour schemes from the command-line, I'm afraid.

      L.

      Delete
    2. Hi,

      Thank you very much for your help, I started to use pyani and it fix my problem.

      Fety,

      Delete