Friday 8 February 2013

Surely this has been done already...

In which it turns out that if it was done already, it was hiding somewhere. Also, I share a script that retrieves the corresponding nucleotide coding sequences from NCBI, given only a set of protein sequences.


I often have very specific little problems that seem, at least initially, so simple that they must already have been solved and wrapped up in a nice library with informative error messages (like Biopython), or as a script on GitHub, or on a site somewhere. Recently though, I drew a blank with something I thought was such a common problem it must have been done already. In an undergraduate exercise, if nowhere else.

Given a protein sequence, find the corresponding nucleotide coding sequence in NCBI 

(tl;dr - my script is available here, on GitHub)

I'm not so daft that I fail to realise that not all protein sequences in any given database are certain to have a known coding sequence. Nor am I the only person who wants to know this sort of thing, judging from other people's enquiries on BioStar/StackExchange. But judging from the responses they received, there aren't many people with instant answers or a script lying around.

It turned out to be a decent example of how straightforward little questions don't always have straightforward little answers, and why scripting is an essential component of any bioinformatician's armoury. So I thought I'd write up what I did, here. Complete with insight into my fevered thought processes.

About a year ago I was asked by one of our wet biologists to look into a sequence family from plants, the NAC transcription factors. This family is characterised by the presence of a NAM domain, and it's very easy to find all the protein sequences in the NCBI database using the Pfam domain model. Once I'd identified the protein sequences of interest, I intended to reconstruct a phylogenetic tree of NAM domains for some of the family members. For this I would need the nucleotide sequences (the idea being that we can align the protein sequences to take advantage of conservation at the 'functional level', then backtranslate using the coding sequence).

My starting point was a set of sequences from NCBI, with GI, accession and version numbers (and additional region notation, in a bastardisation of Stockholm format, indicating the start and end points of the matched domain, as gi|[0-9]*|ref|.*|/[0-9]*-[0-9]* - parsing this is sketched after the steps below). Given the GI number (or NCBI accession) it's normally easy enough to find the coding sequence, where it exists, using the NCBI website:
  1. Put your accession number into the search field (e.g. 10697197).
  2. See that it has one or more matches in the protein databases, and follow the link.
  3. Note that the GenBank record has a CDS feature, with the accession number for a gene, and a coded_by qualifier. We can put that in the search field too, to get the corresponding nucleotide sequence.
Many times this gets us to the coding sequence directly, and we can grab the FASTA sequence and extract our region of interest using any number of tools. But here we have a whole plant chromosome sequence, so we have to extract the sequence corresponding to the CDS feature, and then extract the part of that coding sequence that corresponds to the region of the protein we're interested in. Also, we really ought to produce a conceptual translation of what we find, as a check, just to be sure we haven't made a complete mess of finding the sequence.
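As an aside, the region notation in those headers is easy enough to pull apart with a regular expression. Here's a minimal sketch (the pattern and function name are mine, not from the script):

import re

# Pattern for the bastardised Stockholm-style headers described above,
# e.g. gi|10697197|dbj|BAB16328.1|/23-155 (the db tag varies: ref, dbj, ...)
HEADER = re.compile(r"gi\|(\d+)\|\w+\|([^|]+)\|/(\d+)-(\d+)")

def parse_header(header):
    """Return (gi, accession, start, end) parsed from a region-annotated header."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError("Could not parse header: %r" % header)
    gi, acc, start, end = match.groups()
    return gi, acc, int(start), int(end)

print(parse_header("gi|10697197|dbj|BAB16328.1|/23-155"))
# -> ('10697197', 'BAB16328.1', 23, 155)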

Even if I only had one sequence to process I'd find this a bit of a drag and wish for an easier way. But for the NAC family I had 2000 or so of these to do, so pointing and clicking like that simply would not fly. Which is, obviously, where scripting comes in.

Happily, NCBI provides the Entrez E-utilities, which give us a way to query across the databases programmatically. My weapon of choice is Python, so the Biopython Bio.Entrez module is the natural solution here. The E-utilities include tools to identify nucleotide sequences that code for our protein of interest, and to retrieve those sequences. Since we're scripting, we can use the information we have about the location of interest to extract the appropriate section of the parent sequence, and write it out to a FASTA file directly.

Before starting, I had to make sure the Entrez DTD was in place locally:
$ wget http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd
$ mv eLink_101123.dtd ~/.biopython/Bio/Entrez/DTDs

Then we can grab our coding sequence using (Bio)Python and ELink from the E-utilities, which lets us make connections between Entrez databases. We need to supply an email address, as this is how NCBI shouts at you if you make too many requests and/or choke their bandwidth.
In [1]: from Bio import Entrez
In [2]: Entrez.email="your.email@your.domain"
In [3]: nt_accessions = Entrez.read(Entrez.elink(dbfrom="protein", id="10697197", linkname="protein_nuccore"))
In [4]: nt_accessions
Out[4]: [{u'LinkSetDb': [{u'DbTo': 'nuccore', u'Link': [{u'Id': '55417890'}, {u'Id': '10697187'}], u'LinkName': 'protein_nuccore'}], u'DbFrom': 'protein', u'IdList': ['10697197'], u'LinkSetDbHistory': [], u'ERROR': []}]

And this is where the first complication arises in this example, as NCBI records two different coding sequences for this protein: rice chromosome 1, and a PAC clone from rice chromosome 1. Depending on the query, NCBI might return a chromosome, a PAC/BAC clone, an exact coding sequence, or something else - or any combination of these, and more. Assuming that all the returned sequences code exactly for the protein, we could just take the first entry - but this might be very large (like a plant chromosome), and it might also code for other protein sequences in our set. There's not much point in annoying NCBI by repeatedly downloading the same large sequence, so we'll want to cache entries locally (and probably to disk, at that - there's always a chance the script will die, and we'd be as well getting data from a local cache on the next run). But if there's a choice of coding sequences, why not just choose the smallest one? Unfortunately, that information doesn't get returned by ELink, so we have to download the records themselves to find out which it is.
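As an illustration of the sort of on-disk caching I mean, here's a minimal sketch using Python's shelve module (the actual script's caching works differently; the function name and cache filename are mine):

# Sketch only: cache ELink results on disk so that re-runs (or a script
# that died part-way through) don't repeat queries to NCBI.
import shelve
from Bio import Entrez

Entrez.email = "your.email@your.domain"

def get_nt_links(protein_id, cachefile="elink_cache.db"):
    """Return cached ELink results for a protein ID, querying NCBI on a miss."""
    cache = shelve.open(cachefile)
    try:
        if protein_id not in cache:
            cache[protein_id] = Entrez.read(
                Entrez.elink(dbfrom="protein", id=protein_id,
                             linkname="protein_nuccore"))
        return cache[protein_id]
    finally:
        cache.close()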

If we had to download the complete records for every sequence returned by ELink, we'd be pulling down a lot of information we don't need. Happily, Entrez permits downloading GenBank entries in several formats, so we can choose one of the shorter formats that still contains the sequence length:
In [5]: [link["Id"] for link in nt_accessions[0]['LinkSetDb'][0]['Link']]
Out[5]: ['55417890', '10697187']
In [6]: nt_records = [Entrez.efetch(db='nucleotide',id=acc,rettype='gb',retmode='text').read() for acc in [link["Id"] for link in nt_accessions[0]['LinkSetDb'][0]['Link']]]
In [7]: len(nt_records)
Out[7]: 2

At this point, we've obtained plain text versions of the (short) GenBank records for each of the two sequences, which we could cache to disk for future use. But Biopython's Entrez tools return handles that we can parse on the fly, which is handy at an interactive prompt.
In [8]: nt_records = [Entrez.read(Entrez.efetch(db='nucleotide',id=acc,rettype='gb',retmode='xml')) for acc in [link["Id"] for link in nt_accessions[0]['LinkSetDb'][0]['Link']]]
In [9]: for record in nt_records:
   ....:     print record[0]['GBSeq_primary-accession'], record[0]['GBSeq_other-seqids'], record[0]['GBSeq_length']
   ....:     
BA000010 ['dbj|BA000010.8|', 'gi|55417890'] 43342410
AP002818 ['dbj|AP002818.2|', 'gi|10697187'] 144644

We can see immediately that AP002818 - the PAC clone sequence - is the shorter of the two, so if we want to minimise network transfer we can download this sequence rather than the whole chromosome. This time we want the full record with the sequence attached, so we choose the 'gbwithparts' return type, and use Biopython's excellent SeqIO module to read the returned text:
In [10]: from Bio import SeqIO
In [11]: record = SeqIO.read(Entrez.efetch(db='nucleotide',id='10697187',rettype='gbwithparts',retmode='text'), 'gb')
In [12]: record.id
Out[12]: 'AP002818.2'
In [13]: len(record)
Out[13]: 144644

A couple of checks show that we've downloaded the intended sequence, and we can now extract our coding sequence. As ever, Python (like many other languages) uses 0-offset numbering, while the location of our motif was given in the 1-offset numbering most humans use when counting. In our case, we're interested in residues 23-155 (or [22:155] in Python numbering). But first, we have to find our feature of interest. There are 109 features in the record we've just downloaded, 26 of which are CDS features.
In [14]: len(record.features)
Out[14]: 109
In [15]: len([f for f in record.features if f.type == 'CDS'])
Out[15]: 26

We need to be able to identify the one that codes for our protein. Happily, each CDS has feature qualifiers, one of which (db_xref) contains database cross-references for the CDS. In a well-ordered world, one of those cross-references will contain an identifier for our original query protein.
In [16]: for f in record.features:
   ....:     if 'CDS' == f.type:
   ....:         print f.qualifiers['protein_id'], f.qualifiers['db_xref']
   ....:         
['BAB16319.1'] ['GI:10697188']
['BAB16320.1'] ['GI:10697189']
['BAB16321.1'] ['GI:10697190']
['BAB16322.1'] ['GI:10697191']
['BAB16323.1'] ['GI:10697192']
['BAD44829.1'] ['GI:52075659']
['BAD44830.1'] ['GI:52075660']
['BAD44831.1'] ['GI:52075661']
['BAD44833.1'] ['GI:52075663']
['BAD44832.1'] ['GI:52075662']
['BAD44834.1'] ['GI:52075664']
['BAB16328.1'] ['GI:10697197']
['BAD44835.1'] ['GI:52075665']
['BAD44836.1'] ['GI:52075666']
['BAB16331.1'] ['GI:10697200']
['BAB16332.1'] ['GI:10697201']
['BAD44837.1'] ['GI:52075667']
['BAB16335.1'] ['GI:10697204']
['BAD44838.1'] ['GI:52075668']
['BAD44839.1'] ['GI:52075669']
['BAD44840.1'] ['GI:52075670']
['BAD44841.1'] ['GI:52075671']
['BAB16338.1'] ['GI:10697207']
['BAD44842.1'] ['GI:52075672']
['BAD44843.1'] ['GI:52075673']
['BAD44844.1'] ['GI:52075674']

We see that this is indeed the case, so we can pull the feature out directly:
In [17]: feature = [f for f in record.features if 'CDS' == f.type and 'GI:10697197' in f.qualifiers['db_xref']][0]
In [18]: feature
Out[18]: SeqFeature(FeatureLocation(ExactPosition(56881),ExactPosition(58015)), type='CDS', strand=1)
In [19]: print feature
type: CDS
location: [56881:58015]
strand: 1
qualifiers: 
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:10697197']
    Key: gene, Value: ['P0436E04.14']
    Key: note, Value: ['contains EST(s): AU032425(R4069),AU082730(R4069)']
    Key: product, Value: ['putative OsNAC5 protein']
    Key: protein_id, Value: ['BAB16328.1']
    Key: translation, Value: ['MTIDLQLPAAACGDHHTAAGAGLPPGFRFHPTDEELLLHYLGKRAAAAPCPAPVIAEVDIYKYNPWELPAMAVFGESDGEWYFFSPRDRKYPNGVRPNRAAGSGYWKATGTDKPISISETQQTVLLGVKKALVFYRGRPPKGTKTSWIMHEYRLANAAASSSSSYTSNMKQLASSSSSSSSSASMRVRTCYGASANYIYIYPLPYTPSTDRYIYMRNYIHARTQLDEWVLCRIYKKKEANQQLQHYIDMMMDDDNDDEHNLQVQQQQQQQAQSHRMPRPPSISDYLLDYSDDLPPSTDQTPSLHLGFTAVNEGNNKRHKTMEEYYSISISTADMLHASSSTSNNKSTQINFSSIFEPQTPAAAGHQLMSSHNDDTSI']

But, as you'll notice, this doesn't give us the nucleotide sequence. It so happens that this sequence has a simple location, with unambiguous start and end points on the forward strand, so we could simply pull out the nucleotide sequence from our record. But many features aren't so straightforward, and comprise two or more subfeatures, which may be in more than one orientation, and lie in any order. These would be more difficult to convert back to the coding sequence, but Biopython provides the extract() method, which turns the whole process into a one-line operation, if we specify the parent nucleotide sequence.
In [20]: feature_nt = feature.extract(record.seq)
In [21]: feature_nt
Out[21]: Seq('ATGACCATCGACCTGCAGCTTCCGGCGGCGGCGTGCGGTGATCATCACACAGCA...TGA', IUPACAmbiguousDNA())

And, from here, it's a simple process to grab the subsequence corresponding to our motif (multiplying by three to convert residue coordinates to nucleotide coordinates, and correcting for the 0-offset numbering):
In [22]: motif_nt = feature_nt[3*(23-1):3*155]
In [23]: motif_nt
Out[23]: Seq('CTGCCGCCGGGATTCCGGTTCCACCCGACGGACGAGGAGCTCCTCCTCCACTAC...GCC', IUPACAmbiguousDNA())
In [24]: motif_nt.translate()
Out[24]: Seq('LPPGFRFHPTDEELLLHYLGKRAAAAPCPAPVIAEVDIYKYNPWELPAMAVFGE...RLA', ExtendedIUPACProtein())

and this corresponds to our motif sequence.
>gi|10697197|dbj|BAB16328.1|/23-155 AC gi|10697197|dbj|BAB16328.1| Match to NAM domain (Pfam:PF02365) region:23..155
LPPGFRFHPTDEELLLHYLGKRAAAAPCPAPVIAEVDIYKYNPWELPAMAVFGESDGEWY
FFSPRDRKYPNGVRPNRAAGSGYWKATGTDKPISISETQQTVLLGVKKALVFYRGRPPKG
TKTSWIMHEYRLA
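In a script, we'd automate that comparison rather than eyeball it. Continuing the session above, a minimal check might look like this (with the motif sequence pasted in from the FASTA record):

# Check sketch: the conceptual translation of the extracted nucleotides
# should exactly match the protein motif we queried with.
motif_protein = ("LPPGFRFHPTDEELLLHYLGKRAAAAPCPAPVIAEVDIYKYNPWELPAMAVFGESDGEWY"
                 "FFSPRDRKYPNGVRPNRAAGSGYWKATGTDKPISISETQQTVLLGVKKALVFYRGRPPKG"
                 "TKTSWIMHEYRLA")
assert str(motif_nt.translate()) == motif_protein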

So, the process for a script might run like this...

  1. Read in the query protein sequences you want CDS for.
  2. Loop over them, and find possible CDS in NCBI by checking a local cache, or using ELink (caching to disk); if there is more than one possibility, check the local cache or download the headers (caching to disk) and choose the shortest.
  3. For each available CDS, check the local cache or download the corresponding full GenBank record (caching to disk), and identify the feature that corresponds to the query protein.
  4. Obtain the nucleotide sequence for that feature, and trim to the motif site (checking the conceptual translation to be sure that we've got the right sequence).
  5. Write the sequence to a file.
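In outline, and leaving the caching and error handling aside, steps 2 to 4 might be sketched like this (the helper names are mine, not the actual script's API):

# Sketch of steps 2-4, without caching or error handling.
from Bio import Entrez, SeqIO

Entrez.email = "your.email@your.domain"

def shortest_coding_id(protein_gi):
    """Return the ID of the shortest nucleotide record linked to a protein (step 2)."""
    links = Entrez.read(Entrez.elink(dbfrom="protein", id=protein_gi,
                                     linkname="protein_nuccore"))
    ids = [link["Id"] for link in links[0]["LinkSetDb"][0]["Link"]]
    # Short-format XML records are enough to learn each candidate's length
    lengths = [int(Entrez.read(Entrez.efetch(db="nucleotide", id=i, rettype="gb",
                                             retmode="xml"))[0]["GBSeq_length"])
               for i in ids]
    return min(zip(lengths, ids))[1]

def motif_cds(protein_gi, start, end):
    """Extract the nucleotides coding for residues start..end (steps 3 and 4)."""
    nt_id = shortest_coding_id(protein_gi)
    record = SeqIO.read(Entrez.efetch(db="nucleotide", id=nt_id,
                                      rettype="gbwithparts", retmode="text"), "gb")
    feature = [f for f in record.features if f.type == "CDS" and
               "GI:%s" % protein_gi in f.qualifiers.get("db_xref", [])][0]
    return feature.extract(record.seq)[3 * (start - 1):3 * end]

so that, for example, motif_cds('10697197', 23, 155) recovers the sequence we extracted by hand above.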
But Entrez allows for batch querying by means of the Entrez History Server, using EPost. So, if we've got a large number of queries, we can upload their identifiers in one go (or in batches), run the remote searches all at once, and then download all the results together. NCBI recommends a batch size of around 500 identifiers. We can test this out with our potential coding sequences from earlier:

In [25]: id_list = ['55417890', '10697187']
In [26]: history = Entrez.read(Entrez.epost("nucleotide", id=','.join(id_list)))
In [27]: history
Out[27]: {u'QueryKey': '1', u'WebEnv': 'NCID_1_79616895_130.14.22.101_9001_1334680917_437807442'}

Entrez gives us two things: a QueryKey, which is essentially your history search number; and a WebEnv, which is very much like a cookie. We can use these to carry out a search and get the results all in one go:
In [28]: history_seqs = list(SeqIO.parse(Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", webenv=history['WebEnv'], query_key=history['QueryKey']), 'fasta'))
In [29]: history_seqs
Out[29]: 
[SeqRecord(seq=Seq('ACGACGCAGCTATGGCCTCCCCGCCCACCAGGCCGCCAGCCACAACCAGCTGAC...AAT', SingleLetterAlphabet()), id='gi|55417890|dbj|BA000010.8|', name='gi|55417890|dbj|BA000010.8|', description='gi|55417890|dbj|BA000010.8| Oryza sativa Japonica Group genomic DNA, chromosome 1, complete sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('TCCCAAAACAATGTGTCTATGGTCTTCCGAATTCCTAGTCTCAGCATTGTGCAC...CTT', SingleLetterAlphabet()), id='gi|10697187|dbj|AP002818.2|', name='gi|10697187|dbj|AP002818.2|', description='gi|10697187|dbj|AP002818.2| Oryza sativa Japonica Group genomic DNA, chromosome 1, PAC clone:P0436E04', dbxrefs=[])]

So it seems that we can add optional batching to our script. If we're also caching locally, we'd want to check the input accessions against the current caches at each stage, and only submit to Entrez those queries for which we don't already have local data.
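A sketch of what that batching might look like, with the cache check first (the batch size follows NCBI's recommendation; the function name and arguments are illustrative):

# Batching sketch: EPost uncached identifiers in chunks of ~500, then
# EFetch each batch of sequences back from the history server.
from Bio import Entrez, SeqIO

def fetch_fasta_in_batches(id_list, cached_ids, batch_size=500):
    records = []
    todo = [i for i in id_list if i not in cached_ids]   # skip cached entries
    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        history = Entrez.read(Entrez.epost("nucleotide", id=",".join(batch)))
        handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",
                               webenv=history["WebEnv"],
                               query_key=history["QueryKey"])
        records.extend(SeqIO.parse(handle, "fasta"))
    return records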

Putting this all together into a script with command-line options, some use of the logging module to handle messaging (or optional logging to a suitable destination file/rsyslog, if you're ambitious), and retries when network errors occur, we have this at github: https://github.com/widdowquinn/scripts/blob/master/bioinformatics/get_NCBI_cds_from_protein.py. I hope it's as useful to someone else as it was to me.
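For a flavour of the retry handling, here's a minimal sketch of the kind of wrapper I mean (the script's actual error handling differs; the names and parameters are illustrative):

# Retry sketch: re-attempt a flaky network call a few times before giving up.
import logging
import time

def with_retries(func, tries=3, delay=5):
    """Call func(), retrying with a fixed delay on failure."""
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception as err:
            logging.warning("Attempt %d/%d failed: %s", attempt, tries, err)
            if attempt == tries:
                raise
            time.sleep(delay)

# Usage, e.g.:
# links = with_retries(lambda: Entrez.read(
#     Entrez.elink(dbfrom="protein", id="10697197",
#                  linkname="protein_nuccore")))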

Comments and criticisms to the usual place, below…
