I've been spending much of the last year and a bit sequencing and annotating around 25 bacterial genomes. I may write more about that, if I can keep up this cripplingly ferocious rate of blog posting. Recently, I hit a problem when annotating some of the conceptual translations from one of these genomes, using a local installation of InterProScan. The seqret.sh script was throwing a wobbly about one of these sequences, and complaining that ajseqtype.c "failed to fix sequence type". The error message did tell me which position in the sequence was the problem - just not which sequence it was in.
Running the problematic command line by itself, followed by a bit of a grep in the sequence file, showed that the issue was the single-letter amino-acid code 'J', which had crept into one of the conceptual translations. After 16 years of staring at sequences for a living, this might have been the first time I'd seen 'J' as an amino-acid code.
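Since the error gave a position but not a sequence name, a small script can do the grep-ing and report both. This is only a sketch: the function name, the example record IDs and, in particular, the `allowed` alphabet are my own assumptions, so adjust the alphabet to whatever your downstream tool actually accepts.

```python
def find_unexpected_residues(fasta_lines, allowed="ACDEFGHIKLMNPQRSTVWYBZXU*"):
    """Yield (record_id, position, letter) for each residue outside `allowed`.

    Positions are 1-based and counted within each record's sequence.
    The default alphabet is an assumption, not any tool's official list.
    """
    allowed_set = set(allowed)
    record_id = None
    position = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: remember its ID and restart the position counter.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            position = 0
            continue
        for letter in line.upper():
            position += 1
            if letter not in allowed_set:
                yield record_id, position, letter


# Hypothetical records, standing in for the real translation file.
example = [">orf_0001 hypothetical protein", "MKTLL", "AJGV", ">orf_0002", "MSTNP"]
for hit in find_unexpected_residues(example):
    print(hit)  # ('orf_0001', 7, 'J')
```

Passing an open file handle instead of the `example` list works the same way, because the function only iterates over lines.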
My conceptual translations came from a script that takes a parent FASTA sequence file and a GFF file describing the features, and translates each feature with the translate() method of Biopython's Seq object. Translating the same feature using the Artemis package gave an 'X' where Biopython gives a 'J'. So what was going on?
Everyone's favourite, most reliable source of information about everything ever, Wikipedia, notes that:
In addition to the specific amino acid codes, placeholders are used in cases where chemical or crystallographic analysis of a peptide or protein cannot conclusively determine the identity of a residue
and these placeholders include 'J' to indicate an ambiguity between leucine and isoleucine. But the same table also includes 'X', 'B' and 'Z', which I'm fairly used to, by comparison. Was this an IUPAC code that had slipped by me?
According to the documentation at QMUL, the one-letter codes were first suggested as long ago as 1958 (by George Gamow!), and
J was avoided because it is absent from several languages.
But those were the 1983 recommendations (it was also true in 1971, and a more current appendix agrees), and this is the 21st Century, man! And, besides, EMBL thinks that IUPAC includes the 'J':
The one-letter and three-letter abbreviation codes for amino acids for example, used in UniProtKB/Swiss-Prot are those adopted by the commission on Biochemical Nomenclature of the IUPAC-IUB and are as follows: [...]
| J | Xle | Leucine or Isoleucine |
Not everyone agrees, though: MEGA, UWisc, FAO and NCBI do not, and neither does EMBL's own InterProScan software, which was where we came in...
'J' is noted by EMBL explicitly as being used in the context of experimental ambiguity, rather than translational ambiguity - as was suggested in the linked Wikipedia article:
| J | Xle | leucine or isoleucine ("J" between "I" and "L", uncertain result of mass-spec) |
This was also the case with IUPAC, when discussing (in 1999, so nearly modern...) how best to represent selenocysteine:
For the one-letter symbol, J and U can be considered but J is used in NMR work as designation for signals assigned either to leucine or to isoleucine which cannot be distinguished from each other. Therefore U remains as the best letter to designate selenocysteine.
This, incidentally, is the online reference cited by DDBJ in support of their use of 'J' as an amino acid ambiguity code. The journal reference (Eur. J. Biochem. (1984) 138 9-37) they use also doesn't include 'J' and appears to be just the published version of the 1983 recommendations linked above.
From my limited reading, then, it seems that we have an ambiguity code 'J', introduced for NMR/crystallography/MS-based (hi, Phil!) experimental ambiguities, that has not been adopted as part of the official IUPAC recommendations. As a result, some (if not most) software packages don't account for it, and might well throw an error if you pass them 'J' as input. Meanwhile, at least two authoritative institutions have decided to adopt 'J' as an amino-acid ambiguity symbol. Biopython, quite reasonably following DDBJ/EMBL's lead, has tried to be helpful by translating an ambiguous codon as 'J' rather than 'X'. But because this seemingly hasn't been standardised, not everyone else parses it in their own code, hence the downstream problems.
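To make the 'J' versus 'X' difference concrete, here is a minimal sketch of how an ambiguous codon can end up as either letter. It is not Biopython's or Artemis's actual code. The function name and the `allow_j` switch are my own inventions for illustration, and it assumes the standard genetic code and the usual IUPAC nucleotide ambiguity codes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Two-residue ambiguity codes; anything else falls back to 'X'.
PROTEIN_AMBIGUITY = {
    frozenset("DN"): "B",
    frozenset("EQ"): "Z",
    frozenset("IL"): "J",
}


def translate_codon(codon, allow_j=True):
    """Translate one (possibly ambiguous) codon to a single-letter code.

    `allow_j` is a hypothetical switch: with it off, a leucine/isoleucine
    ambiguity is reported as 'X' instead of 'J'.
    """
    # Expand the codon into every unambiguous codon it could represent.
    options = [IUPAC_DNA[base] for base in codon.upper()]
    residues = frozenset(CODON_TABLE["".join(c)] for c in product(*options))
    if len(residues) == 1:
        return next(iter(residues))
    code = PROTEIN_AMBIGUITY.get(residues, "X")
    if code == "J" and not allow_j:
        return "X"
    return code


print(translate_codon("MTA"))                 # J  (ATA = I, CTA = L)
print(translate_codon("MTA", allow_j=False))  # X
```

The codon "MTA" is just one example: 'M' means A or C, so it could be ATA (isoleucine) or CTA (leucine), which is exactly the pair that 'J' stands for.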
My own solution? An Emacs find-and-replace, swapping each 'J' for an 'X'... seemed simplest. Does anyone else out there have a more authoritative account of the validity of 'J'?
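For anyone who would rather script the same replacement than do it in an editor, here is a rough sketch that only touches sequence lines, so a 'J' in a FASTA header survives. The function name and the example record are made up for illustration, and note that the swap throws away the leucine/isoleucine information that 'J' carried.

```python
def mask_j(fasta_lines):
    """Yield FASTA lines with 'J' replaced by 'X' in sequence lines only."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Header line: pass it through unchanged.
            yield line
        else:
            yield line.replace("J", "X").replace("j", "x")


# Hypothetical record whose header also happens to contain a 'J'.
example = [">geneJ conceptual translation\n", "MKTLLAJGV\n"]
print("".join(mask_j(example)), end="")
```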