Friday, 16 March 2012


In which a blog post is brought to you by the letter 'J', but should probably have been 'X'-rated.

I've been spending much of the last year and a bit sequencing and annotating around 25 bacterial genomes.  I may write more about that, if I can keep up this cripplingly ferocious rate of blog posting.  Recently, I hit a problem when annotating some of the conceptual translations from one of these genomes, using a local installation of InterProScan.  The script was throwing a wobbly about one of these sequences, and complaining that ajseqtype.c "failed to fix sequence type".  The error message did tell me which position in the sequence was the problem - just not which sequence it was in.

Running the problematic command-line by itself, followed by a bit of a grep in the sequence file, showed that the issue was the amino-acid single letter code 'J', which had crept into one of the conceptual translations.  After 16 years of staring at sequences for a living, this might have been the first time I'd seen 'J' as an amino-acid code.
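For anyone hitting the same error, here is a minimal sketch of that "bit of a grep" in Python. It assumes a plain FASTA file of protein sequences; the function name `find_residue` and the example records are made up for illustration and are not taken from the original script. Unlike the error message, it reports which sequence the offending residue is in as well as its position.

```python
def find_residue(fasta_lines, residue="J"):
    """Yield (sequence_id, 1-based position) for each occurrence of residue.

    fasta_lines is any iterable of lines, e.g. an open file handle.
    """
    seq_id = None
    offset = 0  # residues already seen in the current record
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first word after '>' as the identifier
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            offset = 0
        else:
            for i, char in enumerate(line):
                if char.upper() == residue:
                    yield seq_id, offset + i + 1
            offset += len(line)


# Hypothetical example records, wrapped over several lines like a real file
example = """>gene_0001 hypothetical protein
MKT
LLJA
>gene_0002
MAIL
"""
print(list(find_residue(example.splitlines())))  # [('gene_0001', 6)]
```

With a real file, passing the open handle works the same way: `with open(path) as handle: hits = list(find_residue(handle))`, where `path` is whatever your translations file is called.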

The conceptual translations came from a script that takes a parent FASTA sequence file and a GFF file describing the features, and translates each feature with the translate() method of Biopython's Seq object.  Translating the same feature using the Artemis package gave an 'X' where Biopython gives a 'J'.  So what was going on?

Everyone's favourite, most reliable source of information about everything ever, Wikipedia, notes that:
"In addition to the specific amino acid codes, placeholders are used in cases where chemical or crystallographic analysis of a peptide or protein cannot conclusively determine the identity of a residue"
and these placeholders include 'J' to indicate an ambiguity between leucine and isoleucine.  But the same table also includes 'X', 'B' and 'Z', which I'm fairly used to, by comparison.  Was this an IUPAC code that had slipped by me?

According to the documentation at QMUL, the one-letter codes were first suggested as long ago as 1958 (by George Gamow!), and
"J was avoided because it is absent from several languages."
But those were the 1983 recommendations (it was also true in 1971, and a more current appendix agrees), and this is the 21st Century, man!  And, besides, EMBL thinks that IUPAC includes the 'J':
"The one-letter and three-letter abbreviation codes for amino acids for example, used in UniProtKB/Swiss-Prot are those adopted by the commission on Biochemical Nomenclature of the IUPAC-IUB and are as follows: [...]
J    Xle    Leucine or Isoleucine"

Not everyone agrees, though: the dissenters include MEGA, UWisc, FAO and NCBI, and also EMBL's own InterProScan software, which was where we came in...

'J' is noted by EMBL explicitly as being used in the context of experimental ambiguity, rather than translational ambiguity - as was suggested in the linked Wikipedia article:

J    Xle    leucine or isoleucine ("J" between "I" and "L", uncertain result of mass-spec)

This was also the case with IUPAC, when discussing (in 1999, so nearly modern...) how best to represent selenocysteine:
"For the one-letter symbol, J and U can be considered but J is used in NMR work as designation for signals assigned either to leucine or to isoleucine which cannot be distinguished from each other. Therefore U remains as the best letter to designate selenocysteine."
This, incidentally, is the online reference cited by DDBJ in support of their use of 'J' as an amino acid ambiguity code.  The journal reference they cite (Eur. J. Biochem. (1984) 138 9-37) doesn't include 'J' either, and appears to be just the published version of the 1983 recommendations linked above.

From my limited reading, then, it seems that we have an ambiguity code 'J', introduced for NMR/crystallography/MS-based (hi, Phil!) experimental ambiguities, that has not been adopted as part of the official IUPAC recommendations.  As a result, it has not been accounted for in some (if not most) software packages, which might well throw an error if you pass them 'J' as input.  Meanwhile, at least two authoritative institutions have decided to adopt 'J' as an amino-acid ambiguity symbol.  Biopython, quite reasonably following DDBJ/EMBL's lead, has tried to be helpful in translating an ambiguous codon as 'J' rather than 'X'.  Because this seemingly hasn't been standardised, though, not everyone else has agreed to parse it appropriately in their own code, hence the downstream problems.

My own solution?  An Emacs find-and-replace, substituting 'X' for 'J'... seemed simplest.  Does anyone else out there have a more authoritative account of the validity of 'J'?
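If you'd rather script the substitution than do it by hand in an editor, the following is a rough sketch in plain Python. The name `mask_residue` is hypothetical, and the sketch assumes a FASTA file in which header lines start with '>'. The one thing it does that a blanket find-and-replace doesn't is leave header lines alone, so a 'J' in a gene name or description survives.

```python
def mask_residue(fasta_lines, residue="J", replacement="X"):
    """Yield FASTA lines with residue swapped for replacement in sequence lines.

    Header lines (starting with '>') are passed through unchanged, and the
    case of each replaced character is preserved.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            yield line
        else:
            line = line.replace(residue.upper(), replacement.upper())
            yield line.replace(residue.lower(), replacement.lower())


# Hypothetical example: the 'J's in the header must not be touched
example = [">geneJ1 J-domain protein", "MKTLLJA"]
for out_line in mask_residue(example):
    print(out_line)
```

Because it is a generator over lines, it can be pointed at an open input file and the results written straight to a new output file, keeping the original translations intact in case the 'J' calls are ever wanted back.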


  1. I've emailed Dr Moss, as the IUPAC contact, for clarification; so far, no reply.

    In the medium term I'd go with asking the EBI to fix InterProScan to tolerate J (just treat it as X if need be), and encouraging IUPAC to formalise J / Xle as leucine or isoleucine.

    In the short term mapping J to X in the FASTA files is the most practical option.

  2. Thanks, Peter. It'll be interesting to hear what response you get from IUPAC.

    Substituting 'X' for 'J' seems like a pragmatic solution all-round to me. Since EBI's new, reimplemented version of InterProScan (v5) is in beta, it might be worth bringing up - if they've not already dealt with it (I was using v4.8). I imagine that, much of the time, there's nothing sensible and easy that can be done with such an ambiguous Leu/Ile call. Where there is (e.g. having some kind of combined score in a substitution matrix) the problem of 'J' likely comes up so rarely as not to be worth anyone's time to implement.

    I think you're right that if IUPAC were to formalise 'J' as an ambiguity symbol, it might put sufficient pressure on new and maintained code to handle it explicitly as input (even if just converting to 'X' internally). That might help ease the pain of this particular gotcha.