Monday 21 January 2013

KEGGWatch, part I

In which I attempt to visualise metabolic maps for comparative genomics, and lead up to making a contribution to Biopython.

KEGG - the Kyoto Encyclopedia of Genes and Genomes - has been around for almost two decades now.  A prescient and visionary repository of metabolic pathways and other biochemical data, it has been a go-to resource for me for nearly 15 years.  Unfortunately, and symptomatic of much of academia, after 16 years of free, open access, FTP access was transferred to a subscription model in mid-2011 (in case you're interested, the yearly subscription for a single user is just shy of my allocated yearly consumables allowance; I don't have a subscription). However, KEGG still offers a wide range of useful web services (intentionally limited, to avoid abuse) that give access to the data in various ways, including functional annotation via KAAS, and the generation of pathway maps that represent the combined presence of enzymes corresponding to each reaction - a kind of pan-metabolome - across a range of organisms.

KEGG map 01100 (metabolic pathways) for Dickeya, produced by KEGG.
KEGG will give you, for example, a metabolic map showing only those KOs (KEGG orthologues) that have annotated representatives in a set of organisms.  You can be pretty wide-ranging with this, and the maps are pretty (I really like the large-scale metabolic maps, and the sections of metabolism with neat, figurative drawings) and informative.  But I wanted more flexibility: KEGG gives you green lines on the metabolic maps, and presence/absence information; I want to be able to modify line width and colour, and relabel sections, and colour according to arbitrary values, like with MapMan.

So, there are a couple of issues (not including my needy spoilt-child demands): the KEGG web interface is never going to do exactly what I want, and nor is it going to do so for every KEGG map at once, when I run another comparative genomic analysis - they don't exist to cater for my whims; also, if I want to generate these images locally, the raw data from the FTP site is $2000 away. Ideally, what I want is a local (maybe programmatic) package that can grab KEGG maps and the associated data on-the-fly, and render publication-quality images with arbitrary data overlays representing the many forms and sources of data I haven't even thought of, yet. That's not a lot to ask, surely…

There are lots of good KEGG visualisation tools out there, including Cytoscape's KGML plug-in kgmlreader, KGML-Ed, and KEGGtranslator.  There are others, too: for example, the gorgeous iPath, which is rather limited when it comes to analysing your own data; the webservice MicroarrayDB, which is nearly there, but not as flexible as I'd like; and KEGG-anim which, again, is lovely but not what I'm after.  They're all good at what they do, if not entirely consistent with each other, or with KEGG. kgmlreader in particular is nice for the ability to mouseover detailed information about a pathway, and for being able to exploit the power of Cytoscape to edit and render beautiful images.  But they're not quite what I was after.  For example, kgmlreader's representation of the large-scale metabolic pathways isn't as aesthetically pleasing as I'd like (yeah, I know, "get over it"): you have to zoom in quite far, losing context, before you even see any labels; more importantly, all connections are straight lines, so you lose all that lovely layout information that's in the original KEGG map.

KEGG map ko01110, rendered in Cytoscape with kgmlreader plug-in
KEGGtranslator has other issues - it's worked fine for me before, but today it appears to be a bit flaky, refusing to render ko01110, and taking an unfeasibly long time to render ko00020:
KEGG map ko00020, rendered with KEGGtranslator
KGML-Ed annoys me by having a Java WebStart mode of operation (downloads for Windows and OSX are referred to in the tutorial, but I didn't find them), and asking for (even though it does not insist on) licensed access to the KEGG database.  It renders ko00020 as:

KEGG map ko00020, rendered with KGML-Ed
It appears to be a rather powerful KGML editing tool, with neat output - my favourite of the three packages.

Cytoscape/kgmlreader renders ko00020 as:

KEGG map ko00020, rendered in Cytoscape with kgmlreader plug-in
which is all very nice, and useful. But none have the flow of the corresponding KEGG layout:

KEGG layout of ko00020
and there is a very good reason for this: the KEGG maps are intelligently manually drawn.

That shouldn't be a barrier to reproduction (architectural/mapping software has been translating hand-drawings into transferable data for years), but KEGG provides the pathway information as the KGML dialect of XML (spec here) and, for these pathways, doesn't provide sufficient information to reconstruct their pretty layout.  For ko00020, the graphics elements - which carry rendering information such as location, shape and colour - are restricted to the ortholog, gene, compound and map elements: these are the circles, rectangles, and rounded rectangles in each figure.  The connections between those elements must be inferred from reaction and relation elements in the KGML file, and don't contain any graphical information.  This leads, sensibly, to the 'every connection is a straight line' philosophy in these three rendering packages.  However, the elegant manual solutions to the problem of laying out complex networks found by the KEGG team are not available, so there are multiple line-crossings, and much potential for confusion.
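To make that concrete, here is a minimal sketch of the straight-line approach, using only Python's standard library. The KGML fragment is hand-written for illustration (it is not real KEGG output, and the coordinates are made up), and the function names entry_positions and straight_line_edges are my own hypothetical helpers, not part of any of the packages above. It shows that the only coordinates available are those on each entry's graphics element, so the best a renderer can do for a relation is join the two entry centres.

```python
import xml.etree.ElementTree as ET

# Hand-written KGML-style fragment for illustration only (not real KEGG output).
SAMPLE_KGML = """\
<pathway name="path:ko00020" org="ko" number="00020">
  <entry id="1" name="ko:K01647" type="ortholog">
    <graphics name="K01647" type="rectangle" x="100" y="200" width="46" height="17"/>
  </entry>
  <entry id="2" name="cpd:C00158" type="compound">
    <graphics name="C00158" type="circle" x="160" y="200" width="8" height="8"/>
  </entry>
  <entry id="3" name="path:ko00010" type="map">
    <graphics name="Glycolysis" type="roundrectangle" x="300" y="50" width="120" height="34"/>
  </entry>
  <relation entry1="1" entry2="3" type="maplink">
    <subtype name="compound" value="2"/>
  </relation>
</pathway>
"""


def entry_positions(root):
    """Map each entry id to (x, y, shape), taken from its graphics element."""
    positions = {}
    for entry in root.findall("entry"):
        graphics = entry.find("graphics")
        # Skip entries that carry no usable coordinates.
        if graphics is None or graphics.get("x") is None:
            continue
        positions[entry.get("id")] = (
            float(graphics.get("x")),
            float(graphics.get("y")),
            graphics.get("type"),
        )
    return positions


def straight_line_edges(root):
    """Return (relation type, start, end) for each relation.

    Relations have no graphics of their own, so the end points are
    simply the centres of the two entries they connect.
    """
    positions = entry_positions(root)
    edges = []
    for relation in root.findall("relation"):
        a, b = relation.get("entry1"), relation.get("entry2")
        if a in positions and b in positions:
            edges.append((relation.get("type"), positions[a][:2], positions[b][:2]))
    return edges


root = ET.fromstring(SAMPLE_KGML)
for rel_type, start, end in straight_line_edges(root):
    print(rel_type, start, end)
```

Anything cleverer than a straight line between those two points (curves, right-angle routing, avoiding crossings) has to come from somewhere other than the KGML.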

Also, having been through this process myself now, I can see issues with the renderings.  For example, Cytoscape/kgmlreader and KGML-Ed have correctly rendered the distinction between maplink (dotted line) and ECrel (continuous line) relation elements, where KEGGtranslator has not.  Also, something has gone wrong with C00074 (phosphoenolpyruvate) and its connection to glycolysis, and with C00068 (ThPP): both have disappeared (amongst other elements) from the KEGGtranslator rendering.
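The maplink/ECrel distinction is easy enough to carry through to a renderer, since it is just the type attribute on each relation element. A rough sketch, again with a hand-written (not real) KGML fragment, and with a hypothetical LINE_STYLES table and relation_styles helper of my own invention; the fallback to a continuous line for other relation types is an assumption on my part, not something KEGG specifies:

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only (not real KEGG output).
SAMPLE_RELATIONS = """\
<pathway name="path:ko00020" org="ko" number="00020">
  <relation entry1="1" entry2="2" type="ECrel">
    <subtype name="compound" value="5"/>
  </relation>
  <relation entry1="1" entry2="3" type="maplink">
    <subtype name="compound" value="6"/>
  </relation>
</pathway>
"""

# Relation type -> line style, following the convention described above.
LINE_STYLES = {"maplink": "dotted", "ECrel": "solid"}


def relation_styles(kgml_text, default="solid"):
    """Return (entry1, entry2, relation type, line style) for each relation."""
    root = ET.fromstring(kgml_text)
    styles = []
    for relation in root.findall("relation"):
        rel_type = relation.get("type")
        # Types not in the table fall back to the default (an assumption).
        style = LINE_STYLES.get(rel_type, default)
        styles.append((relation.get("entry1"), relation.get("entry2"), rel_type, style))
    return styles


for row in relation_styles(SAMPLE_RELATIONS):
    print(row)
```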

Now, KEGG obviously uses software to render these maps.  This is, as far as I can tell, KegSketch - described at various points on the web as 'in-house software'.  I infer from references to it that each map is manually drawn.  I have, so far, been unable to obtain either a copy of the software or the co-ordinates for the connecting lines - though I've not emailed and asked directly, yet.

The point about retaining the very elegant KEGG representations, while having some programmatic or other editorial control over data presentation, is made well when considering any of the pathway maps that contain figurative drawings, such as ko02040: flagellar assembly.
KEGG layout of ko02040
Looks great.  Now for the KEGG rendering packages:

KEGGtranslator layout of ko02040
Cytoscape/kgmlreader layout of ko02040
KGML-Ed layout of ko02040
None of the packages do well, here.  I wanted something that would give me output like this:
Prototype Python KGML library output
so I wrote something - and, indeed, that's its output, right there - which I'll describe in the next post.
