Thursday 8 September 2011

A Pile of Sheets

In which a niggle with Excel files is easily crushed into nothingness.  If you have a Mac.

Earlier today, I spotted this tweet from Neil Saunders (@neilfws), bemoaning the state in which we often receive data from our experimental colleagues:
Dear every biologist. The colours in your spreadsheet don't show up when I export it to CSV. Thanks.
Very true, and well said, sir!

I appreciate that there are many reasons why people who only have to deal with the spreadsheet itself would want to annotate their data with colour.  It's not just that colours are pretty, and it's nicer to work with pretty things, but colour can provide useful visual clues to the structure of data, and highlight data of particular importance.  We're visual creatures, and respond well to colour.  It's just such a pity that, whenever you want to take data from a spreadsheet and do something useful with it programmatically, you can't carry the colour information or all the extra insight it gives, which is typically implicit rather than explicit, over.

The more annoying issue that the 'cryptic rainbow' annotation format seems never to involve anyone keeping a written record anywhere of what the colours might mean, or attaching an explanation in their emails, is a moan for another day (yes, I have a backlog of them).

But what came up at work almost immediately after reading that tweet was a problem with manipulating Excel spreadsheets programmatically.  Excel spreadsheets have a structure which is essentially hierarchical: they are Workbooks, that contain Worksheets (that contain Rows and Columns, and so on).  When you create a new 'spreadsheet', you're usually creating a new Workbook container, and working on the first Worksheet.  This is convenient when partitioning your information for actually working within Excel.  Not so convenient when the person you're working with separates their many objects of interest into individual Worksheets within a single Workbook, and you want to collect all that information into one place with a script.  Especially as Excel's file format is, to all intents and purposes, opaque to pretty much everything other than Excel or things that are so like Excel as to make no odds.

In this case, I was presented with an Excel spreadsheet (Workbook) that contained a number of Worksheets, each one containing similar information about the 'superscaffolds' that contribute to a chromosome 'pseudomolecule'.  Since this was organised as tables in Excel, I wanted a similar tabular plaintext format containing the same data.  I could have tried saving the Workbook in that format, but you can only save individual Worksheets in that way, and that's a pain.  I would be forced to save every individual Worksheet in a suitable comma- or tab-separated format by hand. I do this a lot, and I'm sick of it. I'm lazy, to boot, so the natural response is obviously to write a script so I never have to do that again.

I normally like to use Python but, since the library interfaces to Office applications require a wee bit of learning and are usually, er... idiosyncratic, I decided to take advantage of being on a Mac, and use the much-unloved AppleScript. For all its unusual approach, it is a very quick and convenient way to get things into and out of (supported) applications on OSX.  A wee bit of Googling, dictionary-checking, and a tiny bit of scripting got me to the point where I could select an Excel spreadsheet, and export each of its component worksheets into a new tab-separated plaintext file, under a new directory with the same name as the source file, in the same location as the source file.  It even uses the OS file dialogue.

# Select Excel workbook via dialogue
set theWorkbookFile to choose file with prompt "Please select an Excel Workbook"

# Open Excel and get useful information
tell application "Microsoft Excel"
  open theWorkbookFile
  set workbookName to name of active workbook
  if workbookName ends with ".xls" then set workbookName to text 1 thru -5 of workbookName
  if workbookName ends with ".xlsx" then set workbookName to text 1 thru -6 of workbookName
end tell

# Create new folder for output
set outputDirectory to (theWorkbookFile as text) & "_extracted"
if outputDirectory ends with ":" then set outputDirectory to text 1 thru -2 of outputDirectory
do shell script "mkdir -p " & quoted form of POSIX path of outputDirectory

# Loop over worksheets and write out in tab-separated format
tell application "Microsoft Excel"
  set theSheets to worksheets of active workbook
  repeat with aSheet in theSheets
    set thisPath to outputDirectory & ":" & workbookName & "_" & name of aSheet & ".tab"
    save aSheet in thisPath as text Mac file format
  end repeat
  close active workbook without saving
end tell

After placing the script under ~/Library/Scripts and making sure that my script menu is visible from the menu bar (in AppleScript Editor's general preferences), I can now convert all those pesky Excel files to tab-separated text with just a couple of mouse-clicks.

But it still doesn't preserve colour...

No comments:

Post a Comment