Data Munging for Non-Programming Biologists

Data Munging for Non-Programming Biologists
by Amir Karger, Eitan Rubin |

Splicing the Scriptome

As it happens, the story I told you about Neeraj earlier wasn't entirely accurate. He actually wanted to print both the beginning and ending substrings from his sequences. Also, his input was in the common FASTA format, where each sequence has an ID line like >A2352334 followed by a variable number of lines with DNA letters. We don't have one tool that parses FASTA and takes out two different substrings; writing every possible combination of tools would take even longer than writing Perl 6 (ahem). Instead, again following UNIX, we leave it up to the biologist to combine the tools into problem-specific solutions. In this case, that solution would involve using the FASTA-to-table converter, followed by a tool to pull out the sequence column, and then two copies of the substring tool.

We're asking biologists to break a problem down into pieces--each of which is solvable using some set of tools--and then to string those tools together in the right order with the right parameters. That sounds an awful lot like programming, doesn't it? Although you may not think about it anymore, some of the fundamental concepts of programming are new and difficult. Luckily, it turns out that biologists learned more in grad school than how to extract things out of (reluctant) other things. In fact, they already know how to break down problems, loop, branch, and debug; instead of programming, though, they call it developing protocols. They also already have cookbooks for experimental molecular biology. Such a protocol might include lines like:

  1. Add 1 ml of such-and-such enzyme to the DNA.
  2. Incubate test tube at 90 degrees C for an hour.
  3. If the mixture turns clear, goto step 5.
  4. Repeat steps 2-3 three times.
  5. Pour liquid into a sterile bottle very carefully.

We borrowed the term "protocol" to describe an ordered set of parameterized Scriptome tools that solves a larger problem. (The right word for this is a script, but don't tell our users--they might realize they're learning how to program.) We feature some pre-written protocols on the website. Note that because each tool is a command-line command, a set of them together is really just an executable shell script.

The Scriptome may be even more than a high-level, mostly syntax-free, non-toy language for NPBs. Because it exposes the Perl directly on the website--giving new meaning to the term "open source"--some curious biologists may even start reading the short, simple, relevant examples of Perl code. (Unfortunately, putting the command into one line makes it harder to read. One of our TODOs is an Explain button next to each tool, which would show you a commented, multi-line version of each script.) From there, it's a short hop to tweaking the tools, and before you know it, we'll have more annoying newbie posts on comp.lang.perl.misc!

Intelligent Design: The Geeky Details

If you've read this far, you may have realized by now that the Scriptome is not a programming project at heart. Design, interface, documentation, and examples are as important as the programming itself, which is pretty easy. This being an article on Perl.com, though, I want to discuss the use of Perl throughout the project.

Why Perl?

Several people asked me why we didn't write the Scriptome in Python, or R, or just use UNIX sh for everything. Well, other than the obvious ("It's the One True Language!"), Perl data munges by design, it's great for fast tool development, it's portable to many platforms, and it's already installed on every Unix and Mac OS X box. Moreover, the Bioperl modules offer me a huge number of tools to steal, um, reuse. Finally, Perl is the preferred language of the entire Scriptome development team (me).

The Scriptome Build

Even though we're trying to keep the tools pretty generic and simple, we know we'll need several dozen at least, to be at all useful. In addition, data formats and biologists' interests will change over time. We knew we had to make the process of creating a new tool fast and automatic.

I write the tool pages in POD, which lets me use Vim rather than a fancy web-page editor. My Makefile runs pod2html to create a nice, if simple, web page that includes the table of contents for free. A Perl filter then adds a navigation bar and some simple interface-enhancing JavaScript, and makes the parameters red. I may give in and switch to a templating system, database back end, or XML eventually, and automated testing would be great. For now, keeping it simple means I can create, test, document, and publish a new tool in under an hour. (Okay, I didn't include debugging in that time.)

Perl Culture

There's lots of Perl code in the project, but I'm trying to incorporate some Perl attitude as well. The "Aha!" moment of the Scriptome came when we realized we could just post a bunch of hacked one-liners on the Web to help biologists now, rather than spend six or 12 months crafting the perfect solution. While many computational biologists focus on writing O(N) programs for sophisticated sequence analysis or gene expression studies, we're not ashamed to write glue instead; we solve the unglamorous problem of taking the output from their fancy programs and throwing it into tabular format, so that a biologist can study the results in Excel. After all, if data munging is even one step in Neeraj's pipeline, then he still can't get his paper published without these tools. Finally, we're listening aggressively to our users, because only they can tell us which easy things to make easy and which hard things to make possible.

Prev  [1] [2] [3] Next

Close    To Top
  • Prev Article-Programming:
  • Next Article-Programming:
  • Now: Tutorial for Web and Software Design > Programming > Perl > Programming Content
    Photoshop Tutorial
     

    Special Effect

      3D Effect
      Photoshop Articles
    Programming Tutorial
     

    C/C++ Tutorial

      Visual Basic
      C# Tutorial
    Database Tutorial
     

    MySQL Tutorial

      MS SQL Tutorial
      Oracle Tutorial
    Geek Tutorial
     

    Blogging Tutorial

      RSS Tutorial
      Podcasting Tutorial
    Graphic Design Tutorial
      Coreldraw Tutorial
      Illustrator Tutorial
      3D Tutorials
    Webmaster Articles
     

    Domain Service

      Web Hosting
      Site Promotion
    Java Tutorial/ Articles
     

    Java Servlets

      JavaEE Tutorial
     

    JavaBeans Tutorial

    XML Tutorial/ Articles
     

    XML Style

      AJAX Tutorial
      XML Mobile
    Flash Tutorial/ Articles
     

    Flash Video

      Action Script
      Flash Articles
    OS Tutorial/ Articles
      Linux Tutorial
      Symbian Tutorial
      MacOS Tutorial
    Personal Tech
      Hardware Tutorial
      Software Tutorial
      Online Auction