Now: Tutorial for Web and Software Design > Programming > Perl > Programming Content
> Data Munging for Non-Programming Biologists [Bookmark it]
Data Munging for Non-Programming Biologists

Data Munging for Non-Programming Biologists

by Amir Karger, Eitan Rubin
October 20, 2005

Have you ever renamed 768 files? Merged the content from 96 files into a spreadsheet? Filtered 100 lines out of a 20,000-line file?

Have you ever done these things by hand?

Disciples of laziness--one of the three Perl programmer's virtues--know that you should never repeat anything five times, let alone 768. It dismayed me to learn that biologists do this kind of thing all the time.

On the Origin of Scripts: The Problem

Experimental biologists increasingly face large sets of large files in often-incompatible formats, which they need to filter, reformat, merge, and otherwise munge (definition 3). Biologists who can't write Perl (most of them) often end up editing large files by hand. When they have the same problem a week later, they do the same thing again--or they just give up.

My job description includes helping biologists to use computers. I could just write tailored, one-off scripts for them, right? As an answer, let me tell you about Neeraj. Neeraj is a typical NPB (non-programming biologist) who works down the hall. He came into my office, saying, "I have 12,000 sequences that I need to make primers for." I said, "Make what?" (I haven't been doing biology for very long.) Luckily, we figured out pretty quickly that all he wants to do is get characters 201-400 from each DNA sequence in a file. Those of you who have been Perling for a while can do this with your eyes closed (if you can touch type):

perl -ne 'print substr($_, 200, 200), "\n"' sequences.in >

    primers.out

Voilá! I gave Neeraj his output file and he went away, happy, to finish building his clone army to take over the world. (Or was he going to genetically modify rice to solve world hunger? I keep forgetting.)

Unfortunately, that wasn't the end. The next day, Neeraj came back, because he also wanted primers from the back end of the sequences (substr($_, -400, 200)). Because he's doing cutting-edge research, he may have totally different requirements next month, when he finishes his experiments. With just a few people in our group supporting hundreds or even thousands of biologists, writing tailored scripts, even quick one-liners, doesn't scale. Other common solutions, such as teaching biologists Perl or creating graphical workflow managers, didn't seem to fully address the data manipulation problem especially for occasional users, who won't be munging every day.

We need some tool that allows Neeraj, or any NPB, to munge his own data, rather than relying on (and explaining biology to) a programmer. Keeping the biologist in the loop this way gives him the best chance of applying the relevant data and algorithms to answer the right questions. The tool must be easy for a non-programmer to learn and to remember after a month harvesting fish eyes in Africa. It should also be TMTOWTDI-compliant, allowing him to play with data until he can sculpt it in the most meaningful way. While we're at it, the tool will need to evolve rapidly as biologists ask new questions and create new kinds of data at an ever-increasing rate.

When I told Neeraj's story to others in our group, they said that they have struggled with this problem for years. During one of our brainstorming sessions, my not-so-pointy-haired boss, Eitan Rubin, said, "Wouldn't it be nice if we could just give them a book of magical data-munging scripts that Just Work?" "Hm--a sort of Script Tome?" And thus the Scriptome was born. (The joke here is that every self-respecting area of study in biology these days needs to have "ome" in its name: the genome, proteome, metabolome. There's even a journal called OMICS now.)

Harnessing the Power of the Atom

The Scriptome is a cookbook for munging biological data. The cookbook model nicely fits the UNIX paradigm of small tools that do simple operations. Instead of UNIX pipes, though, we encourage the use of intermediate files to avoid errors.

We use a couple of tricks in order to make this cookbook accessible to NPBs. We use the familiar web browser as our GUI and harness the power of hyperlinking to develop a highly granular, hierarchical table of contents for the tools. This means we can include dozens to hundreds of tools, without requiring users to remember command names. Another trick is syntax highlighting. We gray out most of the Perl, to signify that reading it is optional. Parameters--such as filenames, or maximum values to filter a certain column by--we highlight in attention-getting red. Finally, we make a conscious effort to avoid computer science or invented terminology. Instead, we use language biologists find familiar. For example, tools are "atoms," rather than "snippets."

Each Scriptome tool consists of a Perl one-liner in a colored box, along with a couple of sentences of documentation (any more than that and no one will read it), and sample inputs and outputs. In order to use a tool, you:

  • Pick a tool type, perhaps "Choose" to choose certain lines or columns from a file.
  • Browse a hierarchical table of contents.
  • Cut and paste the code from the colored box onto a Unix, Mac OS X, or Windows command line. (Friendlier interfaces are in alpha testing--a later section explains more.)
  • Change red text as desired, using arrow keys or a text editor.
  • Hit Enter.
  • That's it!

Figure 1
Figure 1. A Scriptome tool for finding unique lines in a file--click image for full-size screen shot.

The tool in Figure 1 reads the input line by line, using Perl's -n option, and prints each line only when it sees the value in a given, user-editable column for the first time. The substitution removes a newline, even if it's a Windows file being read on a UNIX machine. Then the line is split on tabs. A hash keeps track of unique values in the given column, deciding which lines to print. Finally, the script prints to the screen a very quick diagnostic, specifically how many lines it chose out of how many total lines it read. (Choosing all lines or zero lines may mean you're not filtering correctly.)

By cutting and pasting tools like this, a biologist can perform basic data munging operations, without any programming knowledge or program installation (except for ActivePerl on Windows). Unfortunately, that's still not really enough to solve real-world problems.

[1] [2] [3] Next

[Bookmark][Print] [Close][To Top]
  • Prev Article-Programming:

  • Next Article-Programming:
  • Related Materias
    Using the For匛ach Stateme
    Crystal Reports Tutorial -
    Creating a Toolbar - Visua
    Using Office applications 
    Visual Basic Explorer - Tu
    Visual Basic Explorer - Tu
    Visual Basic Explorer - Si
    Visual Basic Explorer- Gam
    Visual Basic Explorer- Gam
    Visual Basic Explorer- Gam
    Topics
    Photoshop Tutorial
     

    Special Effect

      3D Effect
      Photoshop Articles
    Programming Tutorial
     

    C/C++ Tutorial

      Visual Basic
      C# Tutorial
    Database Tutorial
     

    MySQL Tutorial

      MS SQL Tutorial
      Oracle Tutorial
    Graphic Design Tutorial
     

    Coreldraw Tutorial

      Illustrator Tutorial
      3D Graphics Articles
    Webmaster Articles
     

    Domain Service

      Web Hosting
      Site Promotion
    Java Tutorial&Articles
     

    Java Servlets

      JavaEE Tutorial
     

    JavaBeans Tutorial

    XML Tutorial&Articles
     

    XML Style Tutorial

      AJAX Tutorial
      XML Mobile
    Flash Tutorial&Articles
     

    Flash Video

      Action Script
      Flash Articles
    OS Tutorial&Articles
     

    Linux Tutorial

      Symbian Tutorial
      MacOS Tutorial