Lexing Your Data

by Curtis Poe
January 05, 2006

s/(?<!SHOOTING YOURSELF IN THE )FOOT/HEAD/g

Most of us have tried at one time or another to use regular expressions to do things we shouldn't: parsing HTML, obfuscating code, washing dishes, etc. This is what the technical term "showing off" means. I've done it too:

$html =~ s{
             (<a\s(?:[^>](?!href))*href\s*)
             (&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
             ([^>]+>)
          }
          {$1 . decode_entities($2) .  $4}gsexi;

I was strutting like a peacock when I wrote that, followed quickly by eating crow when I ran it. I never did get that working right. I'm still not sure what I was trying to do. That regular expression forced me to learn how to use HTML::TokeParser. More importantly, that was the regular expression that taught me how difficult regular expressions can be.

The Problem with Regular Expressions

Look at that regex again:

 /(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/

Do you know what that matches? Exactly? Are you sure? Even if it works, how easily can you modify it? If you don't know what it was trying to do (and to be fair, don't forget it's broken), how long did you spend trying to figure it out? When's the last time a single line of code gave you such fits?

The problem, of course, is that this regular expression is trying to do far more work than a single line of code is likely to do. When faced with a regular expression like that, there are a few things I like to do.

  • Document it carefully.
  • Use the /x switch so I can expand it over several lines.
  • Possibly, encapsulate it in a subroutine.
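
As a small illustration of those three habits together, here is a sketch built around a deliberately simple pattern. The parse_iso_date name and the date format are assumptions made up for this example; they have nothing to do with the broken regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (year, month, day) for a string such as
# "2006-01-05", or an empty list if the string is not in that form.
sub parse_iso_date {
    my ($text) = @_;
    return $text =~ m{
        \A
        (\d{4})     # four-digit year
        -
        (\d{2})     # two-digit month
        -
        (\d{2})     # two-digit day
        \z
    }x;
}

my ( $year, $month, $day ) = parse_iso_date('2006-01-05');
print "year=$year month=$month day=$day\n";
```

The /x switch leaves room for a comment beside each piece of the pattern, and the subroutine gives the whole thing a name that callers can read without ever looking at the regex.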

Sometimes, though, there's a fourth option: lexing.

Lexing

When developing code, we typically take a problem and break it down into a series of smaller problems that are easier to solve. Regular expressions are code, so you can break them down the same way. Lexing is one technique for doing that.

Lexing is the act of breaking data down into discrete tokens and assigning meaning to those tokens. There's a bit of fudging in that statement, but it pretty much covers the basics.
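
To make that definition concrete, here is a minimal hand-rolled sketch of a lexer. The token labels (NUMBER, IDENTIFIER, OPERATOR), the lex subroutine, and the sample input are all assumptions invented for this illustration, and this is not necessarily how a lexing module would do it. It relies on the \G anchor and the /gc flags so that each match resumes where the previous one stopped.

```perl
use strict;
use warnings;

# Hypothetical token table: each entry pairs a label with a pattern.
my @token_types = (
    [ NUMBER     => qr/\d+/ ],
    [ IDENTIFIER => qr/[A-Za-z_]\w*/ ],
    [ OPERATOR   => qr/[-+*\/=]/ ],
);

# Break $text into a list of [ label, value ] pairs.
sub lex {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while ( pos($text) < length $text ) {
        next TOKEN if $text =~ /\G\s+/gc;    # whitespace carries no meaning here
        for my $type (@token_types) {
            my ( $label, $pattern ) = @$type;
            if ( $text =~ /\G($pattern)/gc ) {
                push @tokens, [ $label, $1 ];
                next TOKEN;
            }
        }
        die "Cannot lex text at position " . pos($text) . "\n";
    }
    return @tokens;
}

for my $token ( lex('total = 42 + tax') ) {
    my ( $label, $value ) = @$token;
    printf "%-10s %s\n", $label, $value;
}
```

Each small pattern is easy to read on its own, and the "meaning" of a token is simply the label attached to it. Anything the table does not recognize stops the lexer with an error rather than being silently skipped.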

Parsing typically follows lexing to convert the tokens into something more useful. Parsing is frequently the domain of some tool that applies a well-defined grammar to the lexed tokens.

Sometimes well-defined grammars are not practical for extracting and reporting information. There might not be a grammar available for a company's ad-hoc log file format. Other times you might find it easier to process the tokens manually than to spend the time writing a grammar. Still other times you might only care about part of the data you've lexed, not all of it. All three of these reasons apply to some problems.

Parsing SQL

Recently, in a Perlmonks thread titled "parse a query string", someone had some SQL to parse:

select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
  else 'd'
  end tough_one
from ...
where ...

The poster needed the alias for each column from that SQL. In this case, the aliases are date, months_old, product, year, and tough_one. Of course, this was only one example. There's actually plenty of generated SQL, all with subtle variations on the column aliases, so this is not a trivial task. What's interesting about this, though, is that we don't give a fig about anything except the column aliases. The rest of the text is merely there to help us find those aliases.

Your first thought might be to parse this with SQL::Statement. As it turns out, this module does not handle CASE statements. Thus, you must figure out how to patch SQL::Statement, submit said patch, and hope it gets accepted and released in a timely fashion. (Note that SQL::Statement uses SQL::Parser, so the latter is also not an option.)

Second, many of us have worked in environments where we have problems to solve in production now, but we still have to wait three weeks to get the necessary modules installed, if we can get them approved at all.

The most important reason, though, is even if SQL::Statement could handle this problem, this would be an awfully short article if you used it instead of a lexer.
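
To show the shape of the task before reaching for a proper lexer, here is a rough hand-rolled sketch that walks the sample SQL token by token and keeps only the aliases. The column_aliases name and its rules are assumptions made for this illustration: it assumes every top-level column ends with its alias (or is a bare column name), it does not handle escaped quotes, and it stops at the first top-level FROM.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the last name seen before each top-level
# comma and before the top-level FROM keyword.
sub column_aliases {
    my ($sql) = @_;
    my ( @aliases, $last_word );
    my $depth = 0;    # current parenthesis nesting level
    pos($sql) = 0;
    while ( pos($sql) < length $sql ) {
        if    ( $sql =~ /\G\s+/gc )     { }    # skip whitespace
        elsif ( $sql =~ /\G'[^']*'/gc ) { }    # skip string literals
        elsif ( $sql =~ /\G"([^"]*)"/gc ) {    # quoted name such as "date"
            $last_word = $1 if $depth == 0;
        }
        elsif ( $sql =~ /\G(\w+)/gc ) {
            my $word = $1;
            if ( $depth == 0 && lc($word) eq 'from' ) {
                push @aliases, $last_word;     # FROM ends the column list
                last;
            }
            $last_word = $word if $depth == 0;
        }
        elsif ( $sql =~ /\G\(/gc ) { $depth++ }
        elsif ( $sql =~ /\G\)/gc ) { $depth-- }
        elsif ( $sql =~ /\G,/gc ) {            # only top-level commas end a column
            push @aliases, $last_word if $depth == 0;
        }
        elsif ( $sql =~ /\G./gcs ) { }         # ignore anything else, such as "="
    }
    return @aliases;
}

my $sql = <<'SQL';
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
  else 'd'
  end tough_one
from ...
SQL

print join( ', ', column_aliases($sql) ), "\n";
# prints: date, months_old, product, year, tough_one
```

Tracking the parenthesis depth is what keeps the commas inside months_between(...) and the "from" inside extract(...) from being mistaken for column boundaries. A column such as round(x) with no alias would wrongly report "round", which is one reason a sketch like this is no substitute for a real lexer.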
