More Advancements in Perl Programming

by Simon Cozens

Parsing

Perhaps the biggest change of heart I had between writing a chapter and its publication was in the parsing chapter. That chapter had very little about parsing HTML, and what it did have was not very friendly. Since then, Gisle Aas and Sean Burke's HTML::TreeBuilder and the corresponding XML::TreeBuilder have established themselves as much simpler and more flexible ways to navigate HTML and XML documents.

The basic concept in HTML::TreeBuilder is the HTML element, represented as an object of the HTML::Element class:

use HTML::Element;

$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$html = $a->as_HTML;

This creates a new element that is an anchor tag, with an href attribute. The HTML equivalent in $html would be an empty anchor, <a href="http://www.perl.com/"></a>.

Now you can add some content to that tag:

$a->push_content("The Perl Homepage");

This time, the object represents <a href="http://www.perl.com/">The Perl Homepage</a>.

You can ask this element for its tag, its attributes, its content, and so on:

$tag = $a->tag;
$link = $a->attr("href");
@content = $a->content_list; # Child nodes: HTML::Element objects or plain text strings
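Putting those pieces together, here is a minimal self-contained sketch of building and querying a single element. The variable name $anchor is my own choice (the built-in $a is best left to sort), and the exact string returned by as_HTML may vary slightly between HTML::Element versions, so treat the printed markup as indicative only.

```perl
use strict;
use warnings;
use HTML::Element;

# Build an anchor element by hand, then give it some text content.
my $anchor = HTML::Element->new('a', href => 'http://www.perl.com/');
$anchor->push_content("The Perl Homepage");

my $tag     = $anchor->tag;            # the tag name, lower-cased
my $link    = $anchor->attr("href");   # value of the href attribute
my @content = $anchor->content_list;   # children: here, one plain text string
my $text    = $anchor->as_text;        # text content with the markup stripped

print "$tag -> $link ($text)\n";
print $anchor->as_HTML, "\n";
```

Note that the only child here is a plain string rather than an object, which is why code that walks content_list usually checks ref() before calling methods on each item.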

Of course, when you are parsing HTML, you won't be creating those elements manually. Instead, you'll be navigating a tree of them, built out of your HTML document. The top-level module HTML::TreeBuilder does this for you:

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new();
$tree->parse_file("index.html");

Now $tree is an HTML::Element object representing the <html> tag and all its contents. You can extract all of the links with the extract_links() method:

for (@{ $tree->extract_links() || [] }) {
     my($link, $element, $attr, $tag) = @$_;
     print "Found link to $link in $tag\n";
}

The real workhorse of this module, however, is the look_down() method, which helps you pull elements out of the tree by their tags or attributes. For instance, in a search engine indexer that indexes HTML files, I have the following code:

for my $tag ($tree->look_down("_tag","meta")) {
    next unless $tag->attr("name");
    $hash{$tag->attr("name")} .= $tag->attr("content"). " ";
}

$hash{title} .= $_->as_text." " for $tree->look_down("_tag","title");

This finds all <meta> tags and stores their name and content attributes as key-value pairs in a hash; then it gathers all the text inside of <title> tags into another hash element. Similarly, you can look for tags by attribute value, spit out sub-trees as HTML or as text, and much more besides. For reaching into HTML text and pulling out just the bits you need, I haven't found anything better.
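As a rough illustration of looking up tags by attribute value and dumping sub-trees, here is a small sketch. The sample page, the class names, and the variable names are all invented for the example; the page is parsed from a string with new_from_content() rather than from a file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page, parsed from a string rather than a file.
my $page = <<'HTML';
<html><head><title>Demo</title></head>
<body>
<div class="nav"><a href="/">Home</a></div>
<div class="story"><p>First <b>story</b>.</p></div>
<div class="story"><p>Second story.</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($page);

# Criteria are ANDed: the tag must be "div" AND class must equal "story".
my @stories = $tree->look_down(_tag => "div", class => "story");
my @texts   = map { $_->as_text } @stories;

for my $div (@stories) {
    print "Text: ", $div->as_text, "\n";   # text only, markup stripped
    print "HTML: ", $div->as_HTML, "\n";   # the same sub-tree as markup
}

# In scalar context, look_down() returns only the first match.
my $first_link = $tree->look_down(_tag => "a");
my $href       = $first_link->attr("href");
print "First link goes to $href\n";
```

Passing several attribute-value pairs to one look_down() call narrows the match, which is usually tidier than filtering the result list afterwards.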

On the XML side of things, XML::Twig has emerged as the usual "middle layer," when XML::Simple is too simple and XML::Parser is, well, too much like hard work.

Templating

There's not much to say about templating, although in retrospect, I would have spent more of the paper expended on HTML::Mason talking about the Template Toolkit instead. Not that there's anything wrong with HTML::Mason, but the world seems to be moving away from templates that embed code in a specific language (say, Perl) towards separate little templating languages, like TAL and Template Toolkit.
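To make the distinction concrete, here is a minimal sketch of a Template Toolkit template driven from Perl. The template text and the data are invented for illustration; the point is only that the template itself contains directives in TT's own little language and no Perl.

```perl
use strict;
use warnings;
use Template;

# Hypothetical template: only TT directives, no embedded Perl.
my $template = <<'TT';
<ul>
[% FOREACH link IN links -%]
  <li><a href="[% link.url %]">[% link.title | html %]</a></li>
[% END -%]
</ul>
TT

# Hypothetical data handed to the template.
my $vars = {
    links => [
        { url => 'http://www.perl.com/', title => 'The Perl Homepage' },
    ],
};

my $tt     = Template->new;
my $output = '';

# process() takes a reference to the template text, the variables,
# and a reference to a scalar that receives the rendered output.
$tt->process(\$template, $vars, \$output) or die $tt->error;
print $output;
```

The Perl side only supplies data, so the same template could in principle be rendered by any program that provides a links list.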

The only thing to report is that Template Toolkit finally received a bit of attention from its maintainer a couple of months ago, but the long-awaited Template Toolkit 3 is looking as far away as, well, Perl 6.

Natural Language Processing

Who would have thought that the big news of 2005 would be that Yahoo is relevant again? Not only are they coming up with interesting new search technologies such as Y!Q, but they're releasing a lot of the guts behind what they're doing as public APIs. One of those that is particularly relevant for NLP is the Term Extraction web service.

This takes a chunk of text and pulls out the distinctive terms and phrases. Think of this as a step beyond something like Lingua::EN::Keywords, with the firepower of Yahoo behind it. To access the API, simply send an HTTP POST request to a given URL:

use LWP::UserAgent;
use XML::Twig;

my $uri  = "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction";
my $ua   = LWP::UserAgent->new();
my $resp = $ua->post($uri, {
    appid   => "PerlYahooExtractor",
    context => <<EOF
Two Scottish towns have seen the highest increase in house prices in the
UK this year, according to new figures.
Alexandria in West Dunbartonshire and Coatbridge in North Lanarkshire
both saw an average 35% rise in 2005.
EOF
});

if ($resp->is_success) {
    # Collect every <Result> element into an index while parsing
    my $xmlt = XML::Twig->new( index => [ "Result" ]);
    $xmlt->parse($resp->content);
    # Print each extracted term on its own line
    for my $result (@{ $xmlt->index("Result") || []}) {
        print $result->text, "\n";
    }
}

This produces:

north lanarkshire
scottish towns
west dunbartonshire
house prices
coatbridge
dunbartonshire
alexandria
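The parsing half of that script can be tried on its own against a canned response. The XML below is an assumed shape based on the element name the code indexes; the real service's response may carry extra attributes or namespaces.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumed response shape: a wrapper element containing <Result> elements.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<ResultSet>
  <Result>north lanarkshire</Result>
  <Result>scottish towns</Result>
  <Result>coatbridge</Result>
</ResultSet>
XML

# Ask XML::Twig to keep an index of every <Result> element it parses.
my $twig = XML::Twig->new( index => [ "Result" ] );
$twig->parse($xml);

# index() returns an array reference of the indexed elements, in document order.
my @terms = map { $_->text } @{ $twig->index("Result") || [] };
print "$_\n" for @terms;
```

Using the index option avoids writing a handler or walking the tree by hand when all you need is a flat list of one kind of element.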

Once I had informed the London Perl Mongers of this amazing discovery, Simon Wistow immediately bundled it up into a Perl module called Lingua::EN::Keywords::Yahoo, coming soon to a CPAN mirror near you.
