Analyzing HTML with Perl

Analyzing HTML with Perl

by Kendrew Lau
January 19, 2006

Routine work is all around us every day, no matter if you like it or not. For a teacher on computing subjects, grading assignments can be such work. Certain computing assignments aim at practicing operating skills rather than creativity, especially in elementary courses. Grading this kind of assignment is time-consuming and repetitive, if not tedious.

In a business information system course that I taught, one lesson was about writing web pages. As the course was the first computing subject for the students, we used Nvu, a WYSIWYG web page editor, rather than coding the HTML. One class assignment required writing three or more inter-linked web pages containing a list of HTML elements.

Write three or more web pages having the following:

  • Italicized text (2 points)
  • Bolded text (2 points)
  • Three different colors of text (5 points)
  • Three different sizes of text (5 points)
  • Linked graphics with border (5 points)
  • Linked graphics without border (5 points)
  • Non-linked graphics with border (3 points)
  • Non-linked graphics without border (2 points)
  • Three external links (5 points)
  • One horizontal line--not full width of page (5 points)
  • Three internal links to other pages (10 points)
  • Two tables (10 points)
  • One bulleted list (5 points)
  • One numerical list (5 points)
  • Non-default text color (5 points)
  • Non-default link color (2 points)
  • Non-default active link color (2 points)
  • Non-default visited link color (2 points)
  • Non-default background color (5 points)
  • A background image (5 points)
  • Pleasant appearance in the pages (10 points)

Beginning to grade the students' work, I found it monotonous and error-prone. Because the HTML elements could be in any of the pages, I had to jump to every page and count the HTML elements in question. I also needed to do it for each element in the requirement. While some occurrences were easy to spot in the rendered pages in a browser, others required close examination of the HTML code. For example, a student wrote a horizontal line (<hr> element) extending 98 percent of the width of the window, which was difficult to differentiate visually from a full-width horizontal line. Some other students just liked to use black and dark gray as two different colors in different parts of the pages. In addition to locating the elements, awarding and totaling marks were also error-prone.

I felt a little regret on the flexibility in the requirement. If I had fixed the file names of the pages and assigned the HTML elements to individual pages, grading could have been easier. Rather than continuing the work with regret, I wrote a Perl program to grade the assignments. The program essentially parses the web pages, awards marks according to the requirements, writes basic comments, and calculates the total score.

Processing HTML with Perl

Perl's regular expressions have excellent text processing capability and there are handy modules for parsing web pages. The module HTML::TreeBuilder provides a HTML parser that builds a tree structure of the elements in a web page. It is easy to create a tree and build its content from a HTML file:

$tree = HTML::TreeBuilder->new;

$tree->parse_file($file_name);

Nodes in the tree are HTML::Element objects. There are plenty of methods with which to access and manipulate elements in the tree. When you finish using the tree, destroy it and free the memory it occupied:

$tree->delete;

The module HTML::Element represents HTML elements in tree structures created by HTML::TreeBuilder. It has a huge number of methods for accessing and manipulating the element and searching for descendants down the tree or ancestors up the tree. The method find() retrieves all descending elements with one or more specified tag names. For example:

@elements = $element->find('a', 'img');

stores all <a> and <img> elements at or under $element to the array @elements. The method look_down() is a more powerful version of find(). It selects descending elements by three kinds of criteria: exactly specifying an attribute's value or a tag name, matching an attribute's value or tag name by a regular expression, and applying a subroutine that returns true on examining desired elements. Here are some examples:

@anchors = $element->look_down('_tag' => 'a');

retrieves all <a> elements at or under $element and stores them to the array @anchors.

@colors = $element->look_down('style' => qr/color/);

selects all elements at or under $element having a style attribute value that contains color.

@largeimages = $element->look_down(

    sub {

         $_[0]->tag() eq 'img'          and

        ($_[0]->attr('width') > 100 or

         $_[0]->attr('height')  > 100)

    }

);

locates at or under $element all images (<img> elements) with widths or heights larger than 100 pixels. Note that this code will produce a warning message on encountering an <img> element that has no width or height attribute.

You can also mix the three kinds of criteria into one invocation of look_down. The last example could also be:

@largeimages = $element->look_down(

    '_tag'   => 'img',

    'width'  => qr//,

    'height' => qr//,

    sub { $_[0]->attr('width')  > 100 or

          $_[0]->attr('height') > 100 }

);

This code also caters for any missing width or height attribute in an <img> element. The parameters 'width' => qr// and 'height' => qr// guarantee selection of only those <img> elements that have both width or height attributes. The code block checks these for the attribute values, when invoked.

The method look_up() looks for ancestors from an element by the same kinds of criteria of look_down().

[1] [2] [3] Next

Close    To Top
  • Prev Article-Programming:
  • Next Article-Programming:
  • Now: Tutorial for Web and Software Design > Programming > Perl > Programming Content
    Photoshop Tutorial
     

    Special Effect

      3D Effect
      Photoshop Articles
    Programming Tutorial
     

    C/C++ Tutorial

      Visual Basic
      C# Tutorial
    Database Tutorial
     

    MySQL Tutorial

      MS SQL Tutorial
      Oracle Tutorial
    Geek Tutorial
     

    Blogging Tutorial

      RSS Tutorial
      Podcasting Tutorial
    Graphic Design Tutorial
      Coreldraw Tutorial
      Illustrator Tutorial
      3D Tutorials
    Webmaster Articles
     

    Domain Service

      Web Hosting
      Site Promotion
    Java Tutorial/ Articles
     

    Java Servlets

      JavaEE Tutorial
     

    JavaBeans Tutorial

    XML Tutorial/ Articles
     

    XML Style

      AJAX Tutorial
      XML Mobile
    Flash Tutorial/ Articles
     

    Flash Video

      Action Script
      Flash Articles
    OS Tutorial/ Articles
      Linux Tutorial
      Symbian Tutorial
      MacOS Tutorial
    Personal Tech
      Hardware Tutorial
      Software Tutorial
      Online Auction