Analyzing HTML with Perl

by Kendrew Lau

Processing Multiple Files

These methods make it practical to parse the HTML of the web page assignments for grading. The grading program first builds a tree structure from each HTML file and stores the trees in the array @trees:

my @trees;
foreach (@files) {
    print "  building tree for $_ ...\n" if $options{v};
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($_);
    push( @trees, $tree );
}

The subroutine doitem() iterates through the array of trees, applying a passed-in code block that looks for particular HTML elements in each tree, and accumulates the number of elements the code block returns. To provide detailed information and ease debugging during development, it calls the convenience subroutine printd() to display the HTML elements found, along with their file name, when the verbose command line switch (-v) is set. Essentially, the code invokes this subroutine once for each kind of element in the requirements.

sub doitem {
    my $func = shift;
    my $num  = 0;
    foreach my $i ( 0 .. $#files ) {
        my @elements = $func->( $files[$i], $trees[$i] );
        printd $files[$i], @elements;
        $num += @elements;
    }
    return $num;
}
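To see how doitem() accumulates its count, here is a minimal self-contained sketch. The file names, the stand-in "trees" (plain array references rather than HTML::TreeBuilder objects), and the body of printd() are assumptions made for illustration; the article does not show printd() itself.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: real code would hold HTML::TreeBuilder objects in @trees.
my %options = ( v => 0 );
my @files   = ( "index.html", "about.html" );
my @trees   = ( [ "i", "b", "i" ], ["b"] );

# Assumed shape of printd(): list what was found, only in verbose mode.
sub printd {
    my ( $file, @elements ) = @_;
    return unless $options{v};
    print "    $file: $_\n" foreach @elements;
}

# Same logic as doitem() in the article.
sub doitem {
    my $func = shift;
    my $num  = 0;
    foreach my $i ( 0 .. $#files ) {
        my @elements = $func->( $files[$i], $trees[$i] );
        printd( $files[$i], @elements );
        $num += @elements;    # an array in numeric context yields its length
    }
    return $num;
}

# Count the "i" entries across both stand-in trees.
my $italics = doitem(
    sub {
        my ( $file, $tree ) = @_;
        return grep { $_ eq "i" } @$tree;
    }
);
print "italic elements found: $italics\n";    # prints 2
```

Setting $options{v} to 1 makes printd() list each match next to its file name, which mirrors the verbose switch described above.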

The code block passed into doitem() is a subroutine that takes two parameters, a file name and its corresponding HTML tree, and returns a list of selected elements in the tree. The following code block retrieves all italic HTML elements, including <i> elements (for example, <i>text</i>) and elements with a font-style of italic (for example, <span STYLE="font-style: italic">text</span>).

$n = doitem sub {
    my ( $file, $tree ) = @_;
    return ( $tree->find("i"),
        $tree->look_down( "style" => qr/font-style *: *italic/ ) );
};

marking "Italicized text (2 points): "
  . ( ( $n > 0 ) ? "good. 2" : "no italic text. 0" );

Two points are available for any italic text in the pages. The marking subroutine records grading in a string. At the end of the program, examining the string helps to calculate the total points.
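The article does not show marking() itself. The following is a minimal sketch of one way it could work, assuming each recorded line ends with the points awarded (as in "good. 2"); the names $report and total_points() are hypothetical.

```perl
use strict;
use warnings;

my $report = "";    # hypothetical name for the string that accumulates grading lines

# Assumed behavior of marking(): append one line per requirement.
sub marking {
    my ($line) = @_;
    $report .= "$line\n";
}

# Sum the trailing integer on each recorded line (assumed record format).
sub total_points {
    my $total = 0;
    $total += $1 while $report =~ /(\d+)$/mg;
    return $total;
}

my $n = 3;    # pretend doitem() found three italic elements
marking "Italicized text (2 points): "
  . ( ( $n > 0 ) ? "good. 2" : "no italic text. 0" );
marking "Colored text (2 points): no colored text. 0";

print $report;
print "Total: ", total_points(), "\n";
```

Keeping the grades in one string means the same text can serve as the report shown to the student and as the input for the final tally.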

Other requirements are marked in the same manner, though some selection code is more involved. A regular expression helps to select elements with non-default colors.

my $pattern = qr/(^|[^-])color *: *rgb\( *[0-9]*, *[0-9]*, *[0-9]*\)/;
return $tree->look_down(
    "style" => $pattern,
    sub { $_[0]->as_trimmed_text ne "" }
);

Nvu colors text with a color style of the form rgb(R, G, B) (for example, <span STYLE="color: rgb(0, 0, 255);">text</span>). The above code is slightly stricter than the italic code, as it also requires an element to contain some text. The method as_trimmed_text() of HTML::Element returns the textual content of an element with any leading and trailing spaces removed.
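The (^|[^-]) prefix in the pattern is what keeps background-color from being mistaken for a text color. As a quick check, this sketch applies the same pattern to a few style strings; the sample values are invented for illustration.

```perl
use strict;
use warnings;

# Pattern from the article: "color" must start the string or follow a non-hyphen.
my $pattern = qr/(^|[^-])color *: *rgb\( *[0-9]*, *[0-9]*, *[0-9]*\)/;

# Sample style values, invented for this illustration.
my @styles = (
    "color: rgb(0, 0, 255);",
    "font-weight: bold; color: rgb(255, 0, 0);",
    "background-color: rgb(255, 255, 0);",
    "color: blue;",
);

# Only the first two entries set a text color in rgb() form.
my @matched = grep { $_ =~ $pattern } @styles;
print "text color set: $_\n" foreach @matched;
```

The last entry is skipped because the pattern only recognizes the rgb() notation, which is the form the article says Nvu writes.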

Nested invocations of look_down() locate linked graphics with a border. This selects any link (an <a> element) that encloses an image (an <img> element) that has a border.

return $tree->look_down(
    "_tag" => "a",
    sub {
        $_[0]->look_down( "_tag" => "img", sub { hasBorder( $_[0] ) } );
    }
);

Finding non-linked graphics is more interesting, as it involves both look_down() and look_up(). It should find only bordered images (<img> elements) that do not have an enclosing link (an <a> element) anywhere up the tree.

return $tree->look_down(
    "_tag" => "img",
    sub { !$_[0]->look_up( "_tag" => "a" ) and hasBorder( $_[0] ); }
);
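The two selections can be tried on a small inline document. This is a sketch only: it assumes HTML::TreeBuilder is installed, the sample HTML is invented, and hasBorder() is a hypothetical version (the article does not show its body) that treats a positive numeric border attribute as a border.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical hasBorder(): true when the border attribute is a positive number.
sub hasBorder {
    my ($img) = @_;
    my $border = $img->attr("border");
    return defined $border && $border =~ /^\d+$/ && $border > 0;
}

# Sample page, invented for this illustration.
my $html = <<'HTML';
<html><body>
<a href="a.html"><img src="linked.png" border="2"></a>
<p><img src="plain.png" border="1"></p>
<p><img src="noborder.png" border="0"></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Links that enclose a bordered image.
my @linked = $tree->look_down(
    "_tag" => "a",
    sub { $_[0]->look_down( "_tag" => "img", sub { hasBorder( $_[0] ) } ) }
);

# Bordered images with no <a> ancestor.
my @unlinked = $tree->look_down(
    "_tag" => "img",
    sub { !$_[0]->look_up( "_tag" => "a" ) and hasBorder( $_[0] ) }
);

my @unlinked_src = map { $_->attr("src") } @unlinked;
print "linked graphics: ", scalar(@linked), "\n";
print "non-linked graphics: @unlinked_src\n";

$tree->delete;    # release the tree when done
```

Inside a criterion subroutine, look_down() and look_up() are evaluated as a true or false test, so each returns either the first matching element or undef, which is why they can be used directly as conditions.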

Checking valid internal links requires passing look_down() a code block that excludes common external links by checking the href value against protocol names, and verifies the existence of the file linked in the web page.

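use File::Basename;

$n = doitem sub {
    my ( $file, $tree ) = @_;
    return $tree->look_down(
        "_tag" => "a",
        "href" => qr//,    # the link must carry an href attribute
        sub {
            # skip common external protocols
            !( $_[0]->attr("href") =~ /^ *(http:|https:|ftp:|mailto:)/ )
            # the linked file must exist relative to the page's directory
            and -e dirname($file) . "/" . decodeURL( $_[0]->attr("href") );
        }
    );
};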
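The helper decodeURL() is not shown in the article. One plausible version, sketched here as an assumption, converts %XX escapes back into characters so that a link such as my%20page.html maps to the file name on disk. The helper is_external() is a hypothetical wrapper around the same protocol test used above.

```perl
use strict;
use warnings;

# Assumed behavior of decodeURL(): turn %XX escapes back into characters.
sub decodeURL {
    my ($url) = @_;
    $url =~ s/%([0-9A-Fa-f]{2})/chr( hex($1) )/ge;
    return $url;
}

# Same protocol test as the article's code block (helper name is hypothetical).
sub is_external {
    my ($href) = @_;
    return $href =~ /^ *(http:|https:|ftp:|mailto:)/ ? 1 : 0;
}

foreach my $href ( "my%20page.html", "http://www.perl.org/", "mailto:me\@example.com" ) {
    printf "%-25s %s\n", $href,
      is_external($href) ? "external, skipped" : "check file: " . decodeURL($href);
}
```

This sketch leaves fragment identifiers (such as page.html#top) untouched; whether the original helper strips them is not stated in the article.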

Nvu changes a page's text color by specifying the color components in the style of the body tag, like <body style="color: rgb(0, 0, 255);">. A regular expression matches the style pattern and retrieves the three color components. Any non-zero color component denotes a non-default text color in a page.

my $pattern = qr/(?:^|[^-])color *: *rgb\(( *[0-9]*),( *[0-9]*),( *[0-9]*)\)/;
return $tree->look_down(
    "_tag"  => "body",
    "style" => qr//,
    sub {
        $_[0]->attr("style") =~ $pattern and
        ( $1 != 0 or $2 != 0 or $3 != 0 );
    }
);
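The capture logic can be checked without a tree by running the pattern against a few body style strings. This is a minimal sketch; the helper name non_default_color() and the sample styles are invented for illustration.

```perl
use strict;
use warnings;

# Pattern from the article; the three groups capture the R, G, and B components.
my $pattern = qr/(?:^|[^-])color *: *rgb\(( *[0-9]*),( *[0-9]*),( *[0-9]*)\)/;

# Returns 1 when the style sets a text color other than rgb(0, 0, 0), else 0.
sub non_default_color {
    my ($style) = @_;
    my $ok = $style =~ $pattern && ( $1 != 0 || $2 != 0 || $3 != 0 );
    return $ok ? 1 : 0;
}

# Sample body styles, invented for this illustration.
foreach my $style (
    "color: rgb(0, 0, 255);",
    "color: rgb(0, 0, 0);",
    "background-color: rgb(255, 255, 255);"
  )
{
    print "$style => ", non_default_color($style), "\n";
}
```

The captured groups may include leading spaces (for example " 255"), which Perl ignores when comparing numerically, so the components can be tested against zero as they are.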

With proper use of the methods look_down(), look_up(), and as_trimmed_text(), the code can locate and mark the existence of various required elements and any broken elements (images, internal links, or background images).
