Making Dictionaries with Perl

by Sean M. Burke

Conditional Output and Example Sentences

There are two optional parts of the entries that we haven't used yet: the citation fields, like "\cit hsd; led-149; led-411", and the example sentences field, like "\ex 'Láa hal súut hlgitl'gán. | She said harsh words to her.". The citation fields are typically only of importance to the editors, who might want to spot-check words against the places in the text where they were found. (And typically the editors are the only ones who would be fluent with the abbreviations, e.g., would know that "led-149" is short for "page 149 of the Leer-Edenso Dictionary of Haida".)

Ideally our program should produce output for the editors (with the citations) and output for normal users (without the citations). We can do this by having a $For_Editors variable that's set early on in the program:

  my $For_Editors = 0; # set to 1 to add citations

And then later on we have code that uses that variable:

  foreach my $headword ( custom_sort keys %headword2entries) {
    foreach my $entry ( @{ $headword2entries{$headword} } ) {
      print_entry( $entry );
    }
  }

  sub print_entry {
    my %e = %{$_[0]};
    $rtf->paragraph(
      [ \'\b',    $e{'hw'}  || "?hw?", ": " ],
      [ \'\b\i',  $e{'pos'} || "?pos?" ],
      " ", $e{'engl'} || "?english?", ".",
      $For_Editors && $e{'cit'} ? " [$e{'cit'}]" : (),
    );
  }

Our new and punctuation-rich $For_Editors && $e{'cit'} line is just a concise way of saying "if this is for the editors and if there's a citation in this entry, then print a space and a bracket before it, and a bracket after it -- otherwise don't add anything".
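To see the idiom on its own, here is a minimal sketch that leaves RTF::Writer out of it and just builds a list of strings. The entry_parts() routine and the placeholder entries are made up for illustration and aren't part of our dictionary program. The point is that in list context an empty list () contributes no elements at all, whereas an empty string or undef would still take up a slot:

```perl
use strict;
use warnings;

my $For_Editors = 1;   # set to 0 to leave citations out

# Hypothetical helper: returns the pieces we'd otherwise hand to $rtf->paragraph
sub entry_parts {
  my %e = %{$_[0]};
  return(
    $e{'hw'}   || "?hw?", ": ",
    $e{'engl'} || "?english?", ".",
    # In list context, () adds nothing to the surrounding list
    $For_Editors && $e{'cit'} ? " [$e{'cit'}]" : (),
  );
}

# Placeholder entries, not real dictionary data
my @with    = entry_parts( { hw => 'word', engl => 'gloss', cit => 'led-149' } );
my @without = entry_parts( { hw => 'word', engl => 'gloss' } );

print scalar(@with),    " parts: ", join('', @with),    "\n";
print scalar(@without), " parts: ", join('', @without), "\n";
```

The entry with a citation comes back as five pieces and the one without as four, so there is no stray empty element for the formatter to deal with.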

Our example sentences ("\ex 'Láa hal súut hlgitl'gán. | She said harsh words to her.") should probably end up in any normal dictionary, but of course we wouldn't want to try adding the contents of $e{'ex'} with formatting codes around it if it weren't actually present in this entry. We can use the same sort of $value ? "...$value..." : () idiom we used before -- except that this time we need to first take out the "|" that separates the Haida part from the English translation. That's simple, though:

    my($ex, $ex_eng);
    ($ex, $ex_eng) = split m/\|/, $e{'ex'} if $e{'ex'};
    $rtf->paragraph(
      ...
      $ex_eng ? (" $ex = $ex_eng") : (),
    );
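Note that splitting on just m/\|/ keeps the spaces that sit on either side of the "|" in the field. If that padding turns out to matter in the output, one possible variation is to let the pattern swallow it. The split_example() routine below is a hypothetical helper, not something from the program above, and the limit of 2 is an assumption that only the first "|" in a field is a separator:

```perl
use strict;
use warnings;
use utf8;                              # the sample string below is UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns (native sentence, English translation),
# or an empty list if the field is missing or empty
sub split_example {
  my $field = shift;
  return unless defined $field and length $field;
  # \s* on both sides eats the padding around the "|";
  # the limit of 2 means only the first "|" separates
  my($ex, $ex_eng) = split m/\s*\|\s*/, $field, 2;
  return($ex, $ex_eng);
}

my($ex, $ex_eng) = split_example(
  "'Láa hal súut hlgitl'gán. | She said harsh words to her."
);
print "[$ex] = [$ex_eng]\n";
```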

With that code in place, entries that have example sentences now show them.

Fancier Formatting

Now that basically everything else about our program is working, how about we polish it off with some formatting codes to make it look just right? We've already got some simple bold and italic codes, so the next thing is certainly to use different fonts. We could use, say, Bookman for the main headword and Times for the rest of the entry -- except that in the example sentence, we can use Bookman again for the Haida text, and Arial for the English translation.

However, a glance at the RTF Pocket Guide shows no RTF code that means "change to the font 'Arial'" -- just a code that means "change to font number N" (i.e., entry N in the list of fonts we declare for this document). This declaring is just a matter of adding a parameter 'fonts' => [ ...font names... ] to that dull $rtf->prolog() method we called back when we created $rtf. As the RTF::Writer documentation notes, "You should be sure to declare all fonts that you switch to in your document (as with \'\f3', to change the current font to what's declared in entry 3 (counting from 0) in the font table)." So if we just change our prolog call to this...

  $rtf->prolog( 'fonts' => [ "Times New Roman", "Bookman", "Arial" ] );

...then we can use \f0 to switch to Times New Roman (which is the default, incidentally, since it's the first declared font), \f1 to switch to Bookman, and \f2 to switch to Arial.

And suppose we want everything to be in 10-point, except for the Arial part, which we want in specifically 9-point Arial so it won't steal attention from the rest of the text, as sans-serif fonts often do. That's just a matter of \fs20 and \fs18 codes -- "fs" for "font size", plus the desired font size, in half-points. (Odd, I know.)
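If counting font numbers and doubling point sizes by hand feels error-prone, a pair of small helpers can do the arithmetic. This is only a sketch: font_code() and size_code() are hypothetical names, not part of RTF::Writer, and they just build the raw code strings from the same font list we give to the prolog call:

```perl
use strict;
use warnings;

# Same order as in the prolog call; a font's position in the list is its number
my @Fonts = ( "Times New Roman", "Bookman", "Arial" );
my %Font_num;
@Font_num{@Fonts} = ( 0 .. $#Fonts );

# Hypothetical helper: font name -> "\fN"
sub font_code {
  my $name = shift;
  die "Undeclared font: $name" unless exists $Font_num{$name};
  return '\f' . $Font_num{$name};
}

# Hypothetical helper: point size -> "\fsN", where N is in half-points
sub size_code {
  my $points = shift;
  return '\fs' . int( 2 * $points + 0.5 );
}

my $arial9   = font_code("Arial")   . size_code(9);    # \f2\fs18
my $bookman  = font_code("Bookman") . size_code(10);   # \f1\fs20
print "$arial9\n$bookman\n";
```

Since RTF::Writer treats a scalar reference as raw RTF code (as with the \'\b' we've been using), a string built this way could presumably be passed along as \$arial9.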

With these extra codes in place, our print_entry routine now looks like this:

  sub print_entry {
    my %e = %{$_[0]};
    my($ex, $ex_eng);
    ($ex, $ex_eng) = split m/\|/, $e{'ex'} if $e{'ex'};
    $rtf->paragraph(  \'\fs20',  # Start out in ten-point
      [ \'\f1\b', $e{'hw'}  || "?hw?", ": " ],
      [ \'\b\i',  $e{'pos'} || "?pos?" ],
      " ", $e{'engl'} || "?english?", ".",
      $For_Editors && $e{'cit'} ? " [$e{'cit'}]" : (),
      $ex_eng ? (" ", \'\f1', $ex, \'\f2\fs18', $ex_eng) : (),
    );
  }

It's dense, but then it does a lot of work!

As to adding fancier formatting, this is usually best done by just flipping through the RTF Pocket Guide and looking for a mention of the effect you want. For example, in a lexicon we might be particularly interested in hanging indents (\fi-300), two-column pages (\cols2), and page numbering ({\header \pard\ql\plain p.\chpgn \par}).
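To get a feel for where such codes go, here is a rough sketch that assembles a tiny RTF document by hand, as a plain string, without RTF::Writer. It rests on a few assumptions worth checking against the Pocket Guide: the column count is spelled \cols2, a hanging indent pairs a negative first-line indent (\fi) with an equal left indent (\li), and these measurements are in twips (1440 to the inch). The twips() helper and the entry text are placeholders:

```perl
use strict;
use warnings;

# Hypothetical helper: inches -> twips (1440 twips to the inch)
sub twips { return int( $_[0] * 1440 + 0.5 ) }

my $indent  = twips(0.25);     # a quarter-inch hanging indent
my @entries = ( "first entry ...", "second entry ..." );   # placeholder text

# Document opening and a one-font font table
my $doc = '{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}' . "\n";
$doc .= '\cols2' . "\n";                                   # two-column section
$doc .= '{\header \pard\ql\plain p.\chpgn \par}' . "\n";   # page number in header

foreach my $text (@entries) {
  # \li indents the whole paragraph; \fi- pulls the first line back out
  $doc .= "\\pard\\li$indent\\fi-$indent $text\\par\n";
}
$doc .= "}\n";

print $doc;
```

In the real program we'd let RTF::Writer produce the document and just hand it the extra codes as raw RTF; writing the string by hand here only makes it easier to see what ends up in the file.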

Now suppose that you're trying to make the most of your xeroxing budget, trading off nice large readable point size against how many people get copies. One way to squeeze as much content into as small a space as possible is to use abbreviations for the most repeated text in the dictionary -- the part-of-speech tags. So we can turn "noun" into just "n.", "verb" into "v.", and so on. Each time, we save only a little space, but it adds up quickly. And doing this (or at least trying it out to see how it looks) is straightforward. We need only change one line in print_entry(), from this:

      [ \'\b\i',  $e{'pos'} || "?pos?" ],

to this:

      [ \'\b\i',  $Abbrev{$e{'pos'}||''} || $e{'pos'} || "?pos?" ],

And earlier we'll have to define what should be in %Abbrev:

  my %Abbrev = (
   'auxiliary verb' => 'aux.',
   qw(noun n. verb v. adverb adv.),
  );

But that's all it takes to get the abbreviated tags into our output.

This still prints "?pos?" when an entry is erroneously missing the part-of-speech field. And it doesn't abbreviate the term "postposition". (If we did so, it'd probably be "pp.", which people would probably think was "participle" or something.) But the most common terms, "noun" and "verb", got shrunk down, saving a few characters per entry, which could add up to a dozen pages in a large printout.
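The three-way fallback is easy to try out by itself. In this sketch, pos_label() is a hypothetical wrapper around the same expression we put in print_entry(); it tries the abbreviation first, then the unabbreviated tag, then the "?pos?" marker:

```perl
use strict;
use warnings;

my %Abbrev = (
 'auxiliary verb' => 'aux.',
 qw(noun n. verb v. adverb adv.),
);

# Hypothetical wrapper around the expression used in print_entry()
sub pos_label {
  my $pos = shift;
  # The ||'' avoids an "uninitialized value" warning when the field is missing
  return $Abbrev{ $pos || '' } || $pos || "?pos?";
}

my @labels = map { pos_label($_) }
  'noun', 'auxiliary verb', 'postposition', undef;
print join(", ", @labels), "\n";   # n., aux., postposition, ?pos?
```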

Other Formats

I've just been talking about producing conventionally formatted dictionaries, but the same database and the same kinds of Perl could be used to instead produce different output formats. Use a bit of fancy page layout and a double-sided printer (or copier) and the same data can be turned into readymade flashcards. Or if you have a subject field in entries (like "plant", "color", "body part", "food"), it's easy to re-sort entries by topic, and produce a "topical dictionary", which language teachers find very useful in planning classroom exercises.
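As a rough sketch of the topical-dictionary idea, the regrouping can be done with the same hash-of-lists approach we used for %headword2entries. Everything specific here is an assumption: 'subj' is a hypothetical field name for the subject, the entries are placeholders rather than real data, and plain sort stands in for the custom_sort we'd use in the real program:

```perl
use strict;
use warnings;

# Placeholder entries; 'subj' is a hypothetical subject field
my @entries = (
  { hw => 'word-a', engl => 'salmon',  subj => 'food'  },
  { hw => 'word-b', engl => 'red',     subj => 'color' },
  { hw => 'word-c', engl => 'berries', subj => 'food'  },
  { hw => 'word-d', engl => 'canoe' },                     # no subject given
);

# Group entries under their subject, with a catch-all for the rest
my %subject2entries;
foreach my $e (@entries) {
  push @{ $subject2entries{ $e->{'subj'} || 'uncategorized' } }, $e;
}

# Print one section per subject, entries sorted by headword within it
foreach my $subj ( sort keys %subject2entries ) {
  print "$subj\n";
  foreach my $e ( sort { $a->{'hw'} cmp $b->{'hw'} } @{ $subject2entries{$subj} } ) {
    print "  $e->{'hw'}: $e->{'engl'}\n";
  }
}
```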

World Enough and Time

As A. N. Whitehead's famous quote goes, "Civilization advances by extending the number of important operations which we can perform without thinking about them. Operations of thought are like cavalry charges in a battle -- they are strictly limited in number, they require fresh horses, and must only be made at decisive moments." I've found this to be personally and critically true in dealing with endangered languages: it takes man-years of time to produce a dictionary of any useful size, both on the part of linguists and of members of the community. And with most of North America's native languages, the most fluent speakers are over 65, so there's no great surplus of man-years.

Whitehead was more right than he knew: saving time and effort doesn't just advance civilizations, it can help save them.

So when Perl helps us glue together a database program, a printer, and a word processor so that the typesetting phase of a dictionary takes not months, but minutes, this frees up the linguists and teachers and elders to spend scarce time and "decisive moments" working on preserving the language through study and teaching. We need every minute to work on revitalizing these languages that are the foundation of endangered cultures and civilizations -- with all their stories, poems, songs, sayings, proverbs, figures of speech, jokes, liturgy, and heaps of specialized jargon from botany and agriculture and healing and just plain ways of relating to people and the world, very little of which would survive mere translation to English.

We're in a hurry, and so we really appreciate Perl.

Finished Code for Sample Haida Dictionary

  use strict;
  use warnings;

  my $For_Editors = 0; # set to 1 to add citations

  use RTF::Writer;
  use Text::Shoebox::Lexicon;
  my $lex = Text::Shoebox::Lexicon->read_file( "haida.sf" );

  my $rtf = RTF::Writer->new_to_file( "lex.rtf" );
  $rtf->prolog( 'fonts' => [ "Times New Roman", "Bookman", "Arial" ] );

  use Sort::ArbBiLex (
    'custom_sort' =>
    "
     a A à À á Á â Â ã Ã ä Ä å Å æ Æ
     b B
     c C ç Ç
     d D ð Ð
     e E è È é É ê Ê ë Ë
     f F
     g G
     h H
     i I ì Ì í Í î Î ï Ï
     j J
     k K
     l L
     m M
     n N ñ Ñ
     o O ò Ò ó Ó ô Ô õ Õ ö Ö ø Ø
     p P
     q Q
     r R
     s S ß
     t T þ Þ
     u U ù Ù ú Ú û Û ü Ü
     v V
     w W
     x X
     y Y ý Ý ÿ
     z Z
    "
  );
  my %headword2entries;
  my %english2native;

  my %Abbrev = (
   'auxiliary verb' => 'aux.',
   qw(noun n. verb v. adverb adv.),
  );

  foreach my $entry ($lex->entries) {
    my(%e) = $entry->as_list;
    push @{ $headword2entries{ $e{'hw'} } },  \%e;
    my @reversed = $e{'ehw'} ? split( m/\s*;\s*/, $e{'ehw'} )
                             : reversables( $e{'engl'} );
    foreach my $engl ( @reversed ) {
      push @{ $english2native{ $engl } }, $e{'hw'}
    }
  }

  $rtf->paragraph( "Haida to English Dictionary\n\n" );

  foreach my $headword ( custom_sort keys %headword2entries) {
    foreach my $entry ( @{ $headword2entries{$headword} } ) {
      print_entry( $entry );
    }
  }

  $rtf->paragraph( "\n\nEnglish to Haida Index\n" );

  foreach my $engl ( custom_sort keys %english2native) {
    my $native = join "; ", custom_sort @{ $english2native{ $engl } };
    $rtf->paragraph( "$engl: $native" );
  }

  $rtf->close;
  exit;


  sub reversables {
    my $in = shift || return;
    my @english;
    foreach my $term ( grep $_, split /\s*;\s*/, $in ) {
      $term =~ s/^(a|an|the|to)\s+//;
      push @english, $term;
    }
    return @english;
  }


  sub print_entry {
    my %e = %{$_[0]};
    my($ex, $ex_eng);
    ($ex, $ex_eng) = split m/\|/, $e{'ex'} if $e{'ex'};
    $rtf->paragraph(  \'\fs20',  # Start out in ten-point
      [ \'\f1\b', $e{'hw'}  || "?hw?", ": " ],
      [ \'\b\i',  $Abbrev{$e{'pos'}||''} || $e{'pos'} || "?pos?" ],
      " ", $e{'engl'} || "?english?", ".",
      $For_Editors && $e{'cit'} ? " [$e{'cit'}]" : (),
      $ex_eng ? (" ", \'\f1', $ex, \'\f2\fs18', $ex_eng) : (),
    );
  }
