Making Dictionaries with Perl
by Sean M. Burke
March 25, 2004
When you woke up this morning, the last thing you are likely to have thought is "If only I had a dictionary!" But there are thousands of languages on Earth that many people want to learn but can't, because there are few or no materials to start with: no Pocket Mohawk-English Dictionary, no Cherokee Poetry Reader, no Everyday Otomi: Second Year. Only in the past few years have people realized that these languages are not just curiosities, but are basic, indispensable, untranslatable parts of local cultures -- and they're disappearing in droves.
As I was learning Perl, the long arm of coincidence put me in contact with a good number of linguists who work on producing materials to help the study of these endangered languages. These folks work on producing textbooks and other "language materials," which is mostly straightforward, since the 1980s gave us "desktop publishing." But there was one real trouble spot: dictionaries. Writing a dictionary of any real size using just a word processor is maddening, like writing a novel on Post-Its. So they started using database programs, but had no way to turn the database into anything you could print and call a dictionary.
They had no way to take this:
Headword: dagiisláng
Citation: HSD
Part of speech: verb
English: wave a piece of cloth
Example: Dáayaangwaay hal dagiislánggan. | He was waving a flag.
And turn it into this:

"Well," I said and have been saying ever since, "this is no big deal, for you see, I am a programmer! Just export your database as CSV or something, email it to me, and I'll write a program that reads that and writes out a word-processor file with everything formatted all nice just like you want."
"A mere person, you, can program something that writes a word-processing document? But how can this be?! Surely this would require a year's work, a million lines of C++, and a bajillion dollars!"
"Yes. But instead I'll just use Perl, where I can do it in a few dozen lines of code, taking me just a few minutes." Because, you see, a conventionally formatted dictionary is just a glorified version of what people with business degrees would call a "database report," and people who work in cubicles generate such things all the time. And now I'll show you how it's done.
Reading the Input
Of course you'll need Perl, and that's not hard to come by. Then, at most, you just need a module for the input format and a module for the output format. And you don't even need that if the input and/or output formats are simple enough. In this case, the input format I'm often given is simple enough. It's called Shoebox Standard Format, and it looks like this:
\hw dagiisláng
\cit hsd
\pos verb
\engl wave a piece of cloth
\ex Dáayaangwaay hal dagiislánggan. | He was waving a flag.
\hw anáa
\cit hsd; led-285
\pos adverb
\engl inside a house; at home
\hw súut hlgitl'áa
\cit hsd; led-149; led-411
\engl speak harshly to someone; insult
\ex 'Láa hal súut hlgitl'gán. | She said harsh words to her.
\hw tlak'aláang
\cit led-398
\pos noun
\engl the shelter of a tree
Namely, \fieldname fieldvalue, each record ("entry") starting with a \hw field, and the records and fields being in no particular order. (And the data, incidentally, is vocabulary from Haida, an endangered language spoken in the Southeast Alaskan islands, where I live.)
Now, one could parse this with a regexp and a bit of while(<IN>) {...}, but there's already a module for this that will read in a whole file as a big list-of-lists data structure. After just a glance at the module's documentation, we can write this simple program to read in the lexicon as an object, and dump it to make sure that it's getting well filled in:
use Text::Shoebox::Lexicon;
my $lex = Text::Shoebox::Lexicon->read_file( "haida.sf" );
$lex->dump;
And that prints this:
Lexicon Text::Shoebox::Lexicon=HASH(0x15550f0) contains 4 entries:
Entry Text::Shoebox::Entry=ARRAY(0x1559104) contains:
hw = "dagiisláng"
cit = "hsd"
pos = "verb"
engl = "wave a piece of cloth"
ex = "Dáayaangwaay hal dagiislánggan. | He was waving a flag."
Entry Text::Shoebox::Entry=ARRAY(0x1559194) contains:
hw = "anáa"
cit = "hsd; led-285"
pos = "adverb"
engl = "inside a house; at home"
Entry Text::Shoebox::Entry=ARRAY(0x155920c) contains:
hw = "súut hlgitl'áa"
cit = "hsd; led-149; led-411"
engl = "speak harshly to someone; insult"
ex = "'Láa hal súut hlgitl'gán. | She said harsh words to her."
Entry Text::Shoebox::Entry=ARRAY(0x1559284) contains:
hw = "tlak'aláang"
cit = "led-398"
pos = "noun"
engl = "the shelter of a tree"
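For comparison, the regexp-and-loop approach mentioned above might look something like the following. This is only a rough sketch, not how Text::Shoebox works internally: the parse_sf function is a hypothetical name, it reads from a string rather than a file, and it assumes every field fits on one line, which real Shoebox files need not obey.

```perl
use strict;
use warnings;

# Parse Shoebox Standard Format text into a list of entries.
# Each entry is an arrayref of [fieldname, value] pairs, so both the
# field order and any repeated fields are preserved.
# Assumption: every field sits on a single line.
sub parse_sf {
    my ($text) = @_;
    my @entries;
    foreach my $line ( split /\n/, $text ) {
        # A field line looks like "\name value"; skip anything else.
        next unless $line =~ m/^\\(\S+)\s*(.*?)\s*$/;
        my ( $name, $value ) = ( $1, $2 );
        # A \hw field starts a new entry.
        push @entries, [] if $name eq 'hw' or !@entries;
        push @{ $entries[-1] }, [ $name, $value ];
    }
    return @entries;
}

my $sample = <<'END_SF';
\hw anáa
\cit hsd; led-285
\pos adverb
\engl inside a house; at home

\hw tlak'aláang
\cit led-398
\pos noun
\engl the shelter of a tree
END_SF

my @entries = parse_sf($sample);
foreach my $entry (@entries) {
    print join( ", ", map { "$_->[0] = $_->[1]" } @$entry ), "\n";
}
```

Keeping each entry as a list of pairs rather than a hash is a deliberate choice here: nothing is lost if a field name shows up twice in one entry.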
A further glance shows that $lex->entries returns a list of the entry objects, and that $entry->as_list returns the entry's contents as a list (key1, value1, key2, value2) -- exactly the kind of list that is ripe for dumping into a Perl hash. So:
foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
}
And that works perfectly, assuming we never have an entry like this:
\hw súut hlgitl'áa
\cit hsd; led-149; led-411
\engl speak harshly to someone
\engl insult
\ex 'Láa hal súut hlgitl'gán. | She said harsh words to her.
In that case, because there are two "engl" fields, $entry->as_list would return this:
(
'hw' => "súut hlgitl'áa",
'cit' => "hsd; led-149; led-411",
'engl' => "speak harshly to someone",
'engl' => "insult",
'ex' => "'Láa hal súut hlgitl'gán. | She said harsh words to her.",
)
And once we dump that into the hash %e, we would end up with just this:
(
'hw' => "súut hlgitl'áa",
'cit' => "hsd; led-149; led-411",
'engl' => "insult",
'ex' => "'Láa hal súut hlgitl'gán. | She said harsh words to her.",
)
...since, of course, keys have to be unique in Perl hashes. If you need to deal with a lexicon that has such entries, there are various methods in the Text::Shoebox::Entry class for that. But for a simple lexicon where each field comes up just once per entry, you can just use a hash -- and you can even check that that's the case by calling $entry->assert_keys_unique, which normally does nothing. If it sees duplicate field names in the given entry, though, it aborts the program and prints a helpful error message about the offending entry.
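To make the problem concrete without relying on any particular Text::Shoebox::Entry method, here is a minimal plain-Perl sketch. It takes the flat key/value list shown above, groups the values under each field name so that nothing gets overwritten, and then reports which field names were repeated. The variable names are just for illustration.

```perl
use strict;
use warnings;

# The flat (key, value, key, value, ...) list for the troublesome entry.
my @fields = (
    'hw'   => "súut hlgitl'áa",
    'cit'  => "hsd; led-149; led-411",
    'engl' => "speak harshly to someone",
    'engl' => "insult",
    'ex'   => "'Láa hal súut hlgitl'gán. | She said harsh words to her.",
);

# Group the values by field name, keeping every value in its original
# order instead of letting later ones overwrite earlier ones.
my %e;
my @pairs = @fields;    # work on a copy, since splice empties the array
while ( my ( $key, $value ) = splice @pairs, 0, 2 ) {
    push @{ $e{$key} }, $value;
}

# Names of any fields that occurred more than once in this entry.
my @repeated = grep { @{ $e{$_} } > 1 } sort keys %e;

print "engl: ", join( " / ", @{ $e{'engl'} } ), "\n";
print "repeated fields: @repeated\n";
```

Each value of %e is now an arrayref, so a field that occurs once is simply a one-element list.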
But for our data, with its unique keys, a hash works just fine:
foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
}
We would then do things with the contents of %e in that loop: either generating output right there, or putting it into Perl variables whose contents will later be output by other subroutines of ours.
Making the Output
Since we've got the basic input code squared away, now we get to think about how to output data. Once we know that, we'll know better how to write the code to make the formats meet in the middle.
As output formats go, HTML is good for many purposes; practically all programmers can code in it pretty well, and just about everyone can hardcopy HTML with their browser or word processor. However, even after all these years, there are still some basic problems with HTML: as a typesetting language, there's still no reliable support for control of page-layout options like headers and page-numbering, page breaks, newspaper columns, and the like. More importantly, WYSIWYG HTML editors all seem to be harmless at best or disastrous at worst. In my experience, that has ruled out HTML as an output format for the many lexicons where the output file still needs various kinds of manual touching-up in a word processor.
Because of these problems with HTML, I have generally chosen RTF as my output format. RTF is technically a Microsoft format, but somehow, somehow, it avoided most of the lunacy that that usually entails. Moreover, just about every word processor supports it. And Microsoft Word both prints and edits RTF pretty much flawlessly. (After all, it had to be good at something.) And finally, there's good Perl support for generating RTF, via the CPAN modules RTF::Writer and RTF::Document, so you can almost completely insulate yourself from dealing directly with the language. I'll use RTF::Writer, simply because I'm more familiar with it. (This may be due to the fact that it was written by the author of the delightful O'Reilly book RTF Pocket Guide, a handsome and charming man whose modesty forbids him from revealing that he is me.)
With a bit of skimming the RTF::Writer documentation, we can see that to send output to an RTF file, you create a sort of file handle for it, and then send data to it via its print or paragraph methods, like so:
use RTF::Writer;
my $rtf = RTF::Writer->new_to_file( "sample.rtf" );
$rtf->prolog(); # sets up sane defaults
$rtf->paragraph( "Hello world!" );
$rtf->close;
That writes an RTF document consisting of just a sane header and then basically the text, "Hello world!":
{\rtf1\ansi\deff0{\fonttbl
{\f0 \froman Times New Roman;}}
{\colortbl;\red255\green0\blue0;\red0\green0\blue255;}
{\pard
Hello world!
\par}
}
The RTF::Writer documentation comes with a list of basic escape codes that are just about all we need to format our lexicon. The notables are:
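\b for bold
\i for italic
\f2 to switch to font #2 (i.e., the font numbered 2 in this document's font table; note that the sample output above numbers its first font \f0)
\fs40 to switch text size to 20-point (the number counts half-points, so 40 means 20 points)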
RTF::Writer's interface is designed so that normal text passed to it will get escaped before being written to the RTF output file, and clearly you don't want that to happen to these codes -- you want the \b to be written as is, not escaped so that it'd show a literal backslash and a literal b in the document. To signal this to the RTF::Writer interface, you pass references to these strings, like so:
$rtf->paragraph( \'\i', "Hello world!" );
You can also limit the effect of a code by wrapping it in an arrayref, i.e., with [code, text], like so:
$rtf->paragraph(
    "And ",
    [ \'\i', "Hello world!" ],
    " is what I say."
);
That'll produce a document saying: And Hello world! is what I say.
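To see why the distinction between plain strings, string references, and arrayrefs matters, here is a toy renderer that follows the same three conventions. It is emphatically not RTF::Writer's actual implementation: rtf_escape and render are hypothetical names, and this sketch only escapes backslashes and braces, leaving aside non-ASCII characters and everything else a real RTF generator has to handle.

```perl
use strict;
use warnings;

# Escape plain text so that RTF treats it as literal characters:
# backslashes and braces each get a backslash in front of them.
sub rtf_escape {
    my ($text) = @_;
    $text =~ s/([\\{}])/\\$1/g;
    return $text;
}

# Render a list of items using the three conventions described above:
#   plain string -> text, escaped
#   scalar ref   -> a code, written as is (plus a space to end the code)
#   array ref    -> a {...} group, which limits the effect of its codes
sub render {
    my @items = @_;
    my $out   = '';
    foreach my $item (@items) {
        if    ( ref($item) eq 'SCALAR' ) { $out .= $$item . ' ' }
        elsif ( ref($item) eq 'ARRAY' )  { $out .= '{' . render(@$item) . '}' }
        else                             { $out .= rtf_escape($item) }
    }
    return $out;
}

print render( "And ", [ \'\i', "Hello world!" ], " is what I say." ), "\n";
# prints: And {\i Hello world!} is what I say.
```

The braces are what keep the italics from leaking into the rest of the paragraph: in RTF, formatting changes made inside a group end when the group closes.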
That's just about all the RTF we'd need to know to produce some simple lexicon output. We can exercise this with some literal text:
use RTF::Writer;
my $rtf = RTF::Writer->new_to_file( "lex.rtf" );
$rtf->prolog(); # sets up sane defaults
$rtf->paragraph(
    [ \'\b', "tlak'aláang: " ],
    [ \'\b\i', "n." ],
    " the shelter of a tree"
);
$rtf->paragraph(
    [ \'\b', "anáa: " ],
    [ \'\b\i', "adv." ],
    " inside a house; at home"
);
$rtf->close;
And that gets us something very close to the kind of formatting you'd find in a typical fancy dictionary:

Of course, we'd like to tweak spacing and fonts a bit, but that can be left for later as just minor additions to the code. Knowing just as much as we do now, we can see the output code taking shape. It would be something like:
foreach my $entry (...) {
    ...
    $rtf->paragraph(
        [ \'\b', $headword, ": " ],
        [ \'\b\i', $part_of_speech ],
        " ", $english,
        # ...and something to drop in the example sentences, if any...
    );
}
In fact, we can already cobble this together with our earlier input-reading code, to make a clunky but working prototype:
use strict;
use Text::Shoebox::Lexicon;
my $lex = Text::Shoebox::Lexicon->read_file( "haida.sf" );
use RTF::Writer;
my $rtf = RTF::Writer->new_to_file( "lex.rtf" );
$rtf->prolog(); # sets up sane defaults
foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
    $rtf->paragraph(
        [ \'\b', $e{'hw'} || "?hw?", ": " ],
        [ \'\b\i', $e{'pos'} || "?pos?" ],
        " ", $e{'engl'} || "?english?"
    );
}
$rtf->close;
And that produces this:

Now, sure, the entries aren't in alphabetical order, we see "noun" instead of "n.", and the example sentences aren't in there yet. But consider that with not even twenty lines of Perl, we've got a working dictionary renderer. It's downhill from here.
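Two of those loose ends are easy to picture already. The following is only a sketch under stated assumptions, not necessarily how the finished program does it: the abbreviation table is hypothetical ("n." and "adv." come from the hand-written sample earlier, while "v." is my guess for "verb"), and the example field is assumed to always use a single "|" to separate the Haida sentence from its translation. Alphabetical ordering is left out, since sorting Haida headwords properly takes more than a plain sort.

```perl
use strict;
use warnings;

# Hypothetical abbreviation table for the \pos field.
my %pos_abbrev = (
    noun   => 'n.',
    adverb => 'adv.',
    verb   => 'v.',     # assumed abbreviation
);

# Return the abbreviation if we know one, else the value unchanged.
sub abbreviate_pos {
    my ($pos) = @_;
    return exists $pos_abbrev{$pos} ? $pos_abbrev{$pos} : $pos;
}

# Split an \ex field of the form "Haida sentence | English translation"
# into its two halves, trimming the whitespace around the bar.
sub split_example {
    my ($ex) = @_;
    my ( $haida, $english ) = split /\s*\|\s*/, $ex, 2;
    return ( $haida, $english );
}

my ( $haida, $english ) =
    split_example("Dáayaangwaay hal dagiislánggan. | He was waving a flag.");

print abbreviate_pos('noun'), "\n";
print "$haida = $english\n";
```

In the prototype above, abbreviate_pos could wrap $e{'pos'} inside the paragraph call, and the two halves of each example could be passed along as extra items, say with the Haida half in italics.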