The Evolution of Perl Email Handling
by Simon Cozens
Mailbox Handling
So much for individual messages; let's move on to handling groups of messages, or folders. We've mentioned Mail::Box already, and this is truly the king of folder handling, supporting local and remote folders, editing folders, and all sorts of other things besides. To use it, we first need a Mail::Box::Manager, which is a factory object for creating Mail::Boxes.
use Mail::Box::Manager;
my $mgr = Mail::Box::Manager->new;
Next, we need to open the folder using the manager:
my $folder = $mgr->open(folder => $folder_file);
And now we can get at the individual messages as Mail::Message objects:
for ($folder->messages) {
    print $_->subject,"\n";
}
With its more minimalist approach, my favorite mailbox reader until recently was Mail::Util's read_mbox function, which takes the name of a Unix mbox file and returns a list of array references; each reference holds the lines of one message, suitable for feeding to Mail::Internet->new or similar:
use Mail::Util qw(read_mbox);
use Mail::Internet;
for (read_mbox($folder_file)) {
    my $obj = Mail::Internet->new($_);   # $_ is an array ref of lines
    print $obj->head->get("Subject"),"\n";
}
These two are both really handy, but there seemed to be room for something in between the simplicity of Mail::Util and the functionality of Mail::Box, and so the Email Project struck again with Email::Folder and Email::LocalDelivery. Email::Folder handles mbox and maildir folders, with more types planned, and has a reasonably simple interface:
my $folder = Email::Folder->new($folder_file);
for ($folder->messages) {
    print $_->header("Subject"),"\n";
}
By default it returns Email::Simple objects for the messages, but this can be changed by subclassing. For instance, if we want raw RFC2822 strings, we can do this:
package Email::Folder::Raw; use base 'Email::Folder';
sub bless_message { my ($self, $rfc2822) = @_; return $rfc2822; }
Perhaps in the future, we will change bless_message to use Email::Abstract->cast to make the representation of messages easier to select without necessarily having to subclass.
The other side of folder handling is writing to a folder, or "local delivery". Email::LocalDelivery was written to assist Email::Filter, of which more later. The problem is harder than it sounds, as it has to deal with locking, escaping mail bodies, and problems specific to the mbox and maildir formats. LocalDelivery hides all of these things beneath a simple interface:
Email::LocalDelivery->deliver($rfc2822, @mailboxes);
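To give a feel for why delivery is fiddly, here is a minimal sketch of appending a message to an mbox file by hand. It is not how Email::LocalDelivery is implemented; the function name append_to_mbox, the $sender argument, and the date format in the separator line are all assumptions made for illustration, and real delivery agents often use dot-lock files in addition to flock.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);

# Append one RFC 2822 message to an mbox file (illustrative only).
sub append_to_mbox {
    my ($path, $rfc2822, $sender) = @_;

    # A line starting with "From " would look like the start of a new
    # message, so prefix such lines with ">" (the common mbox convention).
    (my $escaped = $rfc2822) =~ s/^(From )/>$1/mg;
    $escaped .= "\n" unless $escaped =~ /\n\z/;

    open my $fh, '>>', $path or die "Cannot open $path: $!";
    flock($fh, LOCK_EX)      or die "Cannot lock $path: $!";
    seek($fh, 0, SEEK_END);  # another writer may have appended meanwhile

    # Separator line, the message itself, then a blank line.
    print {$fh} "From $sender ", scalar(gmtime), "\n", $escaped, "\n";

    close $fh or die "Cannot close $path: $!";   # closing releases the lock
}
```

Even this toy version has to think about three things at once: the lock, the "From " escaping, and the blank line between messages. Maildir avoids the locking and escaping but has its own rules about unique filenames.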
Both Email::LocalDelivery and Email::Folder use the Email::FolderType helper module to determine the type of a folder based on its filename.
Address Handling
To come down to a lower level of abstraction again, there are a number of modules for handling email addresses. The old favorite is Mail::Address. A mail address appearing in the fields of an email can be made up of several elements: the actual address, a phrase or name, and a comment. For instance:
Example user <example@example.com> (Not a real user)
Mail::Address parses these addresses, separating out the phrase and comments, allowing you to get at the individual components:
for (Mail::Address->parse($from_line)) {
    print $_->name, "\t", $_->address, "\n";
}
Unfortunately, like many of the mail modules, it tries really hard to be helpful.
my ($addr) = Mail::Address->parse('"eBay, Inc." <support@ebay.com>');
print $addr->name; # Inc. eBay
Which, while better than the "Inc Ebay" that previous versions would produce, isn't really acceptable. Casey West joined our merry band of renegades and produced Email::Address. It has exactly the same interface as Mail::Address, but it works, and is about two to three times as fast.
One thing we often want to do when handling mail addresses is to make sure that they're valid. If, for instance, a user is registering for content at a web site, we need to check that the address they've given is capable of receiving mail. Email::Valid, the original inhabitant of the Email:: namespace before our bunch of disaffected squatters moved in, does just this. In its simplest use, we can say:
if (not Email::Valid->address('test@example.com')) {
    die "Not a valid address";
}
You can turn on additional checks, such as ensuring there's a valid MX record for the domain, correcting common AOL and CompuServe addressing mistakes, and so on:
if (not Email::Valid->address(-address => 'test@example.com',
                              -mxcheck => 1)) {
    die "Not a valid address";
}
Mail Munging
Once we have our emails, what are we going to do with them? A lot of what I've been looking at has been textual analysis of email, and there are three modules that particularly help with this.
The first is Text::Quoted; it takes the body text of an email message, or any other text really, and tries to figure out which parts of the message are quotations from other messages. It then separates these out into a nested data structure. For instance, if we have
$message = <<EOF;
> foo
> # Bar
> baz
quux
EOF
Then running extract($message) will return a data structure like this:
[
    [
        { text => 'foo', quoter => '>', raw => '> foo' },
        [
            { text => 'Bar', quoter => '> #', raw => '> # Bar' }
        ],
        { text => 'baz', quoter => '>', raw => '> baz' }
    ],
    { empty => 1 },
    { text => 'quux', quoter => '', raw => 'quux' }
];
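A structure like this is easiest to consume with a small recursive walk. The sketch below does not call Text::Quoted at all; it works on a hand-typed copy of the structure above, and the helper name quote_depths is my own invention. It assumes only what the example shows: hash references are lines, and each nested array reference is one quoting level deeper.

```perl
use strict;
use warnings;

# Flatten a Text::Quoted-style structure into [depth, text] pairs.
sub quote_depths {
    my ($node, $depth) = @_;
    $depth = 0 unless defined $depth;
    my @out;
    for my $item (@$node) {
        if (ref $item eq 'ARRAY') {
            # A nested array is one quoting level deeper.
            push @out, quote_depths($item, $depth + 1);
        }
        elsif ($item->{empty}) {
            push @out, [$depth, ''];      # blank separator line
        }
        else {
            push @out, [$depth, $item->{text}];
        }
    }
    return @out;
}

# The structure from the example, typed in by hand.
my $structure = [
    [
        { text => 'foo', quoter => '>', raw => '> foo' },
        [
            { text => 'Bar', quoter => '> #', raw => '> # Bar' },
        ],
        { text => 'baz', quoter => '>', raw => '> baz' },
    ],
    { empty => 1 },
    { text => 'quux', quoter => '', raw => 'quux' },
];

for my $pair (quote_depths($structure)) {
    my ($depth, $text) = @$pair;
    print '  ' x $depth, "[level $depth] $text\n";
}
```

Here foo and baz come out at level 1, Bar at level 2, and quux at level 0; a display layer could map each level to a color instead of an indent.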
This is extremely useful for highlighting different levels of quoting in different colors when displaying a message. A similar concept is Text::Original, which looks for the start of original, non-quoted content in an email. It knows about many kinds of attribution lines, so with:
$message = <<EOF;
You wrote:
> Why are there so many different mail modules?
There's more than one way to do it! Different modules have different
focuses, and operate at different levels; some lower, some higher.
EOF
first_sentence($message) returns "There's more than one way to do it!". The Mariachi mailing list archiver uses this technique to give a "prompt" for each message in a thread.
And speaking of threads, the Mail::Thread module is a Perl implementation of Jamie Zawinski's mail threading algorithm, as used by Mozilla as well as many other mail clients since then. It's also used by Mariachi, and has recently been updated to use Email::Abstract to handle any kind of mail object you want to throw at it:
my $threader = Mail::Thread->new(@mails);
$threader->thread; # Compute threads
for ($threader->rootset) { # Original mails in a thread
    dump_thread($_);
}
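The dump_thread function is left undefined above; one possible shape for it is sketched here. It assumes the thread containers respond to message, messageid, child (first reply), and next (next sibling), which is my reading of the Mail::Thread::Container interface, so check the module's documentation before relying on those names.

```perl
use strict;
use warnings;

# Print a thread as an indented tree of message IDs.
# $container is assumed to behave like a Mail::Thread::Container.
sub dump_thread {
    my ($container, $depth) = @_;
    $depth ||= 0;

    # Walk siblings in a loop and recurse only into replies.
    for (my $c = $container; $c; $c = $c->next) {
        # Containers can be placeholders for mails we never saw.
        my $label = $c->message ? $c->messageid : '[message not available]';
        print '  ' x $depth, $label, "\n";
        dump_thread($c->child, $depth + 1) if $c->child;
    }
}
```

Printing a subject rather than a message ID would be friendlier, but how to get at the subject depends on which mail class the messages belong to.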
Mail Filtering
The classic Perl mail filtering tool is Mail::Audit, and I've written articles here about using Mail::Audit on its own (http://www.perl.com/pub/a/2001/07/17/mailfiltering.html) and using it in conjunction with Mail::SpamAssassin (http://www.perl.com/pub/a/2002/03/06/spam.html).
We've mentioned Mail::ListDetector a couple of times already, and I use this with Mail::Audit to do most of the filtering automatically for me. The Mail::Audit::List plugin uses ListDetector to look for mailing list headers in a message; these are things like List-Id, X-Mailman-Version, and the like, which identify a mail as having come through a mailing list. This means I can filter out all mailing list posts to their own folders, like so:
my $list = Mail::ListDetector->new($obj);
if ($list) {
    my $name = $list->listname;
    $item->accept("mail/$name.-$date");
}
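To show roughly what "looking for mailing list headers" involves, here is a deliberately naive sketch that only understands List-Id. It is not Mail::ListDetector's logic, which knows about many list managers; the function name guess_list_name and the choice to keep only the first dotted label are assumptions for illustration.

```perl
use strict;
use warnings;

# Guess a mailing list name from raw RFC 2822 text via the List-Id header.
# Returns undef when no usable List-Id header is found.
sub guess_list_name {
    my ($rfc2822) = @_;

    # The headers end at the first blank line.
    my ($headers) = split /\r?\n\r?\n/, $rfc2822, 2;
    return unless defined $headers;

    # Unfold continuation lines (a line break followed by whitespace).
    $headers =~ s/\r?\n[ \t]+/ /g;

    # Typical form: List-Id: Some description <name.example.org>
    if ($headers =~ /^List-Id:.*<([^>]+)>/mi) {
        my ($name) = split /\./, $1;   # keep the first label only
        return $name;
    }
    return;
}
```

A message carrying "List-Id: Perl 5 Porters <perl5-porters.perl.org>" would yield perl5-porters, which is the sort of value you would use to build a folder name.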
However, Mail::Audit itself is getting a little long in the tooth, and so new installations are encouraged to use the Email Project's Email::Filter instead; it has the same interface for the most part, although not all of the same features, and it uses the new-fangled Email::Simple mail representation for speed and cleanliness.
Mail Mining
Finally, the most high-level thing I do with email is develop frameworks to automatically categorize, organize, and index mail into a database, and attempt to analyze it for interesting nuggets of information.
My first module for this was Mail::Miner, which consists of three major parts. The first part takes an email, removes any attachments, and stores the lot in a database. The second looks over the email and runs a set of "Recogniser" modules on it; these find addresses, phone numbers, keywords and phrases, and so on, and store them in a separate database table. The third part is a command-line tool to query the database for mail and information.
For instance, if I need to find Tim O'Reilly's postal address, I ask the query tool, mm, to find addresses in emails from him:
% mm --from "Tim O" --address
Address found in message 1835 from "Tim O'Reilly" <tim@oreilly.com>:
Tim O'Reilly @ O'Reilly & Associates, Inc.
1005 Gravenstein Highway North, Sebastopol, CA 95472
To retrieve the whole email, I'd say
% mm --id 1835
And if it originally contained an attachment, we'd see something like this as part of the email:
[ text/xml attachment something.xml detached - use
mm --detach 208
to recover ]
I paste that middle line mm --detach 208 into a shell, and hey presto, something.xml is written to disk.
Now Mail::Miner is all very well, but having the three ideas in one tight package--filing mail, mining mail, and interfacing to the database--makes it difficult to develop and extend any one of them. And of course, it uses the old-school Mail:: modules.
This brings us to our final module on the mail modules tour, and the most recently released: Email::Store. This is a framework, based on Class::DBI, for storing email in a database and indexing it in various ways:
use Email::Store 'dbi:SQLite:mail.db';
Email::Store->setup;
Email::Store::Mail->store($rfc2822);
And then later...
my ($name) = Email::Store::Name->search( name => "Simon Cozens" );
my @mails_from_simon = $name->addressings( role => "From" )->mails;
It can be used to build a mailing list archive tool such as Mariachi, or a data mining setup like Mail::Miner. It's still very much in development, and makes use of a new idea in module extensibility.
I'll have more to report once we've written the first mail archiving and searching tool using Email::Store, which I'm going to be doing as a new interface to the Perl mailing lists at perl.org.
Conclusion
We've looked at the major modules for mail handling on CPAN, and there are many more. I am obviously biased towards those which I wrote, and particularly the Perl Email Project modules in the Email::* namespace. These modules are specifically designed to be simple, efficient, and correct, but may not always be a good substitute for the more thorough Mail::* modules, particularly Mail::Box. However, I hope you're now a little more aware of the diversity of mail handling tools out there, and know where to look next time you need to manipulate email with Perl.