Managing Rich Data Structures
by Dave Baker
February 16, 2006
If you're like me, you've written plenty of scripts that use simple text files to store snippets of data. Those scripts might have evolved over time into using several snippets of data for each item, which translates into lots and lots of little text files in a data directory somewhere.
I had read that Linux doesn't like more than a hundred or so text files per directory. I also thought about the space wasted on my hard drive, because each snippet is tiny compared to the size of a sector, and about the hassle of all those little files when making a backup. So I decided to move from snippets to a single database. Here's how I did it.
I didn't go all the way to a relational database, in part because I'm not a very proficient data-slinger. Plus I wanted to try the apparently simpler technique described in Chapter 14 of Perl Cookbook, 2nd Edition, namely the use of one of the DBM libraries. No SQL required.
Also, I didn't want to use a one-line-per-item type of plain text file, although I've had quite a bit of luck with them in other projects. That's where each line holds a list of values separated by a pipe character or some other special symbol (rather than commas). Each item might have a unique identification number in the first field, for example. Then you can read through the lines in the text file until you find the line that has the ID number you want in the first field.
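A lookup of that kind might look like the following minimal sketch. The field layout (ID, URL, headline), the sample records, and the find_record helper are all made up for illustration; the data lives in a string here so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical sample data: ID|URL|headline, one item per line.
my $sample = <<'END';
1001|http://acme.com/index.html|Quality widgets
1002|http://roadrunners-r-us.com/index.html|Fast deliveries
END

# Return the fields of the first line whose first field equals $wanted_id,
# or an empty list if no line matches.
sub find_record {
    my ($text, $wanted_id) = @_;
    # Open the string as an in-memory file so it can be read line by line.
    open my $fh, '<', \$text or die "Cannot read in-memory data: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # The pipe must be escaped: a bare | means alternation in a regex.
        my @fields = split /\|/, $line;
        return @fields if @fields && $fields[0] eq $wanted_id;
    }
    return;
}

my @record = find_record($sample, '1002');
print "URL for 1002: $record[1]\n";
```

With a real file you would open the file name instead of a reference to a string; the loop itself stays the same.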
The reason the one-line-per-item delimited text file didn't seem ideal for my project is that some of the items of data consist of text that includes line breaks. If someone inserts literal line breaks into a field, you lose the ability to easily search for particular fields by position. The example here doesn't include such fields; I've omitted the multiline data for the sake of compactness in demonstrating the MLDBM solution. Happily, there is nothing that prevents you from storing multiline data with MLDBM.
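To see why an embedded line break is a problem, consider this small sketch. The record and its field layout are hypothetical; the point is only that one item no longer maps to one line.

```perl
use strict;
use warnings;

# Hypothetical record whose third field contains a literal line break.
my $headline = "Looking for widgets?\nAcme's got 'em!";
my $record   = join '|', '1001', 'http://acme.com/index.html', $headline;

# Reading line by line now yields two "lines" for a single item.
my @lines = split /\n/, $record;
printf "One item became %d lines\n", scalar @lines;

# The second line has no ID in its first field, so a lookup by position fails.
my @fields = split /\|/, $lines[1];
print "First field of line 2: $fields[0]\n";
```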
My three kinds of data were:
- A text file that stored the target URL of an advertiser. When a reader clicks on the banner, a mod_perl script takes the reader there.
- A text file that stored the URL of the .gif or .jpg file to use as the banner.
- A text file that stored a one-line headline to display above the banner.
Each file's name indicated the date of its associated banner (the banners appear in a daily newsletter published on the Web and via email) and the type of data stored in the file. For example, url_2005_12_09.txt, gif_2005_12_09.txt, and headline_2005_12_09.txt are the three data files for the December 9, 2005 newsletter's banner.
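The naming scheme is regular enough to generate. As a rough sketch, assuming the date is already formatted as YYYY_MM_DD (the snippet_files_for helper is a hypothetical name, not part of the original scripts):

```perl
use strict;
use warnings;

# Build the three snippet file names for a given date string (YYYY_MM_DD).
# Returns a list of type => file name pairs.
sub snippet_files_for {
    my ($date) = @_;
    return map { ($_ => "${_}_$date.txt") } qw(url gif headline);
}

my %file_for = snippet_files_for('2005_12_09');
print "$_ => $file_for{$_}\n" for sort keys %file_for;
```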
Here's how I turned those three data files (multiplied by the number of banner slots sold to date and the number of banner slots sold for upcoming newsletters) into a single file.
A Hash of Hashes
First, I thought about what kind of data structure I would create. I looked at the relationships between the various text files. It became clear that I basically had a lot of hashes: each banner's data consists of a set of keys and values. I had been creating a separate text file that essentially contained the key in its name and the value in its data. Each banner's data would fit nicely into a hash having three keys and three associated values.
I thought about storing this bunch of hashes in an array, but then realized that an array of hashes would not let me access a particular day's data easily--the hashes would be in the order in which I saved them into the array, but that wouldn't translate easily to particular newsletter dates. Would $array[8] be the hash for December 8, 2005's banner? What happens in January of next year? Should I put the New Year's Day banner into $array[32]?
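A short sketch makes the drawback concrete. The date key inside each hash is an assumption added for illustration; without something like it, an array element would not even record which day it belongs to.

```perl
use strict;
use warnings;

# Array of hashes: position reflects insertion order, not the newsletter date.
my @ads = (
    { date => '2005_12_09', url => 'http://acme.com/index.html' },
    { date => '2005_12_08', url => 'http://roadrunners-r-us.com/index.html' },
);

# Finding one day's banner means scanning every element for a matching date.
my ($ad) = grep { $_->{date} eq '2005_12_08' } @ads;
print "Found by scanning: $ad->{url}\n";
```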
Hmmm. What kind of data structure associates a unique key, such as the date of a particular banner, with its value, such as the three different kinds of data and their values per banner? A hash, of course! I would create a hash of hashes.
The name I chose for the "parent" hash is %data_for_ad_on. The keys will be the dates of the ads, so the use of an ending preposition in the name of the hash leads to a more natural-reading and meaningful variable name. The key for the data for the December 8, 2005 banner will be 2005_12_08, for example, and the way to access the value associated with that key is $data_for_ad_on{'2005_12_08'}.
How could I store each day's hash--the three named kinds of data and their values--into the parent hash as the value for a particular banner's key (date)?
It's not possible to store a hash directly as the value of a key in another hash; a hash isn't a scalar value. Instead, I turned each day's hash of data keys and values into a reference to that hash. Making such references seemed a bit intimidating at first, but it turned out to be fairly easy once I felt comfortable with some new syntax rules.
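There are two common ways to get a hash reference in Perl, sketched below. The %acme_ad variable name is made up for illustration; the values come from this article's sample data.

```perl
use strict;
use warnings;

# One day's data as an ordinary named hash.
my %acme_ad = (
    url      => 'http://acme.com/index.html',
    gif      => 'http://myserver.com/banners/acme_banner.gif',
    headline => q{Looking for quality, inexpensive widgets? Acme's got 'em!},
);

my %data_for_ad_on;

# A backslash in front of the hash yields a reference, which is a scalar
# and can therefore be stored as a hash value.
$data_for_ad_on{'2005_12_09'} = \%acme_ad;

# Curly braces build an anonymous hash and return a reference to it directly.
$data_for_ad_on{'2005_12_08'} = {
    url => 'http://roadrunners-r-us.com/index.html',
};

print ref($data_for_ad_on{'2005_12_09'}), "\n";   # prints HASH
```

The anonymous form avoids inventing a separate named hash for every day.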
In Perl, this is how to create the hash of hashes (here I show only two newsletters' worth of data):
# Parent hash: each key is a newsletter date, each value is a
# reference to an anonymous hash holding that day's banner data.
my %data_for_ad_on = (
    '2005_12_09' => {
        'url'      => 'http://acme.com/index.html',
        'gif'      => 'http://myserver.com/banners/acme_banner.gif',
        'headline' => 'Looking for quality, inexpensive widgets? Acme\'s got \'em!',
    },
    '2005_12_08' => {
        'url'      => 'http://roadrunners-r-us.com/index.html',
        'gif'      => 'http://myserver.com/banners/roadrunners_banner.gif',
        'headline' => 'Looking for inexpensive deliveries? Roadrunners R Us has \'em!',
    },
);
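Once the structure exists, reading from it is a matter of supplying two keys. The following is a minimal sketch of that, using a trimmed copy of the same sample data so that it runs on its own:

```perl
use strict;
use warnings;

my %data_for_ad_on = (
    '2005_12_09' => {
        'url'      => 'http://acme.com/index.html',
        'headline' => 'Looking for quality, inexpensive widgets? Acme\'s got \'em!',
    },
    '2005_12_08' => {
        'url'      => 'http://roadrunners-r-us.com/index.html',
        'headline' => 'Looking for inexpensive deliveries? Roadrunners R Us has \'em!',
    },
);

# Two subscripts reach a single value; the arrow between them is optional.
my $url = $data_for_ad_on{'2005_12_09'}{'url'};
print "$url\n";

# Walk all banners in date order. Keys in YYYY_MM_DD form sort
# chronologically with a plain string sort.
for my $date (sort keys %data_for_ad_on) {
    my $ad = $data_for_ad_on{$date};    # hash reference for that day
    print "$date: $ad->{'headline'}\n";
}
```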