box-stuffer
- database intermediary for MHonArc archives
This file documents box-stuffer version 0.9
    # box-stuffer expects as arguments one or more of MHonArc's OUTDIR
    # directories, each containing a MHonArc database file (.mhonarc.db).

    # process a single mhonarc archive
    $ perl box-stuffer.pl ~/html/lists/widget-discuss/1999/12/

    # process a year of archives
    $ perl box-stuffer.pl ~/html/lists/widget-discuss/1999/[0-1][0-9]/

    # process a decade
    $ perl box-stuffer.pl ~/html/lists/widget-discuss/199*/[0-1][0-9]/
MHonArc is an immensely handy tool to convert mail to browsable HTML, operating on discrete temporal chunks of mail called archives. Individual MHonArc archives do not know anything about each other, which complicates certain desirable feats of informational prestidigitation.
box-stuffer raids the per-archive MHonArc database file to populate a single, shared repository of message attributes within a SQL DBMS. Information stored includes the message's subject, author, date, RFC822 Message-ID, and full path to HTML representation on disk. (Message bodies are not retrieved, nor is this feature planned.)
A single store of message meta-data is potentially useful for a variety of purposes. For example, you can:
Banish the problem of individual message URLs changing after a MHonArc archive rebuild, by creating a permanent link scheme based on the (permanent) Message-ID, which is looked up to find the (shifting) filename
Allow searches across the whole or a subset of your MHonArc archives, whether by author (a message's originating address and the associated name, if any, are treated separately) or by text appearing in any message's subject, or by date
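The permanent-link idea can be sketched in a few lines of Perl. This is an illustration only: the %repository hash stands in for the SQL table, and the /msgid/ URL layout and resolve_permalink() are hypothetical names, not part of box-stuffer.

```perl
use strict;
use warnings;

# Stand-in for the shared repository: Message-ID => current HTML file.
# A real lookup would query the database instead.
my %repository = (
    'abc123@example.com' => '/lists/widget-discuss/1999/12/msg00042.html',
);

# Resolve a permanent link such as /msgid/abc123%40example.com to wherever
# the message lives right now. Returns undef for unknown links.
sub resolve_permalink {
    my ($path) = @_;
    my ($id) = $path =~ m{^/msgid/(.+)$} or return;
    $id =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;   # undo URL escaping
    return $repository{$id};
}

print resolve_permalink('/msgid/abc123%40example.com'), "\n";
```

After an archive rebuild only the stored filename changes; the link built from the Message-ID stays the same.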
A future version of box-stuffer is expected to be able to track threads by their component messages, which opens up the possibility of some surpassingly nifty and heretofore very tricky effects.
box-stuffer 1.0 will be release-quality, and all kinds of things may change between now and then.
As this tool is still in development, it is not yet intended for general consumption. The only people expected to be interested in this version are those whose need for a meta-index has compelled them to build or contemplate similar solutions.
That said, box-stuffer requires:
Perl 5.004 or later
MHonArc 2.4.0 or later
Digest::MD5 Perl module
When the MD5 module is available, MHonArc 2.4.0 or better generates unique, reliable Message-IDs for messages which do not come so equipped (such as messages culled from a digest). box-stuffer can't work reliably on messages without Message-IDs.
DBI Perl module
A DBI-compliant SQL server (such as MySQL, PostgreSQL, or Oracle)
You will need regular access to a database account with write access (for updates to the data) and read access (for queries). At least once, you'll also need an account with full access, to create the tables.
This version of box-stuffer has been tested only with MySQL.
Text::Template Perl module
http://search.cpan.org/search?dist=Text-Template
Required by the sample CGI.
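To illustrate the Digest::MD5 requirement above: a message lacking a Message-ID needs a stand-in that comes out the same on every rebuild. The sketch below is not MHonArc's actual recipe, which isn't described in this file. The message_id_for() helper and the synthetic.invalid suffix are made up for illustration, and hashing the whole header block is an assumption.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: return the Message-ID found in a raw header block,
# or synthesize a stable stand-in from an MD5 digest of the headers.
# Assumption: the whole header block is hashed; MHonArc may do otherwise.
sub message_id_for {
    my ($raw_headers) = @_;
    if ($raw_headers =~ /^Message-ID:\s*<([^>\s]+)>/mi) {
        return $1;
    }
    return md5_hex($raw_headers) . '@synthetic.invalid';   # made-up domain
}

my $with_id    = "From: a\@example.com\nMessage-ID: <abc123\@example.com>\nSubject: hi\n";
my $without_id = "From: a\@example.com\nSubject: digest excerpt\n";

print message_id_for($with_id), "\n";      # the ID found in the headers
print message_id_for($without_id), "\n";   # 32 hex digits plus the made-up suffix
```

Because the digest depends only on the message's own headers, rescanning the same message yields the same identifier.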
This section describes features not yet implemented:
Collision behavior isn't there yet. Right now, if box-stuffer finishes a run with bounced messages pending, it will merely print their vital statistics to STDOUT.
Getting up and running requires an understanding of three ideas: namespaces, session-tracking, and collision behavior.
If you archive several lists on a given subject, you're going to see single messages sent to multiple lists. box-stuffer can't tolerate duplicates -- the second message to be added to the database with a given Message-ID would be treated as an attempt to update (and replace) the first. Namespaces solve this problem, by optionally creating named compartments for Message-IDs which would otherwise conflict.
For example, if a single message is cc'd to both widget-announce and widget-discuss, and neither list has a namespace defined (or they share a common ``Widgets'' namespace), then their Message-IDs will conflict, and determining which message will survive in the database is just a matter of the Collision Behavior setting (see below).
If you intend for both copies of the message to exist independently within the database -- which makes sense, as they were received independently and are part of different thread structures -- you're going to have to provide each list with a unique namespace. ``widget-announce'' and ``widget-discuss'' are the obvious choices, mirroring reality, but you might have reason to define namespaces by the year, or spanning more than just a single list.
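The effect of namespaces on a cross-posted message can be sketched with an in-memory hash keyed on the namespace and Message-ID together. The %store hash and add_message() are hypothetical stand-ins for the database and its insert logic; the filenames are invented.

```perl
use strict;
use warnings;

# In-memory stand-in for the message table, keyed on (namespace, Message-ID).
my %store;

# Returns 'inserted' for a new key, 'collision' if the key is already taken.
sub add_message {
    my ($namespace, $message_id, $filename) = @_;
    my $key = join "\0", $namespace, $message_id;   # NUL keeps the two parts apart
    return 'collision' if exists $store{$key};
    $store{$key} = $filename;
    return 'inserted';
}

my $id = 'abc123@example.com';

# Separate namespaces: both copies of the cross-posted message coexist.
print add_message('widget-announce', $id, 'announce/msg00012.html'), "\n";   # inserted
print add_message('widget-discuss',  $id, 'discuss/msg00407.html'),  "\n";   # inserted

# Shared namespace: the second copy collides with the first.
print add_message('Widgets', $id, 'announce/msg00012.html'), "\n";   # inserted
print add_message('Widgets', $id, 'discuss/msg00407.html'),  "\n";   # collision
```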
Namespaces are applied to MHonArc archives, and are implemented as custom MHonArc resource variables. Specifying a namespace for a given archive requires that you modify the mechanism by which MHonArc is called on your system. For example, instead of
    mhonarc file -outdir ~/html/lists/widgets/
you might use
    mhonarc file -definevar NAMESPACE='widget-announce' -outdir ~/html/lists/widgets/
For more information on custom resources, see
http://www.mhonarc.org/MHonArc/doc/resources/definevar.html
box-stuffer's session-tracking mechanism tags all messages collected from a given archive with a unique identifier, effectively labeling the batch in which messages were inserted.
Archive maintainers will regularly want to re-scan a MHonArc archive which has already been partially or completely committed by box-stuffer. This is far and away the most common way for box-stuffer to encounter what looks like a duplicate message (where another message with the same Message-ID and namespace is already present in the database).
We operate under the assumption that if a run generates an improbable number of bounce warnings, all against messages entered under a common session tag, then the messages currently being added are most likely a more recent version of a previously registered archive. (If multiple archives are being processed en masse, a unique session tag is generated and applied to each archive.)
So every time an incoming message bounces against a message already in the DB, box-stuffer notes the session tag of the message sitting in the database. Once a threshold value (the default is five) of incoming messages bounce off preexisting messages belonging to the same session, two things will happen:
all previously-stored messages belonging to the session tag whose threshold was met will be deleted
the pending bounced messages, which contributed to the accumulation of the threshold, are immediately inserted into the database.
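The threshold logic just described might look roughly like the following. This is a sketch, not box-stuffer's code: process_archive() is a hypothetical name, the database is a plain hash mapping Message-ID to session tag, and namespaces are left out to keep it short.

```perl
use strict;
use warnings;

my $THRESHOLD = 5;   # the default named above

# $db maps Message-ID => session tag. Returns the bounced Message-IDs that
# never reached a threshold and are left for the collision behavior.
sub process_archive {
    my ($db, $session, @message_ids) = @_;
    my %pending;   # old session tag => [ bounced Message-IDs ]

    for my $id (@message_ids) {
        if (!exists $db->{$id}) {
            $db->{$id} = $session;           # no collision: plain insert
            next;
        }
        my $old = $db->{$id};                # session of the stored message
        push @{ $pending{$old} }, $id;
        next if @{ $pending{$old} } < $THRESHOLD;

        # Threshold met: delete everything stored under the old session ...
        my @stale = grep { $db->{$_} eq $old } keys %$db;
        delete @{$db}{@stale};
        # ... and insert the bounced messages that were waiting on it.
        $db->{$_} = $session for @{ delete $pending{$old} };
    }

    my @leftover;
    push @leftover, @$_ for values %pending;
    return @leftover;
}

my %db;
process_archive(\%db, 'run-1', map { "msg$_" } 1 .. 8);   # first scan
# msg1 and msg2 have since been removed from the archive:
my @left = process_archive(\%db, 'run-2', map { "msg$_" } 3 .. 8);
print join(',', sort keys %db), "\n";   # msg3,msg4,msg5,msg6,msg7,msg8
print scalar(@left), "\n";              # 0
```

In the second run the fifth bounce trips the threshold, so every run-1 row is dropped and only the messages still in the archive are stored again.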
Session-tracking is useful and straightforward if you ever remove messages from a MHonArc archive. The next time you run box-stuffer on that archive, as soon as the threshold is met, all previously stored messages will be pulled, and the only messages to be re-inserted will be those currently reflected in the archive.
Once every ten or twenty thousand messages you might come across a genuine, unintended duplicate Message-ID. To box-stuffer these look like bounces too few in number to meet a session threshold, and they are held until the end of the run.
Exactly what happens to these wayward messages is defined under the Collision Behavior setting. There are three options:
Reject
Messages which already exist in the database will not be replaced, barring any mass deletions triggered by session-tracking. Any messages lingering at the end of a run, too few to trigger the session-tracking threshold, will be silently ignored.
Replace
All lingering messages at the end of a run will be silently added to the database, clobbering what came before. This is dangerous, and is only planned out of some misguided attempt at symmetry. If I don't hear that this is actively useful to someone, I'll probably remove it in a future version.
Ask
This is the default. If box-stuffer is being run interactively (which it determines by checking to see if both STDIN and STDOUT are directed to a tty) it will ask at the shell if a given pending message should take precedence over an existing message. If box-stuffer is being invoked by cron or its output is being similarly redirected, it will describe to STDOUT the new and old messages, under the assumption that the output is being mailed or otherwise delivered to the attention of the appropriate admin.
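The three settings, and the tty test used by Ask, can be sketched as follows. Since collision behavior is not implemented yet, this is purely illustrative: collision_action() and the action labels it returns are hypothetical.

```perl
use strict;
use warnings;

# Decide what to do with one lingering message. $mode is 'reject',
# 'replace', or 'ask'; $interactive says whether a person is at the terminal.
sub collision_action {
    my ($mode, $interactive) = @_;
    return 'skip'      if $mode eq 'reject';
    return 'overwrite' if $mode eq 'replace';
    return $interactive ? 'prompt' : 'report';   # 'ask', the default
}

# Interactive only when both STDIN and STDOUT are attached to a tty.
my $interactive = (-t STDIN && -t STDOUT) ? 1 : 0;

# Prints "prompt" at a shell, "report" under cron or with redirected output.
print collision_action('ask', $interactive), "\n";
```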
These tables need to be created manually before box-stuffer is run for the first time.

    CREATE TABLE M2H_messages (
      date datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
      filename varchar(80) DEFAULT '' NOT NULL,
      message_id varchar(100) DEFAULT '' NOT NULL,
      subject varchar(80) DEFAULT '' NOT NULL,
      author_id int(11) DEFAULT '0' NOT NULL,
      namespace_id int(11) NOT NULL,
      session_id int(11) NOT NULL,
      PRIMARY KEY (message_id, namespace_id),
      KEY messages (subject),
      KEY messages1 (author_id),
      KEY messages2 (namespace_id),
      KEY messages3 (session_id),
      KEY messages4 (message_id)
    );

    CREATE TABLE M2H_authors (
      author_id int(11) DEFAULT '0' NOT NULL auto_increment,
      author_name varchar(80) DEFAULT '' NOT NULL,
      author_email varchar(80) DEFAULT '' NOT NULL,
      PRIMARY KEY (author_id),
      KEY authors (author_name, author_email),
      UNIQUE KEY authors1 (author_email)
    );

    CREATE TABLE M2H_namespaces (
      namespace_id int(11) DEFAULT '0' NOT NULL auto_increment,
      namespace_name varchar(40) NOT NULL,
      PRIMARY KEY (namespace_id),
      UNIQUE KEY namespaces (namespace_name)
    );

    CREATE TABLE M2H_sessions (
      session_id int(11) DEFAULT '0' NOT NULL auto_increment,
      session_name varchar(60) NOT NULL,
      PRIMARY KEY (session_id),
      UNIQUE KEY sessions (session_name)
    );

(does M2H_authors.author_email really need to be explicitly uniqued?)
box-stuffer was developed completely under MySQL. The transition to other databases will be tricky, and will require gobs of feedback from their users. Rectifying this is a high priority.
In particular, the auto_increment bit in subtable definitions is not portable, and I don't know if the DATETIME datatype is SQL-safe.
I want to explore PostgreSQL soon, and Oracle after that, but other databases are not likely to be tested by me in the foreseeable future. Transactions will be supported in time.
I need to create a getopt interface for setting options like DSN values. It should not be necessary to edit the script directly in 1.0 final, although that may be the preferred way to prevent one's database userID and password from appearing in ps output.
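One shape the planned getopt interface could take is sketched below with Getopt::Long. Nothing here exists in box-stuffer: the option names, the default DSN, the parse_args() helper, and the BOX_STUFFER_PASSWORD environment variable are all assumptions made for the example.

```perl
use strict;
use warnings;
use Getopt::Long;

# Parse hypothetical options from an argument list. Whatever remains after
# option parsing is treated as the list of archive directories.
sub parse_args {
    local @ARGV = @_;
    my %opt = (
        dsn       => 'dbi:mysql:database=mhonarc',   # made-up default DSN
        user      => undef,
        threshold => 5,
    );
    GetOptions(\%opt, 'dsn=s', 'user=s', 'threshold=i')
        or die "usage: box-stuffer.pl [--dsn DSN] [--user NAME] [--threshold N] dir...\n";
    # Read the password from the environment so that it is never part of
    # the command line.
    $opt{password} = $ENV{BOX_STUFFER_PASSWORD};
    return (\%opt, @ARGV);
}

my ($opt, @dirs) = parse_args('--user', 'archiver', '/lists/widget-discuss/1999/12/');
print "$opt->{user} $opt->{threshold} @dirs\n";   # archiver 5 /lists/widget-discuss/1999/12/
```

Taking the password from the environment rather than from an option is one way to keep it out of the argument list that ps displays.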
This version of box-stuffer was tested with MHonArc 2.4.4 databases. There's no reason I know of that it shouldn't work back to at least 2.3.3, which is where I started using the product.
Unless someone can explain why it's not a good idea, I think I'm likely to require MHonArc 2.4.0 in code as well as documentation. Future box-stuffer versions may well get even pickier about database versions if problems are uncovered. I realize that this presents the irony of having to renumber one's archives so as to avoid renumbering one's archives; I'm thinking about how to get around this.
The date parsing regex is quite possibly error-prone; I built it up over my archive (~80,000 messages) which is disproportionately based on Mac OS mail clients. I'd like to be informed when it breaks; I may have to switch to Date::Manip.
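For readers curious what such a regex involves, here is a minimal sketch that turns a common RFC 822 Date: value into a DATETIME-style string. It is not the regex box-stuffer uses; rfc822_to_datetime() is a hypothetical name, the time zone is deliberately ignored, and the pivot for two-digit years (below 70 means 20xx) is an assumption.

```perl
use strict;
use warnings;

my %MONTH = (
    Jan => 1, Feb => 2,  Mar => 3,  Apr => 4,
    May => 5, Jun => 6,  Jul => 7,  Aug => 8,
    Sep => 9, Oct => 10, Nov => 11, Dec => 12,
);

# Convert an RFC 822 date to 'YYYY-MM-DD HH:MM:SS'. The time zone is
# ignored. Returns undef when the value does not fit the pattern.
sub rfc822_to_datetime {
    my ($date) = @_;
    my ($d, $mon, $y, $h, $m, $s) = $date =~ m{
        (?:[A-Za-z]{3},\s*)?                      # optional day of week
        (\d{1,2})\s+([A-Za-z]{3})\s+(\d{2,4})\s+  # day, month name, year
        (\d{1,2}):(\d{2})(?::(\d{2}))?            # hours, minutes, optional seconds
    }x or return;
    my $month = $MONTH{ ucfirst lc $mon } or return;
    $y += ($y < 70 ? 2000 : 1900) if $y < 100;    # assumed two-digit-year pivot
    $s = 0 unless defined $s;
    return sprintf '%04d-%02d-%02d %02d:%02d:%02d', $y, $month, $d, $h, $m, $s;
}

print rfc822_to_datetime('Thu, 9 Dec 1999 14:05:09 -0500'), "\n";   # 1999-12-09 14:05:09
```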
Everything about the format and verbosity of output is subject to change. I expect that when being run non-interactively, output will be silent.
This documentation is similarly first-cut. Please direct any comments or suggestions to <lexical@bumppo.net>, or the box-stuffer-talk mailing list.
http://lists.sourceforge.net/mailman/listinfo/box-stuffer-talk
thanks!
box-stuffer is made available under the GPL.
http://www.opensource.org/licenses/gpl-license.html
4/9/00 Nat Irons lexical@bumppo.net