NAME

box-stuffer - database intermediary for MHonArc archives


VERSION

This file documents box-stuffer version 0.9.


SYNOPSIS

 # box-stuffer takes one or more MHonArc OUTDIR directories as arguments,
 # each containing a MHonArc database file (.mhonarc.db).

 # process a single mhonarc archive
 $ perl box-stuffer.pl ~/html/lists/widget-discuss/1999/12/

 # process a year of archives
 $ perl box-stuffer.pl ~/html/lists/widget-discuss/1999/[0-1][0-9]/

 # process a decade
 $ perl box-stuffer.pl ~/html/lists/widget-discuss/199*/[0-1][0-9]/


SHORT DESCRIPTION

MHonArc is an immensely handy tool to convert mail to browsable HTML, operating on discrete temporal chunks of mail called archives. Individual MHonArc archives do not know anything about each other, which complicates certain desirable feats of informational prestidigitation.

box-stuffer raids the per-archive MHonArc database file to populate a single, shared repository of message attributes within a SQL DBMS. Information stored includes each message's subject, author, date, RFC822 Message-ID, and the full path to its HTML representation on disk. (Message bodies are not retrieved, nor is this feature planned.)

A single store of message meta-data is potentially useful for a variety of purposes. For example, you can:

box-stuffer 1.0 will be release-quality, and all kinds of things may change between now and then.


REQUIREMENTS

As this tool is still in development, it is not yet intended for general consumption. The only people expected to be interested in this version are those whose need for a meta-index has compelled them to build or contemplate similar solutions.

That said, box-stuffer requires:

The Zeroth Step to Enlightenment

This section describes features not yet implemented:

Three Steps to Enlightenment

Getting up and running requires an understanding of three ideas: namespaces, session-tracking, and collision behavior.

Namespaces

If you archive several lists on a given subject, you're going to see single messages sent to multiple lists. box-stuffer can't tolerate duplicates -- the second message to be added to the database with a given Message-ID would be treated as an attempt to update (and replace) the first. Namespaces solve this problem, by optionally creating named compartments for Message-IDs which would otherwise conflict.

For example, if a single message is cc'd to both widget-announce and widget-discuss, and neither list has a namespace defined (or they share a common ``Widgets'' namespace), then their Message-IDs will conflict, and determining which message will survive in the database is just a matter of the Collision Behavior setting (see below).

If you intend for both copies of the message to exist independently within the database -- which makes sense, as they were received independently and are part of different thread structures -- you're going to have to provide each list with a unique namespace. ``widget-announce'' and ``widget-discuss'' are the obvious choices, mirroring reality, but you might have reason to define namespaces by the year, or spanning more than just a single list.
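The effect of a namespace can be pictured as widening the key under which a message is stored. The following is only a minimal Python sketch, not box-stuffer's own code: the `store` helper, the dictionary standing in for the database, and the Message-ID and paths are all invented for illustration.

```python
# Hypothetical in-memory stand-in for the message store, keyed the same
# way as the primary key in the Data Model: (Message-ID, namespace).
def store(db, message_id, namespace, path):
    """Store a message; return True if the key was already taken."""
    key = (message_id, namespace)
    collided = key in db
    db[key] = path  # this sketch simply replaces; see Collision Behavior
    return collided

msg_id = "<199912011200.AA01@example.org>"  # invented Message-ID

# Both lists share the "Widgets" namespace: the second copy collides.
shared = {}
store(shared, msg_id, "Widgets", "widget-announce/1999/12/msg00001.html")
shared_collided = store(shared, msg_id, "Widgets",
                        "widget-discuss/1999/12/msg00042.html")

# Each list has its own namespace: both copies coexist.
separate = {}
store(separate, msg_id, "widget-announce",
      "widget-announce/1999/12/msg00001.html")
separate_collided = store(separate, msg_id, "widget-discuss",
                          "widget-discuss/1999/12/msg00042.html")

print(shared_collided, len(shared))      # True 1
print(separate_collided, len(separate))  # False 2
```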

Namespaces are applied to MHonArc archives, and are implemented as custom MHonArc resource variables. Specifying a namespace for a given archive requires that you modify the mechanism by which MHonArc is called on your system. For example, instead of

 mhonarc file -outdir ~/html/lists/widgets/

you might use

 mhonarc file -definevar NAMESPACE='widget-announce' -outdir ~/html/lists/widgets/

For more information on custom resources, see

 http://www.mhonarc.org/MHonArc/doc/resources/definevar.html
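If MHonArc is invoked from a wrapper script on your system, the namespace can be added where the command line is assembled. This is a rough Python sketch under that assumption; the `mhonarc_argv` helper is hypothetical, and it uses only the options shown in the example above.

```python
import os

def mhonarc_argv(mail_file, namespace, outdir):
    """Build the argument list for one namespaced MHonArc run."""
    return [
        "mhonarc", mail_file,
        "-definevar", f"NAMESPACE={namespace}",
        # Expand "~" here, since no shell is involved to do it for us.
        "-outdir", os.path.expanduser(outdir),
    ]

argv = mhonarc_argv("file", "widget-announce", "~/html/lists/widgets/")
print(argv)
```

The resulting list could be handed to `subprocess.run(argv, check=True)` once MHonArc is on your PATH.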

Session Tracking

box-stuffer's session-tracking mechanism tags all messages collected from a given archive with a unique identifier, effectively labeling the batch in which messages were inserted.

Archive maintainers will regularly want to re-scan a MHonArc archive which has already been partially or completely committed by box-stuffer. This is by far the most common way for box-stuffer to encounter what looks like a duplicate message (one whose Message-ID and namespace are already present in the database).

We assume that if a run generates an improbably large number of bounce warnings, all against messages that were entered under a common session tag, then the messages currently being added are most likely a more recent version of some previously registered archive. (If multiple archives are being processed en masse, a unique session tag is generated and applied to each archive.)

So every time an incoming message bounces against a message already in the DB, box-stuffer notes the session tag of the message sitting in the database. Once a threshold value (the default is five) of incoming messages bounce off preexisting messages belonging to the same session, two things will happen:

  1. All previously stored messages belonging to the session tag whose threshold was met are deleted.

  2. The pending bounced messages, which counted toward the threshold, are immediately inserted into the database.

Session tracking also does the right thing if you ever remove messages from a MHonArc archive. The next time you run box-stuffer on that archive, as soon as the threshold is met, all previously stored messages will be pulled, and the only messages to be re-inserted will be those currently reflected in the archive.
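The threshold mechanism described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than box-stuffer's actual implementation: the `stuff_archive` function, the dictionary standing in for the database, and the session tags "s0", "s1" and "s2" are all hypothetical.

```python
from collections import defaultdict

def stuff_archive(db, incoming, session, threshold=5):
    """db maps (message_id, namespace) -> session tag.
    incoming lists the (message_id, namespace) keys of one archive.
    Returns the bounces that never met a threshold."""
    pending = defaultdict(list)  # old session tag -> bounced keys
    for key in incoming:
        old_session = db.get(key)
        if old_session is None:
            db[key] = session  # no conflict: plain insert
            continue
        pending[old_session].append(key)  # bounce: note whose message we hit
        if len(pending[old_session]) >= threshold:
            # 1. Delete everything stored under the old session tag.
            for stale in [k for k, s in db.items() if s == old_session]:
                del db[stale]
            # 2. Insert the bounced messages that met the threshold.
            for bounced in pending.pop(old_session):
                db[bounced] = session
    # Whatever is left is a candidate for the Collision Behavior setting.
    return [key for keys in pending.values() for key in keys]

ns = "widget-discuss"
# Session "s1" stored eight messages; two have since left the archive.
db = {(f"<m{i}@example.org>", ns): "s1" for i in range(8)}
db[("<stray@example.org>", ns)] = "s0"  # unrelated message, other session
incoming = [(f"<m{i}@example.org>", ns) for i in range(6)]
incoming += [("<m8@example.org>", ns), ("<stray@example.org>", ns)]

leftovers = stuff_archive(db, incoming, "s2")
print(leftovers)
print(sorted(set(db.values())))
```

After the run, every surviving message from the re-scanned archive carries the new tag, the two removed messages are gone, and the single stray bounce is returned for end-of-run handling.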

Collision Behavior

Once every ten or twenty thousand messages you might come across a genuine, unintended duplicate Message-ID. Because such messages never accumulate enough bounces to meet a session threshold, box-stuffer sets them aside and handles them at the end of the run.

Exactly what happens to these wayward messages is defined under the Collision Behavior setting. There are three options:

Data Model

These tables need to be created manually before box-stuffer is run.

 CREATE TABLE M2H_messages (
   date datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
   filename varchar(80) DEFAULT '' NOT NULL,
   message_id varchar(100) DEFAULT '' NOT NULL,
   subject varchar(80) DEFAULT '' NOT NULL,
   author_id int(11) DEFAULT '0' NOT NULL,
   namespace_id int(11) NOT NULL,
   session_id int(11) NOT NULL,
   PRIMARY KEY (message_id, namespace_id),
   KEY messages (subject),
   KEY messages1 (author_id),
   KEY messages2 (namespace_id),
   KEY messages3 (session_id),
   KEY messages4 (message_id)
 );

 CREATE TABLE M2H_authors (
   author_id int(11) NOT NULL auto_increment,
   author_name varchar(80) DEFAULT '' NOT NULL,
   author_email varchar(80) DEFAULT '' NOT NULL,
   PRIMARY KEY (author_id),
   KEY authors (author_name, author_email),
   UNIQUE KEY authors1 (author_email)
 );

 CREATE TABLE M2H_namespaces (
   namespace_id int(11) NOT NULL auto_increment,
   namespace_name varchar(40) NOT NULL,
   PRIMARY KEY (namespace_id),
   UNIQUE KEY namespaces (namespace_name)
 );

 CREATE TABLE M2H_sessions (
   session_id int(11) NOT NULL auto_increment,
   session_name varchar(60) NOT NULL,
   PRIMARY KEY (session_id),
   UNIQUE KEY sessions (session_name)
 );

(does M2H_authors.author_email really need to be explicitly uniqued?)
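To give a feel for how the tables fit together, here is a small Python sketch that lists the messages in one namespace. It makes several assumptions: an in-memory SQLite database stands in for your DBMS, the column types are simplified and the secondary indexes and M2H_sessions table are omitted, and every row of sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified SQLite rendition of three of the tables above.
conn.executescript("""
CREATE TABLE M2H_authors (
  author_id INTEGER PRIMARY KEY,
  author_name TEXT NOT NULL DEFAULT '',
  author_email TEXT NOT NULL DEFAULT '' UNIQUE
);
CREATE TABLE M2H_namespaces (
  namespace_id INTEGER PRIMARY KEY,
  namespace_name TEXT NOT NULL UNIQUE
);
CREATE TABLE M2H_messages (
  date TEXT NOT NULL,
  filename TEXT NOT NULL,
  message_id TEXT NOT NULL,
  subject TEXT NOT NULL,
  author_id INTEGER NOT NULL,
  namespace_id INTEGER NOT NULL,
  session_id INTEGER NOT NULL,
  PRIMARY KEY (message_id, namespace_id)
);
""")

# Invented sample data: one message cc'd to two lists, plus a reply.
conn.execute("INSERT INTO M2H_authors VALUES (1, 'A. Author', 'author@example.org')")
conn.executemany("INSERT INTO M2H_namespaces VALUES (?, ?)",
                 [(1, "widget-announce"), (2, "widget-discuss")])
conn.executemany("INSERT INTO M2H_messages VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("1999-12-01 12:00:00", "/lists/widget-announce/1999/12/msg00001.html",
     "<a1@example.org>", "Widget 2.0 released", 1, 1, 1),
    ("1999-12-01 12:00:00", "/lists/widget-discuss/1999/12/msg00042.html",
     "<a1@example.org>", "Widget 2.0 released", 1, 2, 2),
    ("1999-12-02 09:30:00", "/lists/widget-discuss/1999/12/msg00043.html",
     "<b2@example.org>", "Re: Widget 2.0 released", 1, 2, 2),
])

# Date, author, subject and HTML path for everything in one namespace.
rows = conn.execute("""
  SELECT m.date, a.author_name, m.subject, m.filename
  FROM M2H_messages m
  JOIN M2H_authors a ON a.author_id = m.author_id
  JOIN M2H_namespaces n ON n.namespace_id = m.namespace_id
  WHERE n.namespace_name = ?
  ORDER BY m.date
""", ("widget-discuss",)).fetchall()

for row in rows:
    print(row)
```

Note that the same Message-ID appears twice without conflict, because the primary key covers both message_id and namespace_id.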

Known Problems

LICENSE

box-stuffer is made available under the GPL.

 http://www.opensource.org/licenses/gpl-license.html

 4/9/00
 Nat Irons
 lexical@bumppo.net