by Clint Adams
Import upstream version 1.9 |
\" t
.TH MAILFOOT 1 "Bayesian Text Classification Tools" "Version @VERSION@" ""
.SH NAME
mailfoot \- a full-online-ordered-training simulator for use with dbacl.
.SH SYNOPSIS
.HP
.B mailfoot
.I command
[
.I command_arguments
]
.SH DESCRIPTION
.PP
.B mailfoot
automates the task of testing email filtering and classification
programs such as
.BR dbacl (1).
Given a set of categorized documents, mailfoot initiates test runs
to estimate the classification errors and thereby permit fine tuning
of the parameters of the classifier.
.PP
Full Online Ordered Training (FOOT) is a learning method for email classifiers where
each incoming email is learned as soon as it arrives, thereby always keeping category
descriptions up to date for the next classification.
This directly models the way that some email classifiers are used in practice.
.PP
FOOT's error rates depend directly on the order in which emails are seen.
A small change in ordering, as might happen due to networking delays,
can have an impact on the number of misclassifications.
Consequently,
.B mailfoot
does not give meaningful results unless the sample emails are chosen carefully.
However, as this method is commonly used by spam filters, its error rates are still worth
computing to permit comparisons. Other methods (see
.BR mailcross (1),
.BR mailtoe (1))
attempt to capture the behaviour of classification errors in other ways.
.PP
To improve and stabilize the error rate calculation,
.B mailfoot
performs the FOOT simulations several times on slightly reordered email streams, and
averages the results. The reorderings occur by multiplexing the emails from each
category mailbox in random order. Thus if there are three categories, the first email
classified is chosen randomly from the front of the sample email streams of each type.
The second email is also chosen randomly among the three types, from the front of the
streams after the first email was removed. Simulation stops when all sample streams
are exhausted.
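.PP
The random multiplexing described above can be illustrated with a small shell sketch.
It is not the code used by
.BR mailfoot ;
the function name multiplex, the SEED variable, the use of one message identifier
per line instead of real mbox files, and the uniform choice among the streams that
are not yet exhausted are all assumptions made for illustration.
.PP
.nf
```shell
# Hypothetical sketch: interleave several ordered streams by repeatedly
# picking one non-empty stream at random and taking its first element.
# Each argument is a file holding one message identifier per line.
multiplex() {
    awk -v seed="${SEED:-1}" '
        BEGIN { srand(seed) }
        # A new input file starts a new stream.
        FNR == 1 { nstreams++ }
        # Append the line to the queue of the current stream.
        { queue[nstreams, ++len[nstreams]] = $0 }
        END {
            for (i = 1; i <= nstreams; i++) head[i] = 1
            for (left = NR; left > 0; left--) {
                # List the streams that still have messages.
                n = 0
                for (i = 1; i <= nstreams; i++)
                    if (head[i] <= len[i]) alive[++n] = i
                # Uniform choice among them (an assumption).
                pick = alive[int(rand() * n) + 1]
                print queue[pick, head[pick]]
                head[pick]++
            }
        }
    ' "$@"
}
```
.fi
.PP
Running the function with different SEED values gives different interleavings,
while the relative order of the messages within each input file never changes.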
.PP
.B mailfoot
uses the environment variable MAILFOOT_FILTER when
executing, which permits the simulation of arbitrary filters, provided
these satisfy the compatibility conditions stated in the
ENVIRONMENT section below.
.PP
For convenience,
.B mailfoot
implements a
.B testsuite
framework with predefined wrappers for several open
source classifiers. This permits the direct comparison of
.BR dbacl (1)
with competing classifiers on the same set of email samples. See the USAGE section below.
.PP
During preparation,
.B mailfoot
builds a subdirectory named mailfoot.d in the current working directory.
All needed calculations are performed inside this subdirectory.
.SH EXIT STATUS
.B mailfoot
returns 0 on success, 1 if a problem occurred.
.SH COMMANDS
.PP
.IP "\fBprepare\fR \fIsize\fR"
Prepares a subdirectory named mailfoot.d in the current working directory, and
populates it with empty subdirectories for exactly
.I size
subsets.
.IP "\fBadd\fR \fIcategory\fR [ \fIFILE\fR ]..."
Takes a set of emails from FILE if specified, otherwise from STDIN, and
associates them with
.IR category .
The ordering of emails within \fIFILE\fR is preserved, and subsequent \fIFILE\fRs are appended
to the first in each category.
This command can be repeated several times,
but should be executed at least once.
.IP "\fBclean\fR"
Deletes the directory mailfoot.d and all its contents.
.IP "\fBrun\fR"
Multiplexes randomly from the email streams added earlier, and relearns
categories only when a misclassification occurs. The simulation is repeated
.I size
times.
.IP "\fBsummarize\fR"
Prints average error rates for the simulations.
.IP "\fBplot\fR [ \fIps\fR | \fIlogscale\fR ]..."
Plots the number of errors over simulation time. The "ps" option, if present,
writes the plot to a PostScript file in the directory mailfoot.d/plots, instead of
showing it on-screen. The "logscale" option, if present, causes the plot to
use a logarithmic scale on both axes.
.IP "\fBreview\fR \fItruecat\fR \fIpredcat\fR"
Scans the last run statistics and extracts all the messages which belong to category
.I truecat
but have been classified into category
.IR predcat .
The extracted messages are copied to the directory
.I mailfoot.d/review
for perusal.
.PP
.IP "\fBtestsuite list\fR"
Shows a list of available filters/wrapper scripts which can
be selected.
.IP "\fBtestsuite select\fR [ \fIFILTER\fR ]..."
Prepares the filter(s) named
.I FILTER
to be used for simulation. The filter name is the name of
a wrapper script located in the directory
.IR @PKGDATADIR@/testsuite .
Each filter has a rigid interface documented in the SCRIPT INTERFACE section below,
and the act of selecting it copies it to the
.I mailfoot.d/filters
directory. Only filters located there
are used in the simulations.
.IP "\fBtestsuite deselect\fR [ \fIFILTER\fR ]..."
Removes the named filter(s) from the directory
.I mailfoot.d/filters
so that they are not used in the simulation.
.IP "\fBtestsuite run\fR [ \fIplots\fR ]"
Invokes every selected filter on the datasets added previously, and
calculates misclassification rates. If the "plots" option is present,
each filter simulation is plotted as a PostScript file in the directory
.IR mailfoot.d/plots .
.IP "\fBtestsuite status\fR"
Describes the scheduled simulations.
.IP "\fBtestsuite summarize\fR"
Shows the FOOT simulation results for all filters. Only makes sense
after the
.B testsuite run
command.
.SH USAGE
.PP
The normal usage pattern is the following: first, you should separate your email
collection into several categories (manually or otherwise). Each category should
be associated with one or more folders, but no folder should contain
more than one category. Next, you should decide how many runs to use, say 10.
The more runs you use, the more reliable the estimated error rates. However, more runs take more time.
Now you can type
.PP
.na
% mailfoot prepare 10
.ad
.PP
Next, for every category, you must add every folder associated with this
category. Suppose you have three categories named
.IR spam ,
.IR work ,
and
.IR play ,
which are associated with the mbox files
.IR spam.mbox ,
.IR work.mbox ,
and
.I play.mbox
respectively. You would type
.PP
.na
% mailfoot add spam spam.mbox
.br
% mailfoot add work work.mbox
.br
% mailfoot add play play.mbox
.ad
.PP
You should aim for a similar number of emails in each category, as the random
multiplexing will be unbalanced otherwise. The ordering of the email messages
in each
.I *.mbox
file is important, and is preserved during each simulation. If you repeatedly
add to the same category, the later mailboxes will be appended to the first, preserving
the implied ordering.
.PP
You can now perform as many FOOT simulations as desired. The multiplexed emails
are classified and learned one at a time, by executing the command given in the
environment variable MAILFOOT_FILTER. If not set, a default value is used.
.PP
.na
% mailfoot run
.br
% mailfoot summarize
.ad
.PP
The testsuite commands are designed to simplify the above steps and allow comparison
of a wide range of email classifiers, including but not limited to
.BR dbacl .
Classifiers are supported through wrapper scripts, which are located in the
.I @PKGDATADIR@/testsuite
directory.
.PP
The first stage when using the testsuite is deciding which classifiers to compare.
You can view a list of available wrappers by typing:
.PP
.na
% mailfoot testsuite list
.ad
.PP
Note that the wrapper scripts are NOT the actual email classifiers, which must
be installed separately by your system administrator or otherwise.
Once this is done, you can select one or more wrappers for the simulation
by typing, for example:
.PP
.na
% mailfoot testsuite select dbaclA ifile
.ad
.PP
If some of the selected classifiers cannot be found on the system, they
are not selected. Note also that some wrappers
can have hard-coded category names, e.g. if the classifier only supports binary
classification. Heed the warning messages.
.PP
It remains only to run the simulation. Beware, this can take a long time
(several hours depending on the classifier).
.PP
.na
% mailfoot testsuite run
.br
% mailfoot testsuite summarize
.ad
.PP
Once you are all done, you can delete the working files, log
files etc. by typing
.PP
.na
% mailfoot clean
.ad
.SH SCRIPT INTERFACE
.PP
.B mailfoot testsuite
takes care of learning and classifying your prepared email corpora for each
selected classifier. Since classifiers have widely varying interfaces, this
is only possible by wrapping those interfaces individually into a standard
form which can be used by
.BR "mailfoot testsuite" .
.PP
Each wrapper script is a command line tool which accepts a single command
followed by zero or more optional arguments, in the standard form:
.PP
.na
wrapper command [argument]...
.ad
.PP
Each wrapper script also makes use of STDIN and STDOUT in a well defined
way. If no behaviour is described, then no output or input should be used.
The possible commands are described below:
.IP filter
In this case, a single email is expected on STDIN, and a list of
category filenames is expected in $2, $3, etc. The script writes the
category name corresponding to the input email on STDOUT. No trailing newline
is required or expected.
.IP learn
In this case, a standard mbox stream is expected on STDIN, while a
suitable category file name is expected in $2. No output is written to
STDOUT.
.IP clean
In this case, a directory is expected in $2, which is examined for old
database information. If any old databases are found, they are purged or
reset. No output is written to STDOUT.
.IP describe
In this case, a single line of text is written to STDOUT, describing the filter's
functionality. The line should be kept short to prevent line wrapping on a terminal.
.IP bootstrap
In this case, a directory is expected in $2. The wrapper script first checks for
the existence of its associated classifier, and other prerequisites. If the
check is successful, then the wrapper is cloned into the supplied directory.
A courtesy notification should be given on STDOUT to express success or failure.
It is also permissible to give longer descriptions or caveats.
.IP toe
Used by
.BR mailtoe (1).
.IP foot
In this case, the true category is given in $2, followed by a list of categories
in $3, $4, etc. Every possible category must be listed.
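.PP
As a rough illustration of the command dispatch described above, here is a minimal
wrapper skeleton written as a shell function. It is a sketch under stated assumptions,
not one of the wrappers shipped in the testsuite directory: the name toy_wrapper, the
*.toydb database suffix, the placeholder decision rule, and the use of the base name of
a category file as the category name are all hypothetical. The bootstrap, toe and foot
commands are left out for brevity.
.PP
.nf
```shell
# Hypothetical wrapper skeleton (illustration only).
# Usage: toy_wrapper COMMAND [ARGUMENT]...
toy_wrapper() {
    cmd=$1
    case "$cmd" in
        filter)
            # One email on STDIN; category file names follow the command.
            shift
            cat > /dev/null
            # Placeholder decision: always answer with the first category.
            # Assumes the category name is the base name of its file.
            printf '%s' "$(basename "$1")"
            ;;
        learn)
            # mbox stream on STDIN, category file name in $2; no output.
            cat >> "$2"
            ;;
        clean)
            # Directory in $2; purge old (hypothetical) databases silently.
            rm -f "$2"/*.toydb
            ;;
        describe)
            # A single short line describing the filter.
            echo "toy wrapper: always answers with the first category"
            ;;
        *)
            echo "toy_wrapper: unsupported command: $cmd" >&2
            return 1
            ;;
    esac
}
```
.fi
.PP
The filter branch uses printf without a newline because no trailing newline
is required or expected on STDOUT.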
.SH ENVIRONMENT
.PP
Right after loading,
.B mailfoot
reads the hidden file .mailfootrc in the $HOME directory, if it exists, so
this would be a good place to define custom values for environment variables.
.IP MAILFOOT_FILTER
This variable contains a shell command to be executed repeatedly
during the running stage.
The command should accept an email message on STDIN and output a
resulting category name. On the command line, it should also accept
first the true category name, then a list of all possible category
file names. If the output category does not match the true category,
then the relevant categories are assumed to have been silently
updated/relearned.
If MAILFOOT_FILTER is undefined,
.B mailfoot
uses a default value.
.IP TEMPDIR
This directory is exported for the benefit of wrapper scripts. Scripts which
need to create temporary files should place them at the location given in TEMPDIR.
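.PP
The calling convention of MAILFOOT_FILTER can be sketched as follows. This is not
the default value used by
.BR mailfoot ,
which is not reproduced here. The function name toy_foot, the line matching score
used as a stand-in classifier, the fallback to /tmp when TEMPDIR is unset, and the
use of the base name of a category file as the category name are assumptions made
for illustration only.
.PP
.nf
```shell
# Hypothetical MAILFOOT_FILTER-style command, written as a shell function.
# Usage: toy_foot TRUECAT CATFILE...   (one email on STDIN)
toy_foot() {
    truecat=$1
    shift
    # Keep a copy of the message, honouring TEMPDIR when it is set.
    msg=$(mktemp "${TEMPDIR:-/tmp}/toyfoot.XXXXXX") || return 1
    cat > "$msg"

    # Placeholder classifier (an assumption, not dbacl): score each
    # category file by how many of its lines also occur in the message.
    best=$1
    bestscore=-1
    for catfile in "$@"; do
        [ -f "$catfile" ] || : > "$catfile"
        score=$(grep -c -x -F -f "$msg" "$catfile" || true)
        if [ "$score" -gt "$bestscore" ]; then
            best=$catfile
            bestscore=$score
        fi
    done
    predicted=$(basename "$best")

    # FOOT step: on a misclassification, silently relearn the true
    # category, here simply by appending the message to its file.
    if [ "$predicted" != "$truecat" ]; then
        for catfile in "$@"; do
            if [ "$(basename "$catfile")" = "$truecat" ]; then
                cat "$msg" >> "$catfile"
            fi
        done
    fi
    rm -f "$msg"
    printf '%s' "$predicted"
}
```
.fi
.PP
Feeding the same message twice shows the relearning step: a message that is
misclassified on the first call is appended to its true category and is
recognised on the second call.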
.SH NOTES
.PP
The subdirectory mailfoot.d can grow quite large. It
contains a full copy of the training corpora, as well as learning files for
.I size
times all the added categories, and various log files.
.PP
FOOT simulations for
.BR dbacl (1)
are very slow (order n squared) and can take all night to perform. This is not easy to improve.
.SH WARNING
.PP
Because the ordering of emails within the added mailboxes matters, the estimated
error rates are not well defined or even meaningful in an objective sense.
However, if the sample emails represent an actual snapshot of a user's incoming email,
then the error rates are somewhat meaningful. The simulations can then be interpreted
as alternate realities where a given classifier would have intercepted the incoming mail.
.SH SOURCE
.PP
The source code for the latest version of this program is available at the
following locations:
.PP
.na
http://www.lbreyer.com/gpl.html
.br
http://dbacl.sourceforge.net
.ad
.SH AUTHOR
.PP
Laird A. Breyer <laird@lbreyer.com>
.SH SEE ALSO
.PP
.BR bayesol (1),
.BR dbacl (1),
.BR mailcross (1),
.BR mailinspect (1),
.BR mailtoe (1),
.BR regex (7)