dbacl NEWS -- history of user-visible changes. From August 2004.
Copyright (C) 2004, 2005 Laird Breyer.

Added the "Can spam filters play chess?" essay to the bundled
documentation; look in the doc/chess directory. Added the TREC2005
options files to the TREC directory. Fixed some parsing bugs.

There is now a new parser "-e char" which parses single characters. This
isn't useful on its own, but together with the -w switch it allows fast
construction of character n-gram models up to order 7. Note that you could
simply use a series of regular expressions to generate n-grams, but this
way doesn't have the regex overhead.
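
To make the idea of character n-grams concrete, here is a minimal sketch
in Python. It is not dbacl's implementation: the function name
char_ngrams and its parameters are hypothetical, and it only illustrates
what "all n-grams up to a given order" means when each token is a single
character.

```python
def char_ngrams(text, max_order):
    """Return every character n-gram of order 1..max_order, in reading order.

    Hypothetical helper for illustration only; dbacl builds its n-grams
    internally and this is not its algorithm.
    """
    grams = []
    for start in range(len(text)):
        for order in range(1, max_order + 1):
            # Skip n-grams that would run past the end of the text.
            if start + order <= len(text):
                grams.append(text[start:start + order])
    return grams


print(char_ngrams("spam", 2))
# ['s', 'sp', 'p', 'pa', 'a', 'am', 'm']
```

Simple slicing like this needs no pattern matching at all, which is the
kind of overhead a character parser avoids.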

Due to what appears to have been a typo, the signal handling code
was disabled. It now works as advertised.

The score calculations now do renormalization slightly differently,
and document complexities are also changed from integers to reals.
This should be practically unnoticeable for simple models, but it will
be noticeable for divergences and complexities of n-gram models,
although the impact is minor asymptotically for large complexity. This
change allows more meaningful direct comparisons between models based
on widely differing tokenization schemes, i.e. in principle it allows
comparing a category which is based solely on alphabetical word tokens
with another category which is based solely on numbers, for example,
even though they don't compare similar tokens.
Which is not to say that you should do it. You're safe if you
always learn all your categories with exactly the same set of model

When using the -w switch, complex tokens no longer continue past
the end of a line and onto the next one. This is more consistent
with other switch behaviours, and you can force n-grams to straddle
newlines by using the -S switch.
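
The difference between line-bounded and straddling n-grams can be
sketched in Python as follows. This is only an illustration under
simplifying assumptions (whitespace tokenization, n-grams of exactly one
order); the function word_ngrams and its straddle_newlines parameter are
hypothetical names, not part of dbacl.

```python
def word_ngrams(text, order, straddle_newlines=False):
    """Build word n-grams containing exactly `order` tokens.

    With straddle_newlines=False every line is tokenized on its own, so
    no n-gram spans a line break. With straddle_newlines=True the whole
    text is treated as one token stream.
    """
    chunks = [text] if straddle_newlines else text.splitlines()
    grams = []
    for chunk in chunks:
        tokens = chunk.split()
        for i in range(len(tokens) - order + 1):
            grams.append(" ".join(tokens[i:i + order]))
    return grams


sample = "buy cheap\npills now"
print(word_ngrams(sample, 2))
# ['buy cheap', 'pills now']
print(word_ngrams(sample, 2, straddle_newlines=True))
# ['buy cheap', 'cheap pills', 'pills now']
```

In the first call the bigram "cheap pills" never appears, because its two
words sit on different lines.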

When using the -o and -m switches together, some extra memory mapping
is now performed. This is useful for keeping the mapped pages
invariant for the TREC tests, but doesn't help in speeding up the

In the spamjig run [which performs classify/learn for every input
document], after all pages are locked into place, about 90% of the cpu
time is spent optimizing the weights [by contrast, in ordinary use,
about 70% of the running time is reading and parsing input]. The only
way I can see to improve the cpu bottleneck is to exploit symmetries
and compression techniques. However, this can't be done without
changing the learner hash structure, which must be thought through
carefully [and won't be done soon]. As an added benefit, doing this
correctly should imply much reduced memory requirements during

A new TREC directory contains the necessary scripts and instructions
for running dbacl in the TREC/spam testing framework (spamjig).

The mail body parser was tweaked, so it no longer ignores the preamble
before the first MIME section. This goes against RFC 2046 (p.20)
recommendations, but if a spammer uses it, there's got to be a reason.
So now we also parse the preamble (this can be disabled, see
IGNORE_MIME_PREAMBLE).
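
For readers unfamiliar with the MIME preamble, the sketch below uses
Python's standard email package (not dbacl's own parser) to show where
the preamble sits in a multipart message. The sample message, its
addresses and its boundary string are made up for illustration.

```python
from email import message_from_string

# A made-up multipart message. The text between the headers and the
# first boundary line is the preamble.
RAW = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "Text hidden in the preamble.\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Visible body part.\n"
    "--XYZ--\n"
)

msg = message_from_string(RAW)
preamble = msg.preamble                            # text before first boundary
parts = [p.get_payload() for p in msg.get_payload()]

print("preamble:", repr(preamble))
print("parts:", parts)
```

A mail reader that follows the RFC recommendation would normally show
only the body part, while a parser that keeps the preamble also sees the
first line of text.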

The -0 switch is now always on by default. Recall that its purpose is
to prevent weight preloading if the category file already
exists. Weight preloading speeds up the learning operation by starting
with the last known set of weights for the category. It's a nice idea,
but can cause trouble if the old category feature list is much different
from the new feature set to be learned. In particular, if you leave
an old category named "dummy" on your system, and months later you decide
to learn an unrelated category also named "dummy"...

Preloading must now be explicitly enabled with the new -1 switch if
you want to experiment with it.

The -g switch now scans a given regular expression for captures
(parentheses) and, as a convenience, surrounds the expression with a
single capture if none were found. The -g switch is powerful, but hard to

Many Unix tools use regular expressions. Such an expression normally
matches a substring in the input, but if it also contains parentheses,
then whatever is inside those parentheses is "captured". So the
expression 'Hello .*' matches the string "Hello Fred", but the
expression 'Hello (.*)' both matches "Hello Fred" and also captures
"Fred". In dbacl, the -g switch lets you construct tokens from
captured expressions, but a corollary is that if you don't supply a
capture expression, then dbacl won't read any tokens at all! As a
convenience, if no parentheses exist, dbacl will now add some. Thus the
command line switch

    -g 'Hello .*' is converted to -g '(Hello .*)'
    but -g 'Hello (.*)' is left untouched.
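
The wrapping rule can be mimicked in a few lines of Python. This is a
sketch of the behaviour described above, not dbacl's code: the helper
name ensure_capture is hypothetical, and it relies on Python's re module,
whose syntax differs in places from the regular expressions dbacl
accepts.

```python
import re


def ensure_capture(pattern):
    """Wrap `pattern` in a single capture group if it has none.

    Hypothetical helper: uses Python's own group count rather than
    scanning for parentheses as the text describes.
    """
    if re.compile(pattern).groups == 0:
        return "(" + pattern + ")"
    return pattern


print(ensure_capture("Hello .*"))
# (Hello .*)
print(ensure_capture("Hello (.*)"))
# Hello (.*)

# The wrapped expression captures the whole match; the original capture
# still picks out only the name.
print(re.search(ensure_capture("Hello .*"), "Hello Fred").group(1))
# Hello Fred
print(re.search(ensure_capture("Hello (.*)"), "Hello Fred").group(1))
# Fred
```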