~ubuntu-branches/ubuntu/hardy/dbacl/hardy

Viewing changes to NEWS

Committer: Bazaar Package Importer
Author(s): Zak B. Elep
Date: 2006-03-26 22:35:35 UTC
mto: (2.1.1 etch) (1.1.2 upstream)
mto: This revision was merged to the branch mainline in revision 4.
Revision ID: james.westby@ubuntu.com-20060326223535-icwiulpkzesds4mq

Import upstream version 1.12

files added:
TREC

TREC/Makefile.am

TREC/Makefile.in

TREC/OPTIONS

TREC/OPTIONS.TREC2005.1cefhuj

TREC/OPTIONS.TREC2005.2adphu

TREC/OPTIONS.TREC2005.3adphd

TREC/OPTIONS.TREC2005.4adp

TREC/OPTIONS.adp-dir-d

TREC/OPTIONS.adp-u-d

TREC/OPTIONS.adp-unif-d

TREC/OPTIONS.bi-adp-unif-d

TREC/OPTIONS.bi-simple-d

TREC/OPTIONS.cef-dir-d

TREC/OPTIONS.cef-unif-d

TREC/OPTIONS.puretext-d

TREC/OPTIONS.simple-d

TREC/OPTIONS.simple-v

TREC/README

TREC/SFX

TREC/TREC2005.txt

TREC/basic-email

TREC/classify

TREC/finalize

TREC/initialize

TREC/train

TREC/verify-stderr

config/depcomp

contrib

contrib/Makefile.am

contrib/Makefile.in

contrib/README

contrib/clint_adams-patch-dbacl-1.9.gz

doc/chess

doc/chess/Makefile.am

doc/chess/Makefile.in

doc/chess/combine_half_moves.sh

doc/chess/csfpc1.png

doc/chess/csfpc2.png

doc/chess/csfpc3.png

doc/chess/dce-1.sh

doc/chess/dce-2.sh

doc/chess/dce-3.sh

doc/chess/dce-basic.sh

doc/chess/dce.sh

doc/chess/down.png

doc/chess/randomizer.awk

doc/chess/renorm.awk

doc/chess/spam_chess.html

doc/chess/spoiler.png

doc/is_it_working.html

man/hypex.1in

src/hypex.c

src/hypex.h

src/lint-check.sh

src/splintrc

src/tests/email-style.shin

src/tests/sample.spam-11

src/tests/verify.email-style

files removed:
src/stamp-h.in

files modified:
COPYING

ChangeLog

INSTALL

Makefile.am

Makefile.in

NEWS

aclocal.m4

config/config.guess

config/config.sub

configure

configure.in

doc/Makefile.am

doc/Makefile.in

doc/email.html

doc/tutorial.html

man/Makefile.am

man/Makefile.in

man/dbacl.1in

man/hmine.1in

src/Makefile.am

src/Makefile.in

src/bayesol.c

src/catfun.c

src/config.h.in

src/dbacl.c

src/dbacl.h

src/fh.c

src/hmine.c

src/hmine.h

src/hparse.c

src/icheck.c

src/mailcross.in

src/mailinspect.c

src/mbw.c

src/mbw.h

src/probs.c

src/rfc2822.c

src/rfc822.c

src/risk-lexer.c

src/risk-parser.c

src/risk-parser.h

src/risk-parser.y

src/tests/Makefile.am

src/tests/Makefile.in

src/tests/dbacl-a.shin

src/tests/dbacl-g.shin

src/tests/dbacl-jap.shin

src/tests/dbacl-o.shin

src/tests/email-badmime1.shin

src/tests/email-badmime2.shin

src/tests/email-forms.shin

src/tests/email-headers.shin

src/tests/email-l.shin

src/tests/email-maildir.shin

src/tests/email-mbox.shin

src/tests/email-pgp.shin

src/tests/email-scripts.shin

src/tests/email-theaders.shin

src/tests/email-uri.shin

src/tests/email-uu.shin

src/tests/email-xheaders.shin

src/tests/html-alt.shin

src/tests/html-links.shin

src/tests/html.shin

src/tests/icheck.shin

src/tests/lscheck.shin

src/tests/model-sum1.shin

src/tests/pcheck-2821b.shin

src/tests/pcheck-2821g.shin

src/tests/pcheck-2822b.shin

src/tests/pcheck-2822g.shin

src/tests/pcheck-821b.shin

src/tests/pcheck-821g.shin

src/tests/pcheck-822b.shin

src/tests/pcheck-822g.shin

src/tests/reservoir.shin

src/tests/score-1.shin

src/tests/score-2.shin

src/tests/shannon-1.shin

src/tests/shannon-2.shin

src/tests/verify.email-forms

src/tests/verify.email-pgp

src/tests/verify.email-scripts

src/tests/verify.email-theaders

src/tests/verify.email-xheaders

src/tests/xml.shin

src/util.c

src/util.h

ts/Makefile.in

ts/dbaclA

ts/dbaclB

ts/dbaclC

ts/dbaclL

NEWS

dbacl NEWS -- history of user-visible changes. August 2004.

dbacl NEWS -- history of user-visible changes. From August 2004.

dbacl 1.12

Added the "Can spam filters play chess?" essay to the bundled documentation;

look in the doc/chess directory. Added the TREC2005 options files to the

TREC directory. Fixed some parsing bugs.

There is now a new parser "-e char" which parses single characters. This

isn't useful on its own, but together with the -w switch this allows fast

construction of character n-gram models up to order 7. Note that you could

simply use a series of regular expressions to generate n-grams, but this

way doesn't have the regex overhead.
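
A minimal sketch of what "character n-gram models up to order N" means, written in Python rather than taken from the dbacl sources. The function name char_ngrams and its arguments are hypothetical; dbacl's own tokenizer may treat whitespace and line boundaries differently.

```python
from collections import Counter


def char_ngrams(text, max_order):
    """Count every character n-gram of order 1..max_order in text.

    Illustration only: a plain sliding window over the characters,
    not dbacl's actual "-e char" / "-w" implementation.
    """
    counts = Counter()
    for n in range(1, max_order + 1):
        # Slide a window of width n across the string.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    for gram, count in sorted(char_ngrams("spam", 2).items()):
        print(repr(gram), count)
```

With max_order set to 2, the word "spam" yields its four single characters plus the three bigrams "sp", "pa" and "am". No regular expression is evaluated at any point, which is the overhead the note above refers to.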

dbacl 1.11

For some reason which appears to be a typo, the signal handling code

was disabled, but now works as advertised.

The score calculations now do renormalization slightly differently,

and document complexities are also changed from integers to reals.

This should be practically unnoticeable for simple models, but for

divergences and complexities of n-gram models it will be noticeable, although the

impact is minor asymptotically for large complexity. This change

allows more meaningful direct comparisons between models based on

widely differing tokenization schemes, i.e. in principle it allows

comparing a category which is based solely on alphabetical word tokens

with another category which is based solely on numbers, for

example, even though they don't compare similar tokens.

Which is not to say that you should do it. You're safe if you

always learn all your categories with exactly the same set of model

switches.

When using the -w switch, complex tokens no longer continue past

the end of a line and onto the next one. This is more consistent

with other switch behaviours, and you can force n-grams to straddle

newlines by using the -S switch.
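
The difference between n-grams that stop at the end of a line and n-grams that straddle newlines can be sketched as follows. This is an illustration in Python, not dbacl code: the function word_ngrams is hypothetical, its straddle_newlines flag merely stands in for the effect described for the -S switch, and splitting on whitespace is a simplification of dbacl's tokenizer.

```python
def word_ngrams(text, order, straddle_newlines=False):
    """Return word n-grams of exactly the given order.

    By default each line is processed on its own, so no n-gram
    continues past the end of a line. With straddle_newlines=True the
    whole text is treated as one token stream.
    """
    chunks = [text] if straddle_newlines else text.splitlines()
    grams = []
    for chunk in chunks:
        words = chunk.split()  # split() also splits on newlines
        for i in range(len(words) - order + 1):
            grams.append(" ".join(words[i:i + order]))
    return grams


if __name__ == "__main__":
    sample = "buy now\ncheap pills"
    print(word_ngrams(sample, 2))
    print(word_ngrams(sample, 2, straddle_newlines=True))
```

On the two-line sample, the default call produces only "buy now" and "cheap pills", while the straddling call also produces "now cheap", the bigram that crosses the line break.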

When using the -o and -m switches together, some extra memory mapping

is now performed. This is useful for keeping the mapped pages

invariant for the TREC tests, but doesn't help in speeding up the

simulation.

DISCUSSION

In the spamjig run [which performs classify/learn for every input

document], after all pages are locked into place, about 90% of the CPU

time is spent optimizing the weights [by contrast, in ordinary use,

about 70% of the running time is reading and parsing input]. The only

way I can see to improve the CPU bottleneck is to exploit symmetries

and compression techniques. However, this can't be done without

changing the learner hash structure, which must be thought through

carefully [and won't be done soon]. As an added benefit, doing this

correctly should imply much reduced memory requirements during

learning.

dbacl 1.10

A new TREC directory contains the necessary scripts and instructions

for running dbacl in the TREC/spam testing framework (spamjig).

The mail body parser was tweaked, so it no longer ignores the preamble

before the first MIME section. This goes against RFC 2046 (p.20)

recommendations, but if a spammer uses it, there's got to be a reason.

So now we also parse the preamble (this can be disabled, see

IGNORE_MIME_PREAMBLE).

The -0 switch is now always on by default. Recall that its purpose is

to prevent weight preloading if the category file already

exists. Weight preloading speeds up the learning operation by starting

with the last known set of weights for the category. It's a nice idea,

but can cause trouble if the old category feature list is much different

from the new feature set to be learned. In particular, if you leave

an old category named "dummy" on your system, and months later you decide

to learn an unrelated category also named "dummy"...

Preloading must now be explicitly enabled with the new -1 switch if

you want to experiment with it.

The -g switch now scans a given regular expression for captures

(parentheses) and, as a convenience, surrounds the expression with a single capture if

none were found. The -g switch is powerful, but hard to

explain:

Many Unix tools use regular expressions. Such an expression normally

matches a substring in the input, but if it also contains parentheses,

then whatever is inside those parentheses is "captured". So the

expression 'Hello .*' matches the string "Hello Fred", but the

expression 'Hello (.*)' both matches "Hello Fred" and also captures

"Fred". In dbacl, the -g switch lets you construct tokens from

captured expressions, but a corollary is that if you don't supply a

capture expression, then dbacl won't read any tokens at all! As a convenience,

if no parentheses exist, dbacl will now add some. Thus the command line switch

-g 'Hello .*' is converted to -g '(Hello .*)'

but -g 'Hello (.*)' is left untouched.
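
The convenience rule above can be sketched in Python. This is only an illustration under stated assumptions: the helper ensure_capture is hypothetical, and it relies on Python's re module to count capture groups, whereas dbacl scans the expression itself and uses its own regex engine, so edge cases such as escaped parentheses may be handled differently there.

```python
import re


def ensure_capture(pattern):
    """Wrap pattern in one capture group if it has none of its own."""
    # Pattern.groups is the number of capturing groups in the pattern.
    if re.compile(pattern).groups == 0:
        return "(" + pattern + ")"
    return pattern


if __name__ == "__main__":
    for pattern in ("Hello .*", "Hello (.*)"):
        fixed = ensure_capture(pattern)
        match = re.search(fixed, "Hello Fred")
        # group(1) is the text a token would be built from.
        print(pattern, "->", fixed, "captures", repr(match.group(1)))
```

The first expression is wrapped and therefore captures the whole match "Hello Fred"; the second already contains a capture and keeps capturing only "Fred".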

dbacl 1.9
