
The <a href="http://www.lbreyer.com/dbacltut.html">tutorial</a> explains general classification only. It is worth reading (of course :-), but doesn't describe the extra steps necessary to enable the email functionality. This document describes the necessary switches and caveats from first principles.

You can learn more about the dbacl suite of utilities (e.g. <a href="http://www.lbreyer.com/dbaclman.html">dbacl</a>,

<a href="http://www.lbreyer.com/bayesolman.html">bayesol</a>,

You must make sure that the $HOME/mail/notspam folder doesn't contain any unwanted messages, and

similarly $HOME/mail/spam must not contain any wanted messages. If you mix messages in the two folders, dbacl(1) will be somewhat confused, and its classification accuracy will drop.

If you've used other Bayesian spam filters, you will find that dbacl(1) is slightly different. While other filters can sometimes learn incrementally, one message at a time,

dbacl always learns from scratch all the messages you give it, and only those. dbacl is optimized for learning a large number of messages quickly in one go, and to classify messages as fast as possible afterwards. The author has several good reasons for this choice, which are beyond the scope of this tutorial.

As time goes by, if you use dbacl(1) for classification, you will probably set up your filing system so that messages identified as spam go automatically into the $HOME/mail/spam folder, and messages identified as notspam go into the $HOME/mail/notspam folder.

dbacl(1) is far from perfect, and can make mistakes. This will result in messages going to the wrong folder. When dbacl(1) relearns, it will become slightly confused and over time its ability to distinguish spam and notspam will be diminished.

and if you find messages in the wrong folder, you must move them to the correct folder before relearning. If you keep your mail folders clean for learning, dbacl(1) will eventually make very

few mistakes, and you will have plenty of time to inspect the folders once in a while. Or so the theory goes...

Since dbacl(1) must relearn categories from scratch each time, you will

probably want to set up a cron(1) job to relearn your mail folders

every day at midnight. This tutorial explains how to do this below.

If you like, you don't need to relearn periodically at all. The author relearns

his categories once every few months, without noticeable loss. This works

as long as your mail folders contain enough representative emails for training.

Last but not least, if after reading this tutorial you have trouble getting classifications working, please read <a href="is_it_working.html">is_it_working.html</a>.

<h2>Basic operation: Learning</h2>

To learn your spam category, go to the directory containing your

</pre>

This reads all the messages in $HOME/mail/spam one at a time, ignoring certain mail headers and attachments, and also removes HTML markup. The result is a binary file called spam, which can be used by dbacl for classifications. This file is a snapshot; it cannot be modified by learning extra mail messages (unlike other spam filters, which sometimes let you learn incrementally; see below for discussion).

If you get warning messages about the hash size being too small,

you need to increase the memory reserved for email tokens. Type:

parsed. The man page lists them all, but of particular interest are the -T and -e options.

If your email isn't kept in mbox format, dbacl(1) can open one or more directories and read all the files in them, or you can list each email separately on the command line. For example, if your messages are stored in the directory $HOME/mh/, one file per email, you can type either of
<pre>
% dbacl -T email -l spam $HOME/mh
% dbacl -T email -l spam $HOME/mh/*
</pre>
Note however that you should be certain that only RFC822 messages are contained in this directory, and that listing files individually might run into shell command line limitations if you have a very large number of emails. At present, dbacl(1) won't read the subdirectories, or look at the file names to decide whether to read some messages and not others.

Another (but not necessarily faster) solution is to temporarily convert your mail into an mbox format file and use that for learning:

<pre>
% find $HOME/mh -type f | while read f; \
</pre>

It is not enough to learn $HOME/mail/spam emails; you must also learn the $HOME/mail/notspam emails. dbacl(1) can only choose among the categories which have been previously learned. It cannot say that an email is unlike spam (that's an open ended statement), only that an email is like spam or like notspam (these are both concrete statements). To learn the notspam category, type:
<pre>
% dbacl -T email -l notspam $HOME/mail/notspam
</pre>

All you get is the name of the best category; the email itself is consumed.

A variation of particular interest is to replace -v by -U, which gives a percentage representing how sure dbacl is of printing the correct category.
<pre>
% cat email.rfc | dbacl -T email -c spam -c notspam -U
notspam # 67%
</pre>
A result of 100% means dbacl is very sure of the printed category, while a result of 0% means at least one other category is equally likely.
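One way to act on the -U percentage in a script is to treat low-confidence results as undecided. The following is only a sketch: the function name, the "unsure" label and the default threshold of 90 are illustrative choices, and the parsing assumes the output line has exactly the shape of the sample above (category, then "#", then a whole-number percentage).

```shell
# Hypothetical helper: decide what to do with one line of "dbacl ... -U" output.
# Assumes the line looks like the sample above, e.g. "notspam # 67%".
# Usage: route_by_confidence "<line>" [threshold]
route_by_confidence() {
    line=$1
    threshold=${2:-90}           # illustrative cutoff, not a dbacl default
    category=${line%% *}         # text before the first space
    percent=${line##* }          # text after the last space, e.g. "67%"
    percent=${percent%"%"}       # drop the trailing percent sign
    if [ "$percent" -ge "$threshold" ]; then
        echo "$category"
    else
        echo "unsure"
    fi
}

# Possible use with a real message (requires dbacl and learned categories):
#   route_by_confidence "$(dbacl -T email -c spam -c notspam -U < email.rfc)"
```

With the sample line above and the default threshold, the helper prints "unsure"; messages flagged this way could be left in the inbox for manual filing.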

If you would like to see separate scores for each category, type:
<pre>
% cat email.rfc | dbacl -T email -c spam -c notspam -n
spam 232.07 notspam 229.44
</pre>
The winning category always has the smallest score (closest to zero). In fact, the numbers returned with the -n switch can be interpreted as unnormalized distances towards each category from the input document, in a mathematical space of high dimensions.
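The "smallest score wins" rule can be checked by hand with a few lines of shell. This is a minimal sketch, assuming the -n output is a single line of alternating category names and scores as in the sample above; the function name is made up for illustration.

```shell
# Hypothetical helper: pick the category with the smallest score from one line
# of "dbacl ... -n" output, assumed to be "name score name score ...".
best_category() {
    printf '%s\n' "$1" | awk '{
        best = $1; min = $2 + 0
        # Walk the remaining name/score pairs and keep the smallest score.
        for (i = 3; i < NF; i += 2)
            if ($(i + 1) + 0 < min) { min = $(i + 1) + 0; best = $i }
        print best
    }'
}

# Possible use with a real message (requires dbacl and learned categories):
#   best_category "$(dbacl -T email -c spam -c notspam -n < email.rfc)"
```

For the sample line "spam 232.07 notspam 229.44" this prints notspam, since 229.44 is the smaller of the two scores.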

If you prefer a return code, dbacl(1) returns a positive integer (1, 2, 3, ...) identifying the category by its position on the command line. So if you type:

STDERR. It is always worth rehearsing the operations you intend to script, as dbacl(1) will let you know on STDERR if it encounters problems during learning. If you ignore warnings, you will likely end up with suboptimal classifications, because the dbacl system prefers to do what it is told predictably, rather than stop when an error condition occurs.

Once you are ready for spam filtering, you need to handle two issues.

The first issue is when and how to learn.

You should relearn your categories whenever you've received an appreciable number of emails or whenever you like. Unlike other spam filters, dbacl cannot learn new emails incrementally and update its category files. Instead, you must keep your messages organized and dbacl(1) will take a snapshot.

This limitation is actually advantageous in the long run, because it forces you to keep usable archives of your mail and gives you control over every message that is learned. By contrast, with incremental learning you must remember which messages have already been learned, how many times, and whether to unlearn them if you change your mind.

A dbacl category model normally doesn't change dramatically if you add a single new email (provided the original model depends on more than a handful of emails). Over time, you can even stop learning altogether when your error rate is low enough. The simplest strategy for continual learning is a cron(1) job run once a day:
<pre>
% crontab -l > existing_crontab.txt
</pre>
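As a rough illustration of what such a daily job might run, the sketch below relearns both categories from scratch using the learning command shown earlier. The function name, the MAIL_ROOT and DBACL variables, and the script path in the sample crontab line are all assumptions made for this example; they are not part of dbacl.

```shell
#!/bin/sh
# Sketch of a relearning script suitable for cron(1).
# MAIL_ROOT and DBACL are illustrative variables, not dbacl settings.
relearn_all() {
    mail_root=${MAIL_ROOT:-$HOME/mail}
    for category in spam notspam; do
        # Same learning command as used interactively, once per category.
        "${DBACL:-dbacl}" -T email -l "$category" "$mail_root/$category" || return 1
    done
}

# When saved as a script, call the function at the end:
#   relearn_all
#
# Example crontab entry (hypothetical script path), run daily at midnight:
#   0 0 * * * $HOME/bin/relearn.sh
```

The script should be run from the directory where you keep your category files. It stops at the first failing dbacl invocation, so cron's mail report will carry any warnings printed on STDERR.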


<h2>Advanced operation: Costs</h2>

This section can be skipped. It is here for completeness, but probably won't be very useful to you, especially if you are a new user.

The classification performed by dbacl(1) as described above is known as a MAP (maximum a posteriori) estimate. The optimal category is chosen only by looking at the email contents. What is missing is your input as to the costs of misclassifications.

This section is by no means necessary for using dbacl(1) for most classification tasks. It is useful for tweaking dbacl's algorithms only. If you want to improve dbacl's accuracy, first try to learn bigger collections of email.

To understand the idea, imagine that an email being wrongly marked spam is likely to be sitting in the $HOME/mail/spam folder until you check through it, while an email wrongly marked notspam will prominently appear among your regular correspondence. For most people, the former case can mean a missed timely communication, while the latter case is merely an annoyance.

<h2>Advanced operation: Parsing</h2>

dbacl(1) sets some default switches which should be acceptable for email classification, but probably not optimal. If you like to experiment, then this section should give you enough material to stay occupied, but reading it is not strictly necessary.

When dbacl(1) inspects an email message, it only looks at certain words/tokens. In all examples so far, the tokens picked up were purely alphabetic words. No numbers are picked up, or special characters such as $, @, % and punctuation.
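To get a feel for what "purely alphabetic words" leaves behind, the text can be split with tr(1). This is only a rough approximation for illustration: it is not dbacl's actual tokenizer, it does not handle mail headers or HTML, and the function name is made up.

```shell
# Rough illustration only: split text into purely alphabetic tokens.
# This approximates the behaviour described above with tr(1); it is not
# dbacl's actual tokenizer.
alpha_tokens() {
    # Replace every run of non-alphabetic characters with a single newline.
    tr -cs '[:alpha:]' '\n'
}

# Example:
#   printf 'Win $100 now @ 50%% off!' | alpha_tokens
```

In the example, the amount, the @ sign, the percentage and the punctuation all disappear, leaving only the words Win, now and off, one per line.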


<h2>Cross Validation</h2>

This section explains quality control and accuracy testing, but is not needed for daily use.

If you have time to kill, you might be inspired by the instructions above to write your own learning and filtering shell scripts. For example, you might have a script $HOME/mylearner.sh containing

<pre>


<a href="http://spambayes.sourceforge.net">spambayes</a>.

The interface requirements are described in the mailcross(1) manual page.

Note that the supplied wrappers can sometimes be out of date for the most popular Bayesian filters, because these projects can change their interfaces frequently. Also, the wrappers may not use the most flattering combinations of switches and options, as only each filter author knows the best way to use his own filter.

Besides cross validation, you can also test Train On Error and Full Online Ordered Training schemes, via the mailtoe(1) and mailfoot(1) commands. Using them is very similar to using mailcross(1).

The (United States) <a href="http://www.nist.gov">National Institute of Standards and Technology</a> organises an annual conference on text retrieval called <a href="http://trec.nist.gov">TREC</a>, which in 2005 began a new track on spam filtering. A goal of this conference is to develop over several years a set of standard methodologies for evaluating and comparing spam filtering systems.

For 2005, the initial goal is to compare spam filters in a laboratory environment, not directly connected to the internet. An identical stream of email messages addressed to a single person is shown in chronological order to all participating filters, which can learn them incrementally and must predict the type of each message as it arrives.

The <a href="http://plg.uwaterloo.ca/~trlynam/spamjig/spamfilterjig/">spamjig</a> is the automated system which performs this evaluation. You can download it yourself and run it with your own email collections to test any participating filters. Special <a href="http://www.lbreyer.com/gpl/README.TREC2005.txt">instructions</a> for dbacl can be found in the TREC subdirectory of the source package. Many other open source spam filters can also be tested in this framework.

</body>


</html>
