~ubuntu-branches/ubuntu/hardy/dbacl/hardy

Obviously we've repeated the category one but dbacl treats them separately. dbacl can also print other measures of certainty besides the -U switch, but

they take longer to explain.

You can create as many categories as you want, one, two, three, good, bad, important, jokes, but remember that each one must be learned from a representative

collection of plain text documents.

139

158

To get a feel for the kinds of features taken into account by dbacl in the example above, you can use the -D option. Retype the above in the slightly changed form

140

159

<pre>

141

160

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \

142

-g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D | head -10

161

-g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D | grep match

162

match d191e93e []acquired[](1) 1

163

match 8c56f142 []inquire[](1) 1

164

match 7a2ccda2 []inquiry[](1) 1

165

match 38f595f3 []consequently[](1) 1

166

match a52053f2 []questions[](1) 1

167

match 78a0e302 []question[](1) 1

143

168

</pre>

144

169

145

This lists the first few matches, one per line, which exist in the sample1.txt document. Obviously, only taking into account features which consist of words with the letter 'q' in them makes a poor model.

170

This command lists the first few matches, one per line, which exist in the sample1.txt document. Obviously, only taking into account features which consist of words with the letter 'q' in them makes a poor model. However, when

171

you are trying out regular expressions, you can compare this output with the

172

contents of the document to see if your expression misses out on words or reads too many.

146

173

147

174

Sometimes, it's convenient to use parentheses which you want to throw away. dbacl

148

175

understands the special notation ||xyz which you can place at the end of a regular expression, where x, y, z should be digits corresponding to the parentheses

222

249

% dbacl -l slick -w 2 sample1.txt

223

250

</pre>

224

251

225

produces a model slick which is nearly identical to smooth (the difference is that a regular expression cannot straddle newlines, but -w ngrams can).

252

produces a model slick which is nearly identical to smooth (the difference is that a regular expression cannot straddle newlines, but -w ngrams can). Let's look at the first few features: type

253

<pre>

254

% dbacl -l slick -w 2 sample1.txt -D | grep match | head -10

255

match 818ad280 []tom[](1) 1

256

match 5d20c0e2 []tom[]no[](1) 2

257

match 3db5da99 []no[](1) 1

258

match 4a18ad66 []no[]answer[](1) 2

259

match eea4a1c4 []answer[](1) 1

260

match 95392743 []answer[]tom[](1) 2

261

match 61cc1403 []answer[]what[](1) 2

262

match 8c953ec2 []what[](1) 1

263

match 4291d86e []what[]s[](1) 2

264

match b09aa375 []s[](1) 1

265

</pre>

266

You can see both pairs and single words, all lower case because dbacl

267

converts everything to lower case unless you tell it otherwise. This saves

268

a little on memory. But what did the original document look like?

269

<pre>

270

% head -10 sample1.txt

271

"TOM!"

272

273

No answer.

274

275

"TOM!"

276

277

No answer.

278

279

"What's gone with that boy, I wonder? You TOM!"

280

281

</pre>

282

Now you see how the pairs are formed. But wait, the pair of words

283

("TOM!", No) occurs twice in the text, but only once in the list of matches?

284

Did we miss one? No, look again at the line

285

<pre>

286

match 5d20c0e2 []tom[]no[](1) 2

287

</pre>

288

and you will see that the last value is '2', since we've seen it twice.

289

dbacl uses the frequencies of features to build its model.

226

290

227

291

Obviously, all this typing is getting tedious, and you will eventually

228

292

want to automate

253

317

The cross entropy is measured in bits and has the following meaning:

254

318

If we use our probabilistic model to construct an optimal compression algorithm,

255

319

then the cross entropy of a text string is the predicted number of bits which is needed on average, after compression, for each separate feature.

256

This rough description isn't complete, since the cross entropy doesn't measure the amount of space also needed for the probability model itself.

320

This rough description isn't complete, since the cross entropy doesn't measure the amount of space also needed for the probability model itself, and moreover

321

what we mean by compression is the act of compressing the features, not the full

322

document, which also contains punctuation and white space which is ignored.

257

323

258

324

To compute the cross entropy of category one, type

259

325

<pre>

260

326

% dbacl -c one sample1.txt -vn

261

cross_entropy 7.60 bits complexity 678

327

one 7.42 * 678.0

262

328

</pre>

263

329

264

The cross entropy is the first value returned. The second value essentially

330

The cross entropy is the first value (7.42) returned. The second value essentially

265

331

measures how many features describe the document.

266

332

Now suppose we try other models trained on the same document:

267

333

<pre>

268

334

% dbacl -c slick sample1.txt -vn

269

cross_entropy 4.74 bits complexity 677

335

slick 4.68 * 677.5

270

336

% dbacl -c smooth sample1.txt -vn

271

cross_entropy 5.27 bits complexity 603

337

smooth 6.03 * 640.5

272

338

</pre>

273

339

274

According to these estimates, both bigram models fit sample1.txt better. This is

340

The first thing to nota is that the complexity terms are not the same. The

341

slick category is based on word pairs (also called bigrams), of

342

which tere are 677 in this document. But there are 678 words, and the

343

fractional value indicates that the last word only counts for half a

344

feature. The smooth category also depends on word pairs, but

345

unlike slick, pairs cannot be counted if they straddle a

346

newline (this is a limitation of line-oriented regular expressions).

347

So in smooth, there are several missing word pairs, and various

348

single words which count as a fractional pair, giving a grand total of

349

640.5.

350

351

The second thing to note is that both bigram models fit sample1.txt better. This is

275

352

easy to see for slick, since the complexity (essentially the number of features)

276

is nearly the same as for one. But smooth looks at fewer features, and actually

277

compresses them just slightly better in this case. Let's ask dbacl which category fits better:

353

is nearly the same as for one, so the comparison reduces to seeing

354

which cross entropy is lowest.

355

Let's ask dbacl which category fits better:

356

<pre>

357

% dbacl -c one -c slick sample1.txt -v

358

slick

359

</pre>

360

361

You can do the same thing to compare one and smooth.

362

Let's ask dbacl which category fits better overall:

278

363

<pre>

279

364

% dbacl -c one -c slick -c smooth sample1.txt -v

280

smooth

365

slick

281

366

</pre>

282

367

368

We already know that slick is better than one, but why is

369

slick better than smooth? While slick looks at more

370

features than smooth (677.5 versus 640.5), it needs just 4.68 bits

371

of information per feature to represent the sample1.txt document,

372

while smooth needs 6.03 bits on average. So slick wins based

373

on economies of scale.

374

283

375

284

WARNING: dbacl doesn't yet cope well with widely different feature sets. Don't try to compare

285

categories built on completely different feature specifications unless you fully understand the

286

statistical implications.

376

WARNING: it is not always appropriate to classify documents whose models

377

look at different feature set like we did above. The underlying statistical

378

basis for these comparisons is the likelihood, but it is easy to

379

compare "apples and oranges" incorrectly. It is safest if you learn and

380

classify documents by using exactly the same command line switches

381

for every category.

287

382

288

383

289

384

<h2>Decision Theory</h2>

409

504

bayesol is a companion program for dbacl which makes the decision calculations

410

505

easier. The bad news is that you still have to write down a prior and loss matrix

411

506

yourself. Eventually, someone, somewhere may write a graphical interface.

507

The good news is that for most classification tasks, you don't need to

508

bother with bayesol at all, and can skip this section. Really.

412

509

413

510

bayesol reads a risk specification file, which is a text file containing information about the categories required, the prior distribution and the cost of

414

511

misclassifications. For the toy example discussed earlier, the file toy.risk looks like this:

635

732

look at some scores:

636

733

<pre>

637

734

% dbacl -c one -c two -c three sample4.txt -n

638

one 10512.00 two 7908.17 three 10420.36

735

one 13549.34 two 8220.22 three 13476.84

639

736

% dbacl -c one -c two -c three sample4.txt -nv

640

one 20.25 * 519 two 15.24 * 519 three 20.08 * 519

737

one 26.11 * 519.0 two 15.84 * 519.0 three 25.97 * 519.0

641

738

</pre>

642

739

643

740

The first set of numbers are minus the logarithm (base 2) of each category's

652

749

653

750

Remember that dbacl calculates probabilities about resemblance

654

751

by weighing the evidence for all the features found in the input document.

655

There are 519 features in sample4.txt, and each feature contributes on average 20.25 bits of evidence against category one, 15.24 bits against category two and 20.08 bits against category three. Let's look at what happens if we only look at the first 25 lines of sample4.txt:

752

There are 519 features in sample4.txt, and each feature contributes on average 26.11 bits of evidence against category one, 15.84 bits against category two and 25.97 bits against category three. Let's look at what happens if we only look at the first 25 lines of sample4.txt:

656

753

<pre>

657

754

% head -25 sample4.txt | dbacl -c one -c two -c three -nv

658

one 19.01 * 324 two 14.61 * 324 three 18.94 * 324

755

one 20.15 * 324.0 two 15.18 * 324.0 three 20.14 * 324.0

659

756

</pre>

660

757

661

758

There are fewer features in the first 25 lines of sample4.txt than in the full

667

764

668

765

dbacl is still very sure, because it has looked at many features (324) and found

669

766

small differences which add up to quite different scores. However, you can see that

670

each feature now contributes less information (19.01, 14.61, 18.98) bits compared

671

to the earlier (20.25, 15.24, 20.08).

767

each feature now contributes less information (20.15, 15.18, 20.14) bits compared

768

to the earlier (26.11, 15.84, 25.97).

672

769

673

770

Since category two is obviously the best (closest to zero) choice among the

674

771

three models, let's drop it for a moment and consider the other two categories.

676

773

of sample4.txt has 15 words:

677

774

<pre>

678

775

% head -1 sample4.txt | dbacl -c one -c three -N

679

one 13.90% three 86.10%

776

one 25.65% three 74.35%

680

777

</pre>

681

778

Finally, we are getting probabilities we can understand! Unfortunately, this

682

779

is somewhat misleading. Each of the 15 words gave a score

686

783

probabilities are mixed:

687

784

<pre>

688

785

% head -1 sample4.txt | dbacl -c one -c three -nv

689

one 15.47 * 15 three 15.30 * 15

786

one 16.61 * 15.0 three 16.51 * 15.0

690

787

</pre>

691

788

692

789

So the interpretation of the probabilities is clear. dbacl weighs the

693

790

evidence from each feature it finds, and reports the best fit among the choices

694

it is offered. Because it sees so many features separately, it believes

695

its verdict is very sure. Of course, whether these

696

features are the right features to look at for best classification is another

697

matter entirely, and it's entirely up to you to decide.

791

it is offered. Because it sees so many features separately (hundreds usually), it believes

792

its verdict is very sure. Wouldn't you be after hundreds of checks?

793

Of course, whether these

794

features are independent, and are the right features to look at for best classification is another

795

matter entirely, and it's entirely up to you to decide. dbacl can't do much

796

about its inbuilt assumptions.

797

798

Last but not least, the probabilities above are not the same as the

799

confidence percentages printed by the -U switch. The -U switch was developed

800

to overcome the limitations above, by looking at dbacl's calculations

801

from a higher level, but this is a topic for another time.

698

802

</body>

699

803

</html>

b'\\ No newline at end of file'

Older »