~ubuntu-branches/ubuntu/precise/dbacl/precise

You can create as many categories as you want, one, two, three, good, bad, important, jokes, but remember that each one must be learned from a representative

collection of plain text documents.

dbacl is designed to be easy to use within a script, so you can make it part of

your own projects, perhaps a spam detection script, or an agent which automatically downloads the latest newspaper articles on your favourite topic...

Tip: The category files are created in the current directory, or whichever path you indicate. If you want, you can keep all your category files in a single common directory. For example, if you use the bash shell, type

<pre>

% mkdir $HOME/.dbacl

% echo "DBACL_PATH=$HOME/.dbacl" >> $HOME/.bashrc

% source $HOME/.bashrc

</pre>

From now on, all your categories will be kept in $HOME/.dbacl, and searched for there.

<h2>Language models</h2>

dbacl works by scanning the text it learns for features, which can be nearly

anything you like. For example, unless you tell it otherwise, the standard

features are all alphabetic single words in the document. dbacl builds a

statistical model, ie a probability distribution, based only on those features,

so anything that is not a feature will be ignored both during learning, and

during classification.

This dependence on features is a double edged sword, because it helps dbacl

focus on the things that matter (single alphabetic words by default), but

if something else matters more then it is ignored if you don't tell dbacl it's a

feature. This is the hard part, and it's up to you.

When telling dbacl what kind of features to look out for, you must use the language of regular expressions. For example, if you think the only interesting features for category one are words which contain the letter 'q', then you would type

<pre>

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \

-g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt

</pre>

The rule is that dbacl always takes as a feature whatever it finds within round brackets.

Reading this can be painful if you don't know regular expressions, however.

In English, the first expression after the -g option above reads: take as a feature

any string which looks like: "start of the line" (written ^) followed by "zero or more

characters within the range a-z or A-Z" (written [a-zA-Z]*), followed by "the character q" (written q), followed by "zero or more characters within the range a-z or A-Z" (written [a-zA-Z]*). The second expression is nearly identical: "a single character which is not in the range a-zA-Z" (written [^a-zA-Z]), followed by "zero or more characters within the range a-z or A-Z" (can you guess?), followed by "the character q", followed by "zero or more characters within the range a-z or A-Z". The single quote marks are used to keep the whole expression together.

A regular expression is a simultaneous superposition of many text strings. Just

like a word, you read and write it one character at a time.

<tr>

<td>Symbol</td>

<td>What it means</td>

</tr>

<tr>

<td>any character except newline</td>

100

</tr>

101

<tr>

102

103

<td>zero or more copies of preceding character or parenthesized expression</td>

104

</tr>

105

<tr>

106

107

<td>one or more copies of preceding character or parenthesized expression</td>

108

</tr>

109

<tr>

110

111

<td>zero or one copies of preceding character or parenthesized expression</td>

112

</tr>

113

<tr>

114

115

<td>beginning of line</td>

116

</tr>

117

<tr>

118

119

120

</tr>

121

<tr>

122

123

124

</tr>

125

<tr>

126

127

<td>one character equal to a, b or c</td>

128

</tr>

129

<tr>

130

131

<td>one character not equal to a, b or c</td>

132

</tr>

133

<tr>

134

135

<td>the actual character *, ? or .</td>

136

</tr>

137

</table>

138

139

To get a feel for the kinds of features taken into account by dbacl in the example above, you can use the -D option. Retype the above in the slightly changed form

140

<pre>

141

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \

142

-g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D | head -10

143

</pre>

144

145

This lists the first few matches, one per line, which exist in the sample1.txt document. Obviously, only taking into account features which consist of words with the letter 'q' in them makes a poor model.

146

147

Sometimes, it's convenient to use parentheses which you want to throw away. dbacl

148

understands the special notation ||xyz which you can place at the end of a regular expression, where x, y, z should be digits corresponding to the parentheses

149

you want to keep.

150

Here is an example for mixed Japanese and English documents, which matches alphabetic words and single ideograms:

151

<pre>

152

% LANG=ja_JP dbacl -D -l konichiwa japanese.txt -i \

153

-g '(^|[^a-zA-Z0-9])([a-zA-Z0-9]+|[[:alpha:]])||2'

154

</pre>

155

156

Note that you need a multilingual terminal and Japanese fonts to view this, and

157

your computer must have a Japanese locale available.

158

159

In the table below, you will find a list of some simple regular expressions to get you started:

160

161

162

<tr>

163

<td>If you want to match...</td>

164

<td>Then you need this expression...</td>

165

<td>Examples</td>

166

</tr>

167

<tr>

168

<td>alphabetic words</td>

169

<td>(^|[^[:alpha:]]) ([[:alpha:]]+) ||2

170

</td>

171

<td>hello, kitty</td>

172

</tr>

173

<tr>

174

<td>words in capitals</td>

175

<td>(^|[^[A-Z]]) ([A-Z]+) ||2

176

</td>

177

<td>MAKE, MONEY, FAST</td>

178

</tr>

179

<tr>

180

<td>strings of characters separated by spaces</td>

181

<td>(^|[ ]) ([^ ]+) ||2

182

</td>

183

<td>w$%&tf9(, amazing!, :-)</td>

184

</tr>

185

<tr>

186

187

<td>(^|[^0-9]) ([0-9?[0-9]:[0-9][0-9](am|pm)) ||2

188

</td>

189

190

</tr>

191

<tr>

192

<td>words which end in a number</td>

193

<td>(^|[^a-zA-Z0-9]) ([a-zA-Z]+[0-9]+) [^a-zA-Z] ||2

194

</td>

195

196

</tr>

197

<tr>

198

<td>alphanumeric word pairs</td>

199

<td>(^|[^[:alnum:]]) ([[:alnum:]]+) [^[:alnum:]]+ ([[:alnum:]]+) ||23

200

</td>

201

<td>good morning, how are</td>

202

</tr>

203

</table>

204

205

The last entry in the table above shows how to take word pairs as features.

206

Such models are called bigram models, as opposed to the unigram models whose

207

features are only single words, and they are used to capture extra information.

208

209

For example, in a unigram model the pair of words "well done" and "done well" have

210

the same probability. A bigram model can learn that "well done" is more common in food related documents (provided this combination of words was actually found within the learning corpus).

211

212

However, there is a big statistical problem: because there exist many more meaningful bigrams than unigrams, you'll need a much bigger corpus to obtain meaningful statistics. One way around this is a technique called smoothing, which predicts unseen bigrams from already seen unigrams. To obtain such a

213

combined unigram/bigram alphabetic word model, type

214

<pre>

215

% dbacl -l smooth -g '(^|[^a-zA-Z])([a-zA-Z]+)||2' \

216

-g '(^|[^a-zA-Z])([a-zA-Z]+)[^a-zA-Z]+([a-zA-Z]+)||23' sample1.txt

217

</pre>

218

219

If all you want are alphabetic bigrams, trigrams, etc, there is a special switch

220

-w you can use. The command

221

<pre>

222

% dbacl -l slick -w 2 sample1.txt

223

</pre>

224

225

produces a model slick which is nearly identical to smooth (the difference is that a regular expression cannot straddle newlines, but -w ngrams can).

226

227

Obviously, all this typing is getting tedious, and you will eventually

228

want to automate

229

the learning stage in a shell script. Use regular expressions sparingly, as

230

they can quickly degrade the performance (speed and memory) of dbacl. See

231

<a href="#appendix">Appendix A</a> for ways around this.

232

233

<h2>Evaluating the models</h2>

234

235

Now that you have a grasp of the variety of language models which dbacl can generate, the important question is what set of features should you use?

236

237

There is no easy answer to this problem.

238

Intuitively, a larger

239

set of features seems always preferable,

240

since it takes more information into account.

241

However, there is a tradeoff.

242

Comparing more features requires extra memory,

243

but much more importantly, too many features can overfit the data.

244

This results

245

in a model which is so good at predicting the learned documents,

246

that virtually no other documents are considered even remotely similar.

247

248

It is beyond the scope of this tutorial to describe the variety of statistical

249

methods which can help decide what features are meaningful. However, to get a

250

rough idea of the quality of the model, we can look at the cross entropy

251

reported by dbacl.

252

253

The cross entropy is measured in bits and has the following meaning:

254

If we use our probabilistic model to construct an optimal compression algorithm,

255

then the cross entropy of a text string is the predicted number of bits which is needed on average, after compression, for each separate feature.

256

This rough description isn't complete, since the cross entropy doesn't measure the amount of space also needed for the probability model itself.

257

258

To compute the cross entropy of category one, type

259

<pre>

260

% dbacl -c one sample1.txt -vn

261

cross_entropy 7.60 bits complexity 678

262

</pre>

263

264

The cross entropy is the first value returned. The second value essentially

265

measures how many features describe the document.

266

Now suppose we try other models trained on the same document:

267

<pre>

268

% dbacl -c slick sample1.txt -vn

269

cross_entropy 4.74 bits complexity 677

270

% dbacl -c smooth sample1.txt -vn

271

cross_entropy 5.27 bits complexity 603

272

</pre>

273

274

According to these estimates, both bigram models fit sample1.txt better. This is

275

easy to see for slick, since the complexity (essentially the number of features)

276

is nearly the same as for one. But smooth looks at fewer features, and actually

277

compresses them just slightly better in this case. Let's ask dbacl which category fits better:

278

<pre>

279

% dbacl -c one -c slick -c smooth sample1.txt -v

280

smooth

281

</pre>

282

283

284

WARNING: dbacl doesn't yet cope well with widely different feature sets. Don't try to compare

285

categories built on completely different feature specifications unless you fully understand the

286

statistical implications.

287

288

289

<h2>Decision Theory</h2>

290

291

If you've read this far, then you probably intend to use dbacl to

292

automatically classify text documents, and possibly execute

293

certain actions depending on the outcome. The bad news is that dbacl isn't designed for this. The good news is that there is a companion program, bayesol,

294

which is. To use it, you just need to learn some Bayesian Decision Theory.

295

296

We'll suppose that the document sample4.txt must be classified in one of the

297

categories one, two and three.

298

To make optimal decisions, you'll need three ingredients: a prior distribution,

299

a set of conditional probabilities and a measure of risk. We'll get to these

300

in turn.

301

302

The prior distribution is a set of weights, which you must choose yourself,

303

representing your beforehand beliefs. You choose this before you even look at

304

sample4.txt. For example, you might know from experience that category one is twice as

305

likely as two and three. The prior distribution is a set of weights you choose

306

to reflect your beliefs, e.g. one:2, two:1, three:1. If you have no idea what to

307

choose, give each an equal weight (one:1, two:1, three:1).

308

309

Next, we need conditional probabilities. This is what dbacl is for. Type

310

<pre>

311

% dbacl -l three sample3.txt

312

% dbacl -c one -c two -c three sample4.txt -N

313

one 0.00% two 100.00% three 0.00%

314

</pre>

315

316

As you can see, dbacl is 100% sure that sample4.txt resembles category two.

317

Such accurate answers are typical with the kinds of models used by dbacl.

318

In reality, the probabilities for one and three are very, very small and

319

the probability for two is really close, but not equal to 1.

320

See <a href="#appendix2">Appendix B</a> for a rough explanation.

321

322

We combine the prior (which represents your own beliefs and experiences) with

323

the conditionals (which represent what dbacl thinks about sample4.txt) to obtain

324

a set of posterior probabilities. In our example,

325

<ul>

326

<li>Posterior probability that sample4.txt resembles one: 0%*2/(2+1+1) = 0%

327

<li>Posterior probability that sample4.txt resembles two: 100%*1/(2+1+1) = 100%

328

<li>Posterior probability that sample4.txt resembles three: 0%*1/(2+1+1) = 0%

329

</ul>

330

Okay, so here the prior doesn't have much of an effect. But it's

331

there if you need it.

332

333

Now comes the tedious part.

334

What you really want to do

335

is take these posterior distributions under advisement, and make

336

an informed decision.

337

338

To decide which category best suits your own plans, you need to work

339

out the costs of misclassifications. Only you can decide these numbers, and there

340

are many. But at the end, you've worked out your risk. Here's an example:

341

<ul>

342

<li>If sample4.txt is like one but it ends up marked like one, then the cost is 0

343

<li>If sample4.txt is like one but it ends up marked like two, then the cost is 1

344

<li>If sample4.txt is like one but it ends up marked like three, then the cost is 2

345

<li>If sample4.txt is like two but it ends up marked like one, then the cost is 3

346

<li>If sample4.txt is like two but it ends up marked like two, then the cost is 0

347

<li>If sample4.txt is like two but it ends up marked like three, then the cost is 5

348

<li>If sample4.txt is like three but it ends up marked like one, then the cost is 1

349

<li>If sample4.txt is like three but it ends up marked like two, then the cost is 1

350

<li>If sample4.txt is like three but it ends up marked like three, then the cost is 0

351

</ul>

352

These numbers are often placed in a table called the loss matrix (this

353

way, you can't forget a case), like so:

354

355

356

<tr>

357

<td rowspan="2">correct category</td>

358

<td colspan="3">misclassified as</td>

359

</tr>

360

<tr>

361

362

363

<td>three</td>

364

</tr>

365

<tr>

366

367

368

369

370

</tr>

371

<tr>

372

373

374

375

376

</tr>

377

<tr>

378

<td>three</td>

379

380

381

382

</tr>

383

</table>

384

385

386

We are now ready to combine all these numbers to obtain the True Bayesian Decision.

387

For every possible category, we simply weigh the risk with the posterior

388

probabilities of obtaining each of the possible misclassifications. Then we choose the category with least expected posterior risk.

389

390

<ul>

391

<li>For category one, the expected risk is 0*0% + 3*1000% + 1*0% = 3

392

<li>For category two, the expected risk is 1*0% + 0*100% + 1*0% = 0 <-- smallest

393

<li>For category three, the expected risk is 2*0% + 5*100% + 0*0% = 5

394

</ul>

395

396

The lowest expected risk is for caterogy two, so that's the category we choose

397

to represent sample4.txt. Done!

398

399

Of course, the loss matrix above doesn't really have an effect on the

400

probability calculations, because the conditional probabilities strongly point to

401

category two anyway. But now you understand how the calculation works. Below, we'll look at a more realistic example (but still specially chosen to illustrate some points).

402

403

One last point: you may wonder how dbacl itself decides which category to

404

display when classifying with the -v switch. The simple answer is that dbacl always

405

displays the category with maximal conditional probability (often called the MAP estimate). This is mathematically completely equivalent to the special case of decision theory when the prior has equal weights, and the loss matrix takes the value 1 everywhere, except on the diagonal (ie correct classifications have no cost, everything else costs 1).

406

407

<h2>Using bayesol</h2>

408

409

bayesol is a companion program for dbacl which makes the decision calculations

410

easier. The bad news is that you still have to write down a prior and loss matrix

411

yourself. Eventually, someone, somewhere may write a graphical interface.

412

413

bayesol reads a risk specification file, which is a text file containing information about the categories required, the prior distribution and the cost of

414

misclassifications. For the toy example discussed earlier, the file toy.risk looks like this:

415

<pre>

416

categories {

417

one, two, three

418

}

419

prior {

420

2, 1, 1

421

}

422

loss_matrix {

423

"" one [ 0, 1, 2 ]

424

"" two [ 3, 0, 5 ]

425

"" three [ 1, 1, 0 ]

426

}

427

</pre>

428

429

Let's see if our hand calculation was correct:

430

<pre>

431

% dbacl -c one -c two -c three sample4.txt -vna | bayesol -c toy.risk -v

432

two

433

</pre>

434

435

Good! However, as discussed above, the misclassification costs need

436

improvement. This is completely up to you, but here are some possible

437

suggestions to get you started.

438

439

To devise effective loss matrices, it pays to think about the way that dbacl

440

computes the probabilities. <a href="#appendix2">Appendix B</a> gives some

441

details, but we don't need to go that far. Recall that the language models are

442

based on features (which are usually kinds of words).

443

Every feature counts towards the final probabilities, and a big document

444

will have more features, hence more opportunities to steer the

445

probabilities one way or another. So a feature is like an information

446

bearing unit of text.

447

448

When we read a text document which doesn't accord with our expectations, we

449

grow progressively more annoyed as we read further into the text. This is like

450

an annoyance interest rate which compounds on information units within the text.

451

For dbacl, the number of information bearing units is reported as the complexity

452

of the text.

453

This suggests that the cost of reading a misclassified document could have the

454

form (1 + interest)^complexity. Here's an example loss matrix which uses this idea

455

<pre>

456

loss_matrix {

457

"" one [ 0, (1.1)^complexity, (1.1)^complexity ]

458

"" two [(1.1)^complexity, 0, (1.7)^complexity ]

459

"" three [(1.5)^complexity, (1.01)^complexity, 0 ]

460

}

461

</pre>

462

463

Remember, these aren't monetary interest rates, they are value judgements.

464

You can see this loss matrix in action by typing

465

<pre>

466

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example1.risk -v

467

three

468

</pre>

469

470

Now if we increase the cost of misclassifying two as three from

471

1.7 to 2.0, the optimal category becomes

472

<pre>

473

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example2.risk -v

474

two

475

</pre>

476

477

bayesol can also handle infinite costs. Just write "inf" where you need it.

478

This is particularly useful with regular expressions. If you look at each

479

row of loss_matrix above, you see an empty string "" before each category.

480

This indicates that this row is to be used by default in the actual loss matrix.

481

But sometimes, the losses can depend on seeing a particular string in the document we want to classify.

482

483

Suppose you normally like to use the loss matrix above, but in case the

484

document contains the word "Polly", then the cost of misclassification

485

is infinite. Here is an updated loss_matrix:

486

<pre>

487

loss_matrix {

488

"" one [ 0, (1.1)^complexity, (1.1)^complexity ]

489

"Polly" two [ inf, 0, inf ]

490

"" two [(1.1)^complexity, 0, (2.0)^complexity ]

491

"" three [(1.5)^complexity, (1.01)^complexity, 0 ]

492

}

493

</pre>

494

495

bayesol looks in its input for the regular expression "Polly", and if it

496

is found, then for misclassifications away from two,

497

it uses the row with the infinite values, otherwise it uses the default

498

row, which starts with "". If you have several rows with regular expressions

499

for each category, bayesol always uses the first one from the top which

500

matches within the input. You must always have at least a default row for

501

every category.

502

503

The regular expression facility can also be used to perform more complicated

504

document dependent loss calculations. Suppose you like to count the number

505

of lines of the input document which start with the character '>', as a

506

proportion of the total number of lines in the document. The following perl script transcribes its input and appends the calculated proportion.

507

<pre>

508

#!/usr/bin/perl

509

# this is file prop.pl

510

511

$special = $normal = 0;

512

while(<SDTIN>) {

513

$special++ if /^ >/;

514

$normal++;

515

print;

516

}

517

$prop = $special/$normal;

518

print "proportion: $prop\n";

519

</pre>

520

521

If we used this script, then we could take the output of dbacl, append the

522

proportion of lines containing '>', and pass the result as input to bayesol.

523

For example, the following line is included in the example3.risk

524

specification

525

<pre>

526

"^proportion: ([0-9.]+)" one [ 0, (1+$1)^complexity, (1.2)^complexity ]

527

</pre>

528

529

and through this, bayesol reads, if present,

530

the line containing the proportion we

531

calculated and

532

takes this into account when it constructs the loss matrix.

533

You can try this like so:

534

<pre>

535

% dbacl -T email -c one -c two -c three sample6.txt -nav \

536

| perl prop.pl | bayesol -c example3.risk -v

537

</pre>

538

539

Note that in the loss_matrix specification

540

above, $1 refers to the numerical value of the

541

quantity inside the parentheses. Also, it is useful to remember that

542

when using the -a switch, dbacl outputs all the original lines

543

from unknown.txt with an extra space in front of them. If another

544

instance of dbacl needs to read this output again (e.g. in a pipeline),

545

then the latter should be invoked with the -A switch.

546

547

<h2>Miscellaneous</h2>

548

549

Be careful when classifying very small strings.

550

Except for the multinomial models (which includes the default model),

551

the dbacl calculations are optimized for large strings

552

with more than 20 or 30 features.

553

For small text lines, the complex models give only approximate scores.

554

In those cases, stick with unigram models, which are always exact.

555

556

In the UNIX philosophy, programs are small and do one thing well. Following this philosophy, dbacl essentially only reads plain text documents. If you have non-textual documents (word, html, postscript) which you want to learn from, you will need to use specialized tools to first convert these into plain text. There are many free tools available for this.

557

558

dbacl has limited support for reading mbox files (UNIX email) and can filter out html tags in a quick and dirty way, however this is only intended as a convenience, and should not be relied upon to be fully accurate.

559

560

<h2><a name="appendix">Appendix A: memory requirements</a></h2>

561

562

When experimenting with complicated models, dbacl will quickly fill up its hash

563

tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, which is enough for 32,000 unique features and produces a 512K category file on my system. You can use the -h switch to select hash table size, in powers of two. Beware that learning takes much more memory than classifying. Use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying but 17 bytes for learning.

564

565

For testing, I use the collected works of Mark Twain, which is a 19MB pure text file. Timings are on a 500Mhz Pentium III.

566

567

568

<tr>

569

<td>command</td>

570

<td>Unique features</td>

571

<td>Category size</td>

572

<td>Learning time</td>

573

</tr>

574

<tr>

575

<td>dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16</td>

576

577

578

579

</tr>

580

<tr>

581

<td>dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20</td>

582

583

584

585

</tr>

586

<tr>

587

<td>dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22</td>

588

589

590

591

</tr>

592

</table>

593

594

As can be seen from this table, including bigrams and trigrams has a noticeable

595

memory and performance effect during learning. Luckily, classification speed

596

is only affected by the number of features found in the unknown document.

597

598

599

<tr>

600

<td>command</td>

601

<td>features</td>

602

<td>Classification time</td>

603

</tr>

604

<tr>

605

<td>dbacl -c twain1 Twain-Collected_Works.txt</td>

606

<td>unigrams</td>

607

608

</tr>

609

<tr>

610

<td>dbacl -c twain2 Twain-Collected_Works.txt</td>

611

<td>unigrams and bigrams</td>

612

613

</tr>

614

<tr>

615

<td>dbacl -c twain3 Twain-Collected_Works.txt</td>

616

<td>unigrams, bigrams and trigrams</td>

617

618

</tr>

619

</table>

620

621

The heavy memory requirements during learning of complicated models can be

622

reduced at the expense of the model itself. dbacl has a feature decimation switch

623

which slows down the hash table filling rate by simply ignoring many of the

624

features found in the input.

625

626

<h2><a name="appendix2">Appendix B: Extreme probabilities</a></h2>

627

628

Why is the result of a dbacl probability calculation always so accurate?

629

<pre>

630

% dbacl -c one -c two -c three sample4.txt -N

631

one 0.00% two 100.00% three 0.00%

632

</pre>

633

634

The reason for this has to do with the type of model which dbacl uses. Let's

635

look at some scores:

636

<pre>

637

% dbacl -c one -c two -c three sample4.txt -n

638

one 10512.00 two 7908.17 three 10420.36

639

% dbacl -c one -c two -c three sample4.txt -nv

640

one 20.25 * 519 two 15.24 * 519 three 20.08 * 519

641

</pre>

642

643

The first set of numbers are minus the logarithm (base 2) of each category's

644

probability of producing the full document sample4.txt. This represents the

645

evidence away from each category, and is measured in bits.

646

one and three are fairly even, but two has by far

647

the lowest score and hence highest probability (in other words, the model

648

for two is the least bad at predicting sample4.txt, so if there are only

649

three possible choices, it's the best).

650

To understand these numbers, it's best to split each of them up into

651

a product of cross entropy and complexity, as is done in the second line.

652

653

Remember that dbacl calculates probabilities about resemblance

654

by weighing the evidence for all the features found in the input document.

655

There are 519 features in sample4.txt, and each feature contributes on average 20.25 bits of evidence against category one, 15.24 bits against category two and 20.08 bits against category three. Let's look at what happens if we only look at the first 25 lines of sample4.txt:

656

<pre>

657

% head -25 sample4.txt | dbacl -c one -c two -c three -nv

658

one 19.01 * 324 two 14.61 * 324 three 18.94 * 324

659

</pre>

660

661

There are fewer features in the first 25 lines of sample4.txt than in the full

662

text file, but the picture is substantially unchanged.

663

<pre>

664

% head -25 sample4.txt | dbacl -c one -c two -c three -N

665

one 0.00% two 100.00% three 0.00%

666

</pre>

667

668

dbacl is still very sure, because it has looked at many features (324) and found

669

small differences which add up to quite different scores. However, you can see that

670

each feature now contributes less information (19.01, 14.61, 18.98) bits compared

671

to the earlier (20.25, 15.24, 20.08).

672

673

Since category two is obviously the best (closest to zero) choice among the

674

three models, let's drop it for a moment and consider the other two categories.

675

We also reduce dramatically the number of features (words) we shall look at. The first line

676

of sample4.txt has 15 words:

677

<pre>

678

% head -1 sample4.txt | dbacl -c one -c three -N

679

one 13.90% three 86.10%

680

</pre>

681

Finally, we are getting probabilities we can understand! Unfortunately, this

682

is somewhat misleading. Each of the 15 words gave a score

683

and these scores were added for each category. Since both categories here are about equally

684

bad at predicting words in sample4.txt, the difference in the final scores for category

685

one and three amounts to less than 3 bits of information, which is why the

686

probabilities are mixed:

687

<pre>

688

% head -1 sample4.txt | dbacl -c one -c three -nv

689

one 15.47 * 15 three 15.30 * 15

690

</pre>

691

692

So the interpretation of the probabilities is clear. dbacl weighs the

693

evidence from each feature it finds, and reports the best fit among the choices

694

it is offered. Because it sees so many features separately, it believes

695

its verdict is very sure. Of course, whether these

696

features are the right features to look at for best classification is another

697

matter entirely, and it's entirely up to you to decide.

698

</body>

699

</html>

b'\\ No newline at end of file'

Older »