<title>Language models, classification and dbacl</title>

<h1><center>Language models, classification and dbacl</center></h1>

<center><p>Laird A. Breyer</p></center>
This is a non-mathematical tutorial on how to use the dbacl Bayesian text
classifier. The mathematical details can be read <a href="dbacl.ps">here</a>.

<a href="http://www.lbreyer.com/gpl.html">dbacl</a> is a UNIX command
line tool, so you will need to work at the shell prompt (here written
%, even though we use Bash semantics). The program comes with five
sample text documents and a few scripts. Look for them in the same
directory as this tutorial, or you can use any other plain text
documents instead. Make sure the sample documents you will use are in
the current working directory. You need all <i>*.txt</i>, <i>*.pl</i>
and <i>*.risk</i> files.

The tutorial below is geared towards generic text classification. If you
intend to use dbacl for email classification, please read <a href="email.html">this</a>.
<h2>For the impatient</h2>

dbacl has two major modes of operation. The first is learning mode, where one
or more text documents are analysed to find out what makes them look the way
they do. At the shell prompt, type (without the leading %)

% dbacl -l one sample1.txt
% dbacl -l two sample2.txt

This creates two files named <i>one</i> and <i>two</i>, which contain the important features of
<i>sample1.txt</i> and <i>sample2.txt</i> respectively.

The second major mode is classification mode.
Let's say that you want to see if <i>sample3.txt</i> is closer to <i>sample1.txt</i>
or <i>sample2.txt</i>; type

% dbacl -c one -c two sample3.txt -v

and dbacl should tell you it thinks <i>sample3.txt</i> is less like <i>two</i> (which is the
category learned from <i>sample2.txt</i>) and more like <i>one</i> (which is the
category learned from <i>sample1.txt</i>). That's it.
You can create as many categories as you want, <i>one</i>, <i>two</i>, <i>three</i>, <i>good</i>, <i>bad</i>, <i>important</i>, <i>jokes</i>, but remember that each one must be learned from a representative
collection of plain text documents.

dbacl is designed to be easy to use within a script, so you can make it part of
your own projects, perhaps a spam detection script, or an agent which automatically downloads the latest newspaper articles on your favourite topic...

<i>Tip:</i> The category files are created in the current directory, or whichever path you indicate. If you want, you can keep all your category files in a single common directory. For example, if you use the bash shell, type

% echo "export DBACL_PATH=$HOME/.dbacl" >> $HOME/.bashrc
% source $HOME/.bashrc

From now on, all your categories will be kept in $HOME/.dbacl, and searched for there.
<h2>Language models</h2>

dbacl works by scanning the text it learns for features, which can be nearly
anything you like. For example, unless you tell it otherwise, the standard
features are all alphabetic single words in the document. dbacl builds a
statistical model, i.e. a probability distribution, based only on those features,
so anything that is not a feature will be ignored both during learning and
during classification.

This dependence on features is a double-edged sword. It helps dbacl
focus on the things that matter (single alphabetic words by default), but
if something else matters more, it will be ignored unless you tell dbacl
that it's a feature. This is the hard part, and it's up to you.
When telling dbacl what kind of features to look out for, you must use the language of regular expressions. For example, if you think the only interesting features are words which contain the letter 'q', then you would type

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
        -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt

The rule is that dbacl always takes as a feature whatever it finds within round brackets.
Reading this can be painful if you don't know regular expressions, however.

In English, the first expression after the -g option above reads: take as a feature
any string which looks like: <b>"start of the line"</b> (written ^), followed by <b>"zero or more
characters in the range a-z or A-Z"</b> (written [a-zA-Z]*), followed by <b>"the character q"</b> (written q), followed by <b>"zero or more characters in the range a-z or A-Z"</b> (written [a-zA-Z]*). The second expression is nearly identical: <b>"a single character which is not in the range a-zA-Z"</b> (written [^a-zA-Z]), followed by <b>"zero or more characters in the range a-z or A-Z"</b> (can you guess?), followed by <b>"the character q"</b>, followed by <b>"zero or more characters in the range a-z or A-Z"</b>. The single quote marks are used to keep each whole expression together.

A regular expression is a simultaneous superposition of many text strings. Just
like a word, you read and write it one character at a time.
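You can preview what such an expression matches without dbacl at all, using grep. This is only a rough illustration (the sample sentence is made up), but grep -o prints one match per line, much like dbacl's -D option described further below:

```shell
# Extract every word containing the letter 'q', one match per line.
# -E selects extended regular expressions, -o prints only the matching part.
printf 'The quick fox quietly quit the aquarium.\n' \
  | grep -oE '[a-zA-Z]*q[a-zA-Z]*'
```

This prints quick, quietly, quit and aquarium, each on its own line.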
<table border="1" rules="all">
<tr>
<td><b>Symbol</b></td>
<td><b>What it means</b></td>
</tr>
<tr>
<td>.</td>
<td>any character except newline</td>
</tr>
<tr>
<td>*</td>
<td>zero or more copies of preceding character or parenthesized expression</td>
</tr>
<tr>
<td>+</td>
<td>one or more copies of preceding character or parenthesized expression</td>
</tr>
<tr>
<td>?</td>
<td>zero or one copies of preceding character or parenthesized expression</td>
</tr>
<tr>
<td>^</td>
<td>beginning of line</td>
</tr>
<tr>
<td>[abc]</td>
<td>one character equal to a, b or c</td>
</tr>
<tr>
<td>[^abc]</td>
<td>one character not equal to a, b or c</td>
</tr>
<tr>
<td>\*, \?, or \.</td>
<td>the actual character *, ? or .</td>
</tr>
</table>
To get a feel for the kinds of features taken into account by dbacl in the example above, you can use the -D option. Retype the above in the slightly changed form

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
        -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D | head -10

This lists the first few matches, one per line, which exist in the <i>sample2.txt</i> document. Obviously, only taking into account features which consist of words with the letter 'q' in them makes a poor model.
Sometimes, it's convenient to use parentheses which you want to throw away. dbacl
understands the special notation ||xyz which you can place at the end of a regular expression, where x, y, z should be digits identifying the parentheses whose contents you want to keep as features.

Here is an example for mixed Japanese and English documents, which matches alphabetic words and single ideograms:

% LANG=ja_JP dbacl -D -l konichiwa japanese.txt -i \
  -g '(^|[^a-zA-Z0-9])([a-zA-Z0-9]+|[[:alpha:]])||2'

Note that you need a multilingual terminal and Japanese fonts to view this, and
your computer must have a Japanese locale available.

In the table below, you will find a list of some simple regular expressions to get you started:
<table border="1" rules="all">
<tr>
<td><b>If you want to match...</b></td>
<td><b>Then you need this expression...</b></td>
<td><b>Examples</b></td>
</tr>
<tr>
<td>alphabetic words</td>
<td>(^|[^[:alpha:]]) ([[:alpha:]]+) ||2</td>
<td>hello, kitty</td>
</tr>
<tr>
<td>words in capitals</td>
<td>(^|[^A-Z]) ([A-Z]+) ||2</td>
<td>MAKE, MONEY, FAST</td>
</tr>
<tr>
<td>strings of characters separated by spaces</td>
<td>(^|[ ]) ([^ ]+) ||2</td>
<td>w$%&tf9(, amazing!, :-)</td>
</tr>
<tr>
<td>times of the day</td>
<td>(^|[^0-9]) ([0-9]?[0-9]:[0-9][0-9](am|pm)) ||2</td>
<td>9:17am, 12:30pm</td>
</tr>
<tr>
<td>words which end in a number</td>
<td>(^|[^a-zA-Z0-9]) ([a-zA-Z]+[0-9]+) [^a-zA-Z] ||2</td>
<td>borg17234, A1</td>
</tr>
<tr>
<td>alphanumeric word pairs</td>
<td>(^|[^[:alnum:]]) ([[:alnum:]]+) [^[:alnum:]]+ ([[:alnum:]]+) ||23</td>
<td>good morning, how are</td>
</tr>
</table>
The last entry in the table above shows how to take word pairs as features.
Such models are called bigram models, as opposed to the unigram models whose
features are only single words, and they are used to capture extra information.

For example, in a unigram model the word pairs "well done" and "done well" have
the same probability. A bigram model can learn that "well done" is more common in food related documents (provided this combination of words was actually found within the learning corpus).
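To see why word order matters, you can count bigrams by hand with standard tools. This is only a toy illustration of the idea, not part of dbacl:

```shell
# Emit every adjacent word pair (bigram) in the input, then count them.
# A unigram model would see identical counts for "well done" and "done well".
printf 'well done is not done well\n' \
  | awk '{ for (i = 1; i < NF; i++) print $i, $(i+1) }' \
  | sort | uniq -c
```

Here "well done" and "done well" come out as two distinct bigrams, which is exactly the distinction a unigram model cannot make.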
However, there is a big statistical problem: because there exist many more meaningful bigrams than unigrams, you'll need a much bigger corpus to obtain meaningful statistics. One way around this is a technique called smoothing, which predicts unseen bigrams from already seen unigrams. To obtain such a
combined unigram/bigram alphabetic word model, type

% dbacl -l smooth -g '(^|[^a-zA-Z])([a-zA-Z]+)||2' \
        -g '(^|[^a-zA-Z])([a-zA-Z]+)[^a-zA-Z]+([a-zA-Z]+)||23' sample1.txt
If all you want are alphabetic bigrams, trigrams, etc., there is a special switch
-w you can use. The command

% dbacl -l slick -w 2 sample1.txt

produces a model <i>slick</i> which is nearly identical to <i>smooth</i> (the difference is that a regular expression cannot straddle newlines, but -w ngrams can).

Obviously, all this typing is getting tedious, and you will eventually want to
automate the learning stage in a shell script. Use regular expressions sparingly, as
they can quickly degrade the performance (speed and memory) of dbacl. See
<a href="#appendix">Appendix A</a> for ways around this.
<h2>Evaluating the models</h2>

Now that you have a grasp of the variety of language models which dbacl can generate, the important question is: what set of features should you use?

There is no easy answer to this problem. Intuitively, a larger
set of features seems always preferable, since it takes more information into account.
However, there is a tradeoff. Comparing more features requires extra memory,
but much more importantly, too many features can <i>overfit</i> the data. This results
in a model which is so good at predicting the learned documents
that virtually no other documents are considered even remotely similar.

It is beyond the scope of this tutorial to describe the variety of statistical
methods which can help decide what features are meaningful. However, to get a
rough idea of the quality of the model, we can look at the cross entropy of each
category.
The cross entropy is measured in bits and has the following meaning:
If we use our probabilistic model to construct an optimal compression algorithm,
then the cross entropy of a text string is the predicted number of bits which is needed on average, after compression, for each separate feature.
This rough description isn't complete, since the cross entropy doesn't measure the amount of space also needed for the probability model itself.

To compute the cross entropy of category <i>one</i>, type

% dbacl -c one sample1.txt -vn
cross_entropy 7.60 bits complexity 678

The cross entropy is the first value returned. The second value essentially
measures how many features describe the document.
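Multiplying the two numbers gives the total evidence, i.e. the predicted compressed size of the document under the model. A back-of-the-envelope check using the figures above:

```shell
# 7.60 bits per feature times 678 features = predicted compressed size.
awk 'BEGIN {
  bits = 7.60 * 678
  printf "%.0f bits, about %.0f bytes\n", bits, bits / 8
}'
```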
Now suppose we try other models trained on the same document:

% dbacl -c slick sample1.txt -vn
cross_entropy 4.74 bits complexity 677
% dbacl -c smooth sample1.txt -vn
cross_entropy 5.27 bits complexity 603

According to these estimates, both bigram models fit <i>sample1.txt</i> better. This is
easy to see for <i>slick</i>, since the complexity (essentially the number of features)
is nearly the same as for <i>one</i>. But <i>smooth</i> looks at fewer features, and actually
compresses them just slightly better in this case. Let's ask dbacl which category fits better:

% dbacl -c one -c slick -c smooth sample1.txt -v

WARNING: dbacl doesn't yet cope well with widely different feature sets. Don't try to compare
categories built on completely different feature specifications unless you fully understand the
statistical implications.
<h2>Decision Theory</h2>

If you've read this far, then you probably intend to use dbacl to
automatically classify text documents, and possibly execute
certain actions depending on the outcome. The bad news is that dbacl isn't designed for this. The good news is that there is a companion program, bayesol,
which is. To use it, you just need to learn some Bayesian Decision Theory.

We'll suppose that the document <i>sample4.txt</i> must be classified in one of the
categories <i>one</i>, <i>two</i> and <i>three</i>.
To make optimal decisions, you'll need three ingredients: a <b>prior distribution</b>,
a set of <b>conditional probabilities</b> and a <b>measure of risk</b>. We'll get to these
in turn.

The <b>prior distribution</b> is a set of weights, which you must choose yourself,
representing your beforehand beliefs. You choose this before you even look at
<i>sample4.txt</i>. For example, you might know from experience that category <i>one</i> is twice as
likely as <i>two</i> and <i>three</i>, and choose the weights
<i>one</i>:2, <i>two</i>:1, <i>three</i>:1 to reflect this. If you have no idea what to
choose, give each category an equal weight (<i>one</i>:1, <i>two</i>:1, <i>three</i>:1).
Next, we need <b>conditional probabilities</b>. This is what dbacl is for. Type

% dbacl -l three sample3.txt
% dbacl -c one -c two -c three sample4.txt -N
one 0.00% two 100.00% three 0.00%

As you can see, dbacl is 100% sure that <i>sample4.txt</i> resembles category <i>two</i>.
Such accurate answers are typical with the kinds of models used by dbacl.
In reality, the probabilities for <i>one</i> and <i>three</i> are very, very small and
the probability for <i>two</i> is really close, but not equal, to 1.
See <a href="#appendix2">Appendix B</a> for a rough explanation.
We combine the prior (which represents your own beliefs and experiences) with
the conditionals (which represent what dbacl thinks about <i>sample4.txt</i>) to obtain
a set of <b>posterior probabilities</b>. In our example,

<ul>
<li>Posterior probability that <i>sample4.txt</i> resembles <i>one</i>: 0%*2/(2+1+1) = 0%
<li>Posterior probability that <i>sample4.txt</i> resembles <i>two</i>: 100%*1/(2+1+1) = 100%
<li>Posterior probability that <i>sample4.txt</i> resembles <i>three</i>: 0%*1/(2+1+1) = 0%
</ul>

Okay, so here the prior doesn't have much of an effect. But it's
there if you need it.
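The arithmetic is easy to script yourself. Here's a sketch in awk, with the prior weights and conditional probabilities hard-coded from this example (multiply each prior weight by its conditional probability, then normalize so the results sum to 100%):

```shell
# Posterior is proportional to prior weight times conditional probability.
awk 'BEGIN {
  split("one two three", cat, " ")
  prior["one"] = 2;   prior["two"] = 1;   prior["three"] = 1
  cond["one"]  = 0.0; cond["two"]  = 1.0; cond["three"] = 0.0
  total = 0
  for (i = 1; i <= 3; i++) total += prior[cat[i]] * cond[cat[i]]
  for (i = 1; i <= 3; i++)
    printf "%s %.2f%%\n", cat[i], 100 * prior[cat[i]] * cond[cat[i]] / total
}'
```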
Now comes the tedious part.
What you really want to do
is take these posterior probabilities under advisement, and make
an informed decision.

To decide which category best suits your own plans, you need to work
out the <b>costs of misclassifications</b>. Only you can decide these numbers, and there
are many, but at the end, you'll have worked out your risk. Here's an example:
<ul>
<li>If <i>sample4.txt</i> is like <i>one</i> but it ends up marked like <i>one</i>, then the cost is <b>0</b>
<li>If <i>sample4.txt</i> is like <i>one</i> but it ends up marked like <i>two</i>, then the cost is <b>1</b>
<li>If <i>sample4.txt</i> is like <i>one</i> but it ends up marked like <i>three</i>, then the cost is <b>2</b>
<li>If <i>sample4.txt</i> is like <i>two</i> but it ends up marked like <i>one</i>, then the cost is <b>3</b>
<li>If <i>sample4.txt</i> is like <i>two</i> but it ends up marked like <i>two</i>, then the cost is <b>0</b>
<li>If <i>sample4.txt</i> is like <i>two</i> but it ends up marked like <i>three</i>, then the cost is <b>5</b>
<li>If <i>sample4.txt</i> is like <i>three</i> but it ends up marked like <i>one</i>, then the cost is <b>1</b>
<li>If <i>sample4.txt</i> is like <i>three</i> but it ends up marked like <i>two</i>, then the cost is <b>1</b>
<li>If <i>sample4.txt</i> is like <i>three</i> but it ends up marked like <i>three</i>, then the cost is <b>0</b>
</ul>

These numbers are often placed in a table called the loss matrix (this
way, you can't forget a case), like so:
<table border="1" rules="all">
<tr>
<td rowspan="2"><b>correct category</b></td>
<td colspan="3"><b>misclassified as</b></td>
</tr>
<tr>
<td><i>one</i></td>
<td><i>two</i></td>
<td><i>three</i></td>
</tr>
<tr>
<td><i>one</i></td>
<td>0</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td><i>two</i></td>
<td>3</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td><i>three</i></td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
</table>
We are now ready to combine all these numbers to obtain the True Bayesian Decision.
For every possible category, we simply weigh the risk with the posterior
probabilities of obtaining each of the possible misclassifications. Then we choose the category with the least expected posterior risk.

<ul>
<li>For category <i>one</i>, the expected risk is <b>0</b>*0% + <b>3</b>*100% + <b>1</b>*0% = <b>3</b>
<li>For category <i>two</i>, the expected risk is <b>1</b>*0% + <b>0</b>*100% + <b>1</b>*0% = <b>0</b> &lt;-- smallest
<li>For category <i>three</i>, the expected risk is <b>2</b>*0% + <b>5</b>*100% + <b>0</b>*0% = <b>5</b>
</ul>

The lowest expected risk is for category <i>two</i>, so that's the category we choose
to represent <i>sample4.txt</i>. Done!
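The same weighing can be scripted. Here is a sketch in awk with the loss matrix and posteriors from this example hard-coded (bayesol, introduced below, does all of this for you):

```shell
# Expected risk of choosing column j = sum over true categories i of
# loss[i,j] * posterior[i]; the best decision minimizes this.
awk 'BEGIN {
  split("one two three", cat, " ")
  # loss[i,j]: cost of marking a document of true category i as category j
  split("0 1 2  3 0 5  1 1 0", L, " ")
  post[1] = 0; post[2] = 1; post[3] = 0   # posteriors from the example above
  bestrisk = -1
  for (j = 1; j <= 3; j++) {
    risk = 0
    for (i = 1; i <= 3; i++) risk += L[(i-1)*3 + j] * post[i]
    printf "%s %g\n", cat[j], risk
    if (bestrisk < 0 || risk < bestrisk) { bestrisk = risk; best = cat[j] }
  }
  print "choose: " best
}'
```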
Of course, the loss matrix above doesn't really have an effect on the
probability calculations, because the conditional probabilities strongly point to
category <i>two</i> anyway. But now you understand how the calculation works. Below, we'll look at a more realistic example (but still specially chosen to illustrate some points).

One last point: you may wonder how dbacl itself decides which category to
display when classifying with the -v switch. The simple answer is that dbacl always
displays the category with maximal conditional probability (often called the MAP estimate). This is mathematically equivalent to the special case of decision theory where the prior has equal weights and the loss matrix takes the value 1 everywhere except on the diagonal (i.e. correct classifications have no cost, and every misclassification costs 1).
<h2>Using bayesol</h2>

bayesol is a companion program for dbacl which makes the decision calculations
easier. The bad news is that you still have to write down a prior and loss matrix
yourself. Eventually, someone, somewhere may write a graphical interface.

bayesol reads a risk specification file, which is a text file containing information about the categories required, the prior distribution and the cost of
misclassifications. For the toy example discussed earlier, the file <i>toy.risk</i> looks like this:

Let's see if our hand calculation was correct:

% dbacl -c one -c two -c three sample4.txt -vna | bayesol -c toy.risk -v

Good! However, as discussed above, the misclassification costs need
improvement. This is completely up to you, but here are some possible
suggestions to get you started.
To devise effective loss matrices, it pays to think about the way that dbacl
computes the probabilities. <a href="#appendix2">Appendix B</a> gives some
details, but we don't need to go that far. Recall that the language models are
based on features (which are usually kinds of words).
Every feature counts towards the final probabilities, and a big document
will have more features, hence more opportunities to steer the
probabilities one way or another. So a feature is like an information
bearing unit of text.

When we read a text document which doesn't accord with our expectations, we
grow progressively more annoyed as we read further into the text. This is like
an annoyance interest rate which compounds on information units within the text.
For dbacl, the number of information bearing units is reported as the complexity.

This suggests that the cost of reading a misclassified document could have the
form (1 + interest)^complexity. Here's an example loss_matrix which uses this idea:
"" one   [ 0,                  (1.1)^complexity,  (1.1)^complexity ]
"" two   [ (1.1)^complexity,   0,                 (1.7)^complexity ]
"" three [ (1.5)^complexity,   (1.01)^complexity, 0 ]

Remember, these aren't monetary interest rates, they are value judgements.
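To get a feel for how fast such compound costs grow with document size, try the following (illustrative numbers only):

```shell
# A 10% "annoyance rate" compounded over the complexity of the document:
# even modest documents reach astronomical costs, which is fine, because
# the loss matrix entries only matter relative to one another.
awk 'BEGIN {
  for (c = 100; c <= 700; c += 300)
    printf "complexity %d: cost %.2e\n", c, 1.1 ^ c
}'
```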
You can see this loss matrix in action by typing

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example1.risk -v
Now if we increase the cost of misclassifying <i>two</i> as <i>three</i> from
1.7 to 2.0, the optimal category changes:

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example2.risk -v

bayesol can also handle infinite costs. Just write "inf" where you need it.
This is particularly useful with regular expressions. If you look at each
row of loss_matrix above, you see an empty string "" before each category.
This indicates that this row is to be used by default in the actual loss matrix.
But sometimes, the losses can depend on seeing a particular string in the document we want to classify.
Suppose you normally like to use the loss matrix above, but if the
document contains the word "Polly", then the cost of misclassification
is infinite. Here is an updated loss_matrix:

""      one   [ 0,                (1.1)^complexity,  (1.1)^complexity ]
"Polly" two   [ inf,              0,                 inf ]
""      two   [ (1.1)^complexity, 0,                 (2.0)^complexity ]
""      three [ (1.5)^complexity, (1.01)^complexity, 0 ]

bayesol looks in its input for the regular expression "Polly", and if it
is found, then for misclassifications away from <i>two</i>,
it uses the row with the infinite values; otherwise it uses the default
row, which starts with "". If you have several rows with regular expressions
for each category, bayesol always uses the first one from the top which
matches within the input. You must always have at least a default row for
each category.
The regular expression facility can also be used to perform more complicated
document dependent loss calculations. Suppose you like to count the number
of lines of the input document which start with the character '>', as a
proportion of the total number of lines in the document. The following perl script transcribes its input and appends the calculated proportion.

# this is file prop.pl
$special = $normal = 0;
while(<STDIN>) {
    $special++ if /^ ?>/;  # count lines starting with '>' (allowing the
                           # extra leading space added by dbacl -a)
    $normal++;
    print;
}
$prop = ($normal > 0) ? $special/$normal : 0;
print "proportion: $prop\n";
If we use this script, then we can take the output of dbacl, append the
proportion of lines starting with '>', and pass the result as input to bayesol.
For example, the following line is included in the <i>example3.risk</i> file:

"^proportion: ([0-9.]+)" one [ 0, (1+$1)^complexity, (1.2)^complexity ]

Through this, bayesol reads, if present,
the line containing the proportion we computed, and
takes this into account when it constructs the loss matrix.
You can try this like so:

% dbacl -T email -c one -c two -c three sample6.txt -nav \
  | perl prop.pl | bayesol -c example3.risk -v

Note that in the loss_matrix specification
above, $1 refers to the <i>numerical</i> value of the
quantity inside the parentheses. Also, it is useful to remember that
when using the -a switch, dbacl outputs all the original lines
of the unknown document with an extra space in front of them. If another
instance of dbacl needs to read this output again (e.g. in a pipeline),
then the latter should be invoked with the -A switch.
<h2>Miscellaneous</h2>

Be careful when classifying very small strings.
Except for the multinomial models (which include the default model),
the dbacl calculations are optimized for large strings
with more than 20 or 30 features.
For small text lines, the complex models give only approximate scores.
In those cases, stick with unigram models, which are always exact.

In the UNIX philosophy, programs are small and do one thing well. Following this philosophy, dbacl essentially only reads plain text documents. If you have non-textual documents (Word, HTML, PostScript) which you want to learn from, you will need to use specialized tools to first convert these into plain text. There are many free tools available for this.

dbacl has limited support for reading mbox files (UNIX email) and can filter out HTML tags in a quick and dirty way, but this is only intended as a convenience, and should not be relied upon to be fully accurate.
<h2><a name="appendix">Appendix A: memory requirements</a></h2>

When experimenting with complicated models, dbacl will quickly fill up its hash
tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, which is enough for 32,000 unique features and produces a 512K category file on my system. You can use the -h switch to select the hash table size, as a power of two. Beware that learning takes much more memory than classifying. Use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying but 17 bytes for learning.
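The arithmetic behind those defaults is straightforward (the 17 bytes per feature figure is the learning cost quoted above):

```shell
# A hash table size of 15 means 2^15 slots; at 17 bytes per feature
# while learning, the table itself needs roughly half a megabyte.
awk 'BEGIN {
  slots = 2 ^ 15
  printf "%d features, about %dK while learning\n", slots, slots * 17 / 1024
}'
```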
For testing, I use the collected works of Mark Twain, which is a 19MB pure text file. Timings are on a 500MHz Pentium III.
<table border="1" rules="all">
<tr>
<td><b>command</b></td>
<td><b>Unique features</b></td>
<td><b>Category size</b></td>
<td><b>Learning time</b></td>
</tr>
<tr>
<td>dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16</td>
<td align="right">49,251</td>
<td align="right">512K</td>
<td align="right"></td>
</tr>
<tr>
<td>dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20</td>
<td align="right">909,400</td>
<td align="right">6.1M</td>
<td align="right"></td>
</tr>
<tr>
<td>dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22</td>
<td align="right">3,151,718</td>
<td align="right">24M</td>
<td align="right"></td>
</tr>
</table>

As can be seen from this table, including bigrams and trigrams has a noticeable
memory and performance effect during learning. Luckily, classification speed
is only affected by the number of features found in the unknown document.
<table border="1" rules="all">
<tr>
<td><b>command</b></td>
<td><b>features</b></td>
<td><b>Classification time</b></td>
</tr>
<tr>
<td>dbacl -c twain1 Twain-Collected_Works.txt</td>
<td>unigrams</td>
<td align="right">0m4.860s</td>
</tr>
<tr>
<td>dbacl -c twain2 Twain-Collected_Works.txt</td>
<td>unigrams and bigrams</td>
<td align="right">0m8.930s</td>
</tr>
<tr>
<td>dbacl -c twain3 Twain-Collected_Works.txt</td>
<td>unigrams, bigrams and trigrams</td>
<td align="right">0m12.750s</td>
</tr>
</table>
The heavy memory requirements during learning of complicated models can be
reduced at the expense of the model itself. dbacl has a feature decimation switch
which slows down the hash table filling rate by simply ignoring many of the
features found in the input.
<h2><a name="appendix2">Appendix B: Extreme probabilities</a></h2>

Why is the result of a dbacl probability calculation always so accurate?

% dbacl -c one -c two -c three sample4.txt -N
one 0.00% two 100.00% three 0.00%

The reason for this has to do with the type of model which dbacl uses. Let's
look at the underlying scores:

% dbacl -c one -c two -c three sample4.txt -n
one 10512.00 two 7908.17 three 10420.36
% dbacl -c one -c two -c three sample4.txt -nv
one 20.25 * 519 two 15.24 * 519 three 20.08 * 519

The first set of numbers are minus the logarithm (base 2) of each category's
probability of producing the full document <i>sample4.txt</i>. This represents the
evidence <i>away</i> from each category, and is measured in bits.
<i>one</i> and <i>three</i> are fairly even, but <i>two</i> has by far
the lowest score and hence the highest probability (in other words, the model
for <i>two</i> is the least bad at predicting <i>sample4.txt</i>, so if there are only
three possible choices, it's the best).
To understand these numbers, it's best to split each of them up into
a product of cross entropy and complexity, as is done in the second line.

Remember that dbacl calculates probabilities about resemblance
by weighing the evidence for all the features found in the input document.
There are 519 features in <i>sample4.txt</i>, and each feature contributes on average 20.25 bits of evidence against category <i>one</i>, 15.24 bits against category <i>two</i> and 20.08 bits against category <i>three</i>. Let's look at what happens if we only look at the first 25 lines of <i>sample4.txt</i>:
% head -25 sample4.txt | dbacl -c one -c two -c three -nv
one 19.01 * 324 two 14.61 * 324 three 18.94 * 324

There are fewer features in the first 25 lines of <i>sample4.txt</i> than in the full
text file, but the picture is substantially unchanged.

% head -25 sample4.txt | dbacl -c one -c two -c three -N
one 0.00% two 100.00% three 0.00%

dbacl is still very sure, because it has looked at many features (324) and found
small differences which add up to quite different scores. However, you can see that
each feature now contributes less information (19.01, 14.61, 18.94 bits) compared
to the earlier (20.25, 15.24, 20.08).
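Per-feature differences of a few bits compound quickly. For instance, in the full document each feature favours <i>two</i> over <i>one</i> by about 20.25 - 15.24 = 5.01 bits, and every bit of evidence doubles the odds. A rough calculation:

```shell
# 5.01 bits per feature means each feature makes category two roughly
# 32 times more likely than category one; over 519 features the odds
# become astronomically lopsided.
awk 'BEGIN { printf "%.1f\n", 2 ^ (20.25 - 15.24) }'
```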
Since category <i>two</i> is obviously the best (closest to zero) choice among the
three models, let's drop it for a moment and consider the other two categories.
We also dramatically reduce the number of features (words) we shall look at. The first line
of <i>sample4.txt</i> has 15 words:

% head -1 sample4.txt | dbacl -c one -c three -N
one 13.90% three 86.10%

Finally, we are getting probabilities we can understand! Unfortunately, this
is somewhat misleading. Each of the 15 words gave a score,
and these scores were added for each category. Since both categories here are about equally
bad at predicting words in <i>sample4.txt</i>, the difference in the final scores for categories
<i>one</i> and <i>three</i> amounts to less than 3 bits of information, which is why the
probabilities are mixed:

% head -1 sample4.txt | dbacl -c one -c three -nv
one 15.47 * 15 three 15.30 * 15

So the interpretation of the probabilities is clear. dbacl weighs the
evidence from each feature it finds, and reports the best fit among the choices
it is offered. Because it sees so many features separately, it believes
its verdict is very sure. Of course, whether these
features are the right features to look at for best classification is another
matter entirely, and it's up to you to decide.