~ubuntu-branches/ubuntu/hardy/dbacl/hardy

Committer: Bazaar Package Importer
Author(s): Zak B. Elep
Date: 2006-03-26 22:35:35 UTC
mfrom: (1.1.2 upstream) (2.1.1 etch)
Revision ID: james.westby@ubuntu.com-20060326223535-bo3m96paoczzz59n

Tags: 1.12-1

http://bugs.debian.org/339394

* New upstream release
  + `dbacl -V' now exits with status 0.  (Closes: #339394)
* debian/rules:
  + Upstream now fixes TREC file permissions.  However, some new scripts got
    added, so this time it's an a+x fix instead of a-x.
* debian/patches:
  + Removed 10_slang2_conversion.patch from Clint, now merged upstream.
  + Updated 20_autotools_update.patch.

files added:
TREC/OPTIONS.TREC2005.1cefhuj

TREC/OPTIONS.TREC2005.2adphu

TREC/OPTIONS.TREC2005.3adphd

TREC/OPTIONS.TREC2005.4adp

TREC/TREC2005.txt

config/depcomp

debian/compat

debian/patches

debian/patches/20_autotools_update.patch

doc/chess

doc/chess/Makefile.am

doc/chess/Makefile.in

doc/chess/combine_half_moves.sh

doc/chess/csfpc1.png

doc/chess/csfpc2.png

doc/chess/csfpc3.png

doc/chess/dce-1.sh

doc/chess/dce-2.sh

doc/chess/dce-3.sh

doc/chess/dce-basic.sh

doc/chess/dce.sh

doc/chess/down.png

doc/chess/randomizer.awk

doc/chess/renorm.awk

doc/chess/spam_chess.html

doc/chess/spoiler.png

man/hypex.1in

src/lint-check.sh

src/splintrc

src/tests/email-style.shin

src/tests/sample.spam-11

src/tests/verify.email-style

files removed:
src/stamp-h.in

files modified:
COPYING

ChangeLog

INSTALL

Makefile.am

Makefile.in

NEWS

TREC/Makefile.am

TREC/Makefile.in

TREC/SFX

aclocal.m4

config/config.guess

config/config.sub

configure

configure.in

contrib/Makefile.in

debian/changelog

debian/control

debian/copyright

debian/rules

doc/Makefile.am

doc/Makefile.in

doc/email.html

man/Makefile.am

man/Makefile.in

man/dbacl.1in

src/Makefile.am

src/Makefile.in

src/catfun.c

src/config.h.in

src/dbacl.c

src/dbacl.h

src/fh.c

src/hypex.c

src/hypex.h

src/icheck.c

src/mailinspect.c

src/mbw.c

src/mbw.h

src/probs.c

src/risk-lexer.c

src/risk-parser.c

src/risk-parser.h

src/risk-parser.y

src/tests/Makefile.am

src/tests/Makefile.in

src/tests/dbacl-g.shin

src/tests/icheck.shin

src/tests/lscheck.shin

src/tests/model-sum1.shin

src/tests/verify.email-xheaders

src/util.c

src/util.h

ts/Makefile.in

man/dbacl.1in

(fragments), selected according to various switches. Learning roughly

works by tweaking token probabilities until the training data is least

surprising.

.SH EXIT STATUS

The normal shell exit conventions aren't followed (sorry!). When using the

.B -l

command form,

.B dbacl

returns zero on success, nonzero if an error occurs. When using the

.B -c

form,

100

.B dbacl

101

returns a positive integer corresponding to the

102

.I category

103

with the highest posterior probability. In case of a tie, the first most probable category is chosen. If an error occurs,

104

.B dbacl

105

returns zero.
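
The exit status convention described here can be sketched as a small helper. This is a minimal illustration, not part of dbacl: the function name `category_from_status` is hypothetical, and it assumes that the positive integer is the 1-based position of the category among the `-c` arguments on the command line.

```python
from typing import Optional, Sequence


def category_from_status(status: int, categories: Sequence[str]) -> Optional[str]:
    """Map the exit status of a `dbacl -c ...` run to a category name.

    `categories` lists the names in the order they were passed with -c
    (assumption: the status is a 1-based index into that order).
    Returns None when the status is 0, which the man page reserves for errors.
    """
    if status == 0:
        return None  # dbacl reports errors as zero in classification mode
    if 1 <= status <= len(categories):
        return categories[status - 1]
    raise ValueError(f"status {status} does not match any of {len(categories)} categories")


if __name__ == "__main__":
    print(category_from_status(2, ["twain", "shake"]))
```

Note the inversion relative to the usual shell convention: in classification mode a zero status is the failure case, so a plain `if dbacl ...; then` test in a shell script would read the result backwards.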

106

.SH DESCRIPTION

107

.PP

108

When using the

140

154

.PP

141

155

To see an output for a classification, you must use at least one of

142

156

the

143

.BR -v , -U , -n , -D , -d

157

.BR -v , -U , -n , -N , -D , -d

144

158

switches. Sometimes, they can be used in combination to produce

145

159

a natural variation of their individual outputs. Sometimes,

146

160

.B dbacl

157

171

of 100% means that

158

172

.B dbacl

159

173

is sure of its choice, while a percentage of 0% means that some other

160

category is equally likely.

174

category is equally likely. This is not the model probability, but measures

175

how unambiguous the classification is, and can

176

be used to tag unsure classifications (e.g. if the confidence is 25% or less).
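
As a rough sketch of tagging unsure classifications, the following parses one line of `-U` output and compares the confidence against a threshold. It assumes the output has the form `category # NN%`, as in the `shake # 34%` example shown in this page; the function names and the regular expression are illustrative only.

```python
import re
from typing import Tuple

# Assumed output shape of `dbacl -U`: "<category> # <confidence>%"
_U_LINE = re.compile(r"^\s*(\S+)\s+#\s+(\d+(?:\.\d+)?)%\s*$")


def parse_confidence(line: str) -> Tuple[str, float]:
    """Split a `-U` output line into (category, confidence percentage)."""
    match = _U_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized -U output: {line!r}")
    return match.group(1), float(match.group(2))


def is_unsure(line: str, threshold: float = 25.0) -> bool:
    """True when the confidence is at or below the threshold (25% by default)."""
    _, confidence = parse_confidence(line)
    return confidence <= threshold


if __name__ == "__main__":
    print(parse_confidence("shake # 34%"), is_unsure("shake # 34%"))
```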

177

.PP

178

The

179

.B -N

180

switch prints each category name followed by its (posterior) probability, expressed as a percentage. The percentages always sum to 100%. This is intuitive, but only valuable if the document

181

being classified contains a handful of tokens (ten or less). In the common

182

case with many more tokens, the probabilities are always extremely close to 100% and 0%.

161

183

.PP

162

184

The

163

185

.B -n

164

186

switch prints each category name followed by the negative logarithm of its

165

probability. The smallest number gives the best category. A more convenient

187

probability. This is equivalent to using the

188

.B -N

189

switch, but much more useful. The smallest number gives the best category. A more convenient

166

190

form is to use both

167

191

.B -n

168

192

and

176

200

of p-value for each category score. This indicates how typical the

177

201

score is compared to the training documents, but only works if the

178

202

.B -X

179

switch was used during learning, and only for some types of models.

203

switch was used during learning, and only for some types of models (e.g. email).

180

204

These p-values are uniformly distributed and independent (if the

181

205

categories are independent), so can be combined using Fisher's chi

182

206

squared test to obtain composite p-values for groupings of categories.
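
For illustration, Fisher's method can be sketched in a few lines. Given k independent p-values, the statistic X = -2 * sum(ln p_i) follows a chi-squared distribution with 2k degrees of freedom, and for an even number of degrees of freedom the tail probability has a closed form. The function name is hypothetical, and how the p-values are extracted from dbacl's output is left out.

```python
import math
from typing import Iterable


def fisher_combined_p(p_values: Iterable[float]) -> float:
    """Combine independent p-values with Fisher's chi-squared method."""
    ps = list(p_values)
    if not ps:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in ps):
        raise ValueError("p-values must lie in (0, 1]")

    # half_stat is X/2, where X = -2 * sum(ln p_i).
    half_stat = -sum(math.log(p) for p in ps)

    # Chi-squared survival function with 2k degrees of freedom:
    #   exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 0.0
    for i in range(len(ps)):
        if i > 0:
            term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Made-up p-values, purely for demonstration.
    print(fisher_combined_p([0.04, 0.20, 0.11]))
```

With a single p-value the function returns that value unchanged, which is a quick sanity check on the formula.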

187

211

.B -X

188

212

switches together print each category name followed by a detailed

189

213

decomposition of the category score, factored into ( divergence rate +

190

shannon entropy rate )* token count @ p-value.

214

shannon entropy rate )* token count @ p-value. Again, this only works in some types of models.

191

215

.PP

192

216

The

193

217

.B -v

376

400

.I deftok

377

401

is "alpha". Possible values for

378

402

.I deftok

379

are "alpha", "alnum", "graph", "cef" and "adp".

380

The last two are custom tokenizers intended for email messages. See also

403

are "alpha", "alnum", "graph", "char", "cef" and "adp".

404

The last two are custom tokenizers intended for email messages. See also

381

405

.BR isalpha (3).

406

The "char" tokenizer picks up single printable characters

407

rather than bigger tokens, and is intended for testing only.

382

408

.IP -f

383

409

Filter each line of input separately, passing to STDOUT only lines

384

410

which match the

611

637

Headers are recognized and most are skipped. To include extra RFC822 standard

612

638

headers (except for trace headers), use the "email:headers" subtype.

613

639

To include

614

trace headers, use the "email:theaders" subtype. To name

640

trace headers, use the "email:theaders" subtype. To include

615

641

all headers in the email, use the "email:xheaders" subtype. To skip all headers,

616

642

except the subject, use "email:noheaders". To scan

617

643

binary attachments for strings, use the "email:atts" subtype.

699

725

.PP

700

726

Note that the

701

727

.B -v

702

option is necessary, otherwise

728

option at least is necessary, otherwise

703

729

.B dbacl

704

730

does not print anything. The return value is

705

1 in the first case, 2 in the second. If you want to print a simple confidence

731

1 in the first case, 2 in the second.

732

.PP

733

.ad

734

% echo "to be or not to be" | dbacl -v -N -c twain -c shake

735

.br

736

twain 22.63% shake 77.37%

737

.br

738

% echo "to be or not to be" | dbacl -v -n -c twain -c shake

739

.br

740

twain 7.04 * 6.0 shake 6.74 * 6.0

741

.ad

742

.PP

743

These invocations are equivalent. The numbers 6.74 and 7.04 represent how close

744

the average token is to each category, and 6.0 is the number of tokens observed. If you want to print a simple confidence

706

745

value together with the best category, replace

707

746

.B -v

708

747

with

709

748

.BR -U .

710

749

.PP

750

.na

751

% echo "to be or not to be" | dbacl -U -c twain -c shake

752

.br

753

shake # 34%

754

.ad

755

.PP

756

Note that the true probability of category

757

.I shake

758

versus category

759

.I twain

760

is 77.37%, but the calculation is somewhat ambiguous, and 34% is the confidence out of 100% that the calculation is qualitatively correct.
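
The relation between the -n scores and the -N percentages in the example above can be checked with a short calculation. This sketch assumes that the scores are negative base-2 logarithms of unnormalized probabilities (an assumption; the base is not stated in this excerpt) and that each score is the per-token value times the token count. The function name is illustrative.

```python
from typing import Dict


def posterior_percentages(scores: Dict[str, float], log_base: float = 2.0) -> Dict[str, float]:
    """Turn negative-log scores (smaller is better) into percentages summing to 100."""
    best = min(scores.values())
    # Subtract the best score first so the powers stay in a safe numeric range.
    weights = {name: log_base ** -(score - best) for name, score in scores.items()}
    total = sum(weights.values())
    return {name: 100.0 * weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Scores from the -n example: per-token score times token count.
    scores = {"twain": 7.04 * 6.0, "shake": 6.74 * 6.0}
    for name, pct in posterior_percentages(scores).items():
        print(f"{name} {pct:.2f}%")
```

Under the base-2 assumption this gives a bit under 78% for shake, close to the 77.37% printed by -N; the remaining gap is presumably due to the two-decimal rounding of the displayed scores.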

761

.PP

711

762

Suppose a file document.txt contains English text lines interspersed

712

763

with noise lines. To filter out the noise lines from the English lines,

713

764

assuming you have an existing category shake say, type:

888

939

documents is sufficient for highly accurate results for years.

889

940

Continual learning after such a critical mass results in diminishing returns.

890

941

Of course, when real world input document patterns change

891

dramatically, the predictive power of the models can be lost.

942

dramatically, the predictive power of the models can be lost. At the other

943

end, a few hundred documents already give acceptable results in most cases.

892

944

.PP

893

945

.B dbacl

894

946

is heavily optimized for the case of frequent classifications but infrequent

933

985

When classifying a document,

934

986

.B dbacl

935

987

loads all indicated categories into RAM, so the total memory needed is

936

approximately the sum of the category file sizes plus a fixed overhead.

988

approximately the sum of the category file sizes plus a fixed small overhead.

937

989

The input document is consumed while being read, so its size doesn't matter,

938

990

but very long lines can take up space.
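
A rough estimate of the memory needed for a classification run follows directly from this description: add up the category file sizes. The helper below is a sketch; `overhead_bytes` is a hypothetical parameter standing in for the fixed overhead, whose actual size is not given here.

```python
import os
from typing import Iterable


def estimate_classification_memory(category_files: Iterable[str], overhead_bytes: int = 0) -> int:
    """Approximate bytes of RAM needed to load the given category files.

    Sums the on-disk sizes and adds a caller-supplied overhead
    (assumption: the real fixed overhead is unknown and must be guessed).
    """
    return sum(os.path.getsize(path) for path in category_files) + overhead_bytes


if __name__ == "__main__":
    import sys

    # Usage: python estimate.py twain shake
    print(estimate_classification_memory(sys.argv[1:]))
```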

939

991

When using the

1053

1105

.BR awk (1),

1054

1106

.BR bayesol (1),

1055

1107

.BR crontab (1),

1108

.BR hmine (1),

1109

.BR hypex (1),

1056

1110

.BR less (1),

1057

1111

.BR mailcross (1),

1058

1112

.BR mailfoot (1),
