~ubuntu-branches/ubuntu/natty/spamassassin/natty


Viewing changes to .pc/50_sa-learn_fix_empty_list_handling/sa-learn.raw

Committer: Package Import Robot
Author(s): Noah Meyerhans
Date: 2010-03-21 23:20:31 UTC
mfrom: (0.4.1) (1.4.1) (29.2.1 maverick)
Revision ID: package-import@ubuntu.com-20100321232031-ryqjxh9cx27epnka

Tags: 3.3.1-1

* New upstream version.
* Update several patches now that bugfixes have been incorporated
upstream.

files added:
.pc

.pc/.version

.pc/10_change_config_paths

.pc/10_change_config_paths/INSTALL

.pc/10_change_config_paths/README

.pc/10_change_config_paths/UPGRADE

.pc/10_change_config_paths/USAGE

.pc/10_change_config_paths/ldap

.pc/10_change_config_paths/ldap/README

.pc/10_change_config_paths/lib

.pc/10_change_config_paths/lib/Mail

.pc/10_change_config_paths/lib/Mail/SpamAssassin

.pc/10_change_config_paths/lib/Mail/SpamAssassin/Conf.pm

.pc/10_change_config_paths/lib/Mail/SpamAssassin/Plugin

.pc/10_change_config_paths/lib/Mail/SpamAssassin/Plugin/Test.pm

.pc/10_change_config_paths/lib/spamassassin-run.pod

.pc/10_change_config_paths/rules

.pc/10_change_config_paths/rules/user_prefs.template

.pc/10_change_config_paths/sa-compile.raw

.pc/10_change_config_paths/sa-learn.raw

.pc/10_change_config_paths/spamc

.pc/10_change_config_paths/spamc/spamc.pod

.pc/10_change_config_paths/spamd

.pc/10_change_config_paths/spamd/README

.pc/10_change_config_paths/spamd/README.vpopmail

.pc/10_change_config_paths/spamd/spamd.raw

.pc/10_change_config_paths/sql

.pc/10_change_config_paths/sql/README

.pc/10_change_config_paths/sql/README.awl

.pc/10_change_config_paths/t

.pc/10_change_config_paths/t/data

.pc/10_change_config_paths/t/data/testplugin.pm

.pc/20_edit_spamc_pod

.pc/20_edit_spamc_pod/spamc

.pc/20_edit_spamc_pod/spamc/spamc.pod

.pc/30_edit_README

.pc/30_edit_README/README

.pc/50_sa-learn_fix_empty_list_handling

.pc/50_sa-learn_fix_empty_list_handling/sa-learn.raw

.pc/60_fix-pod

.pc/60_fix-pod/lib

.pc/60_fix-pod/lib/Mail

.pc/60_fix-pod/lib/Mail/SpamAssassin

.pc/60_fix-pod/lib/Mail/SpamAssassin/Bayes.pm

.pc/60_fix-pod/spamassassin.raw

.pc/70_fix-whatis

.pc/70_fix-whatis/lib

.pc/70_fix-whatis/lib/Mail

.pc/70_fix-whatis/lib/Mail/SpamAssassin

.pc/70_fix-whatis/lib/Mail/SpamAssassin/Plugin

.pc/70_fix-whatis/lib/Mail/SpamAssassin/Plugin/OneLineBodyRuleType.pm

.pc/70_fix-whatis/lib/Mail/SpamAssassin/Util

.pc/70_fix-whatis/lib/Mail/SpamAssassin/Util/DependencyInfo.pm

.pc/70_fix-whatis/lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm

.pc/80_fix_man_warnings

.pc/80_fix_man_warnings/lib

.pc/80_fix_man_warnings/lib/Mail

.pc/80_fix_man_warnings/lib/Mail/SpamAssassin

.pc/80_fix_man_warnings/lib/Mail/SpamAssassin/Conf.pm

.pc/80_fix_man_warnings/sa-learn.raw

.pc/applied-patches

pkgrules/20_aux_tlds.cf

t/data/spam/dnsbl_domsonly.eml

t/data/spam/dnsbl_ipsonly.eml

t/uribl_all_types.t

t/uribl_domains_only.t

t/uribl_ips_only.t

files removed:
debian/patches/.dpkg-source-applied

debian/patches/81_fix_man_speling

files modified:
CREDITS

Changes

MANIFEST

MANIFEST.SKIP

META.yml

Makefile.PL

debian/NEWS

debian/changelog

debian/control

debian/patches/10_change_config_paths

debian/patches/20_edit_spamc_pod

debian/patches/60_fix-pod

debian/patches/70_fix-whatis

debian/patches/80_fix_man_warnings

debian/patches/series

debian/rules

debian/spamassassin.cron.daily

debian/spamassassin.dirs

debian/spamassassin.postinst

lib/Mail/SpamAssassin.pm

lib/Mail/SpamAssassin/AICache.pm

lib/Mail/SpamAssassin/BayesStore/DBM.pm

lib/Mail/SpamAssassin/BayesStore/PgSQL.pm

lib/Mail/SpamAssassin/Conf.pm

lib/Mail/SpamAssassin/Conf/LDAP.pm

lib/Mail/SpamAssassin/Conf/Parser.pm

lib/Mail/SpamAssassin/Conf/SQL.pm

lib/Mail/SpamAssassin/Dns.pm

lib/Mail/SpamAssassin/Logger/Stderr.pm

lib/Mail/SpamAssassin/Logger/Syslog.pm

lib/Mail/SpamAssassin/Message.pm

lib/Mail/SpamAssassin/Message/Metadata.pm

lib/Mail/SpamAssassin/Message/Node.pm

lib/Mail/SpamAssassin/Plugin.pm

lib/Mail/SpamAssassin/Plugin/ASN.pm

lib/Mail/SpamAssassin/Plugin/Bayes.pm

lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm

lib/Mail/SpamAssassin/Plugin/Check.pm

lib/Mail/SpamAssassin/Plugin/DCC.pm

lib/Mail/SpamAssassin/Plugin/DKIM.pm

lib/Mail/SpamAssassin/Plugin/ImageInfo.pm

lib/Mail/SpamAssassin/Plugin/PhishTag.pm

lib/Mail/SpamAssassin/Plugin/Shortcircuit.pm

lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm

lib/Mail/SpamAssassin/Plugin/VBounce.pm

lib/Mail/SpamAssassin/SpamdForkScaling.pm

lib/Mail/SpamAssassin/SubProcBackChannel.pm

lib/Mail/SpamAssassin/Timeout.pm

lib/Mail/SpamAssassin/Util.pm

lib/Mail/SpamAssassin/Util/DependencyInfo.pm

lib/Mail/SpamAssassin/Util/Progress.pm

lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm

lib/spamassassin-run.pod

pkgrules/25_uribl.cf

pkgrules/50_scores.cf

pkgrules/72_active.cf

sa-update.raw

spamassassin.spec

spamd-apache2/lib/Mail/SpamAssassin/Spamd.pm

spamd-apache2/lib/Mail/SpamAssassin/Spamd/Apache2/AclRFC1413.pm

spamd/spamd.raw

t/README

t/SATest.pm

t/data/spam/dnsbl.eml

t/data/taintcheckplugin.pm

t/data/testplugin.pm

t/data/testplugin2.pm

t/dkim2.t

t/dnsbl.t

t/dnsbl_sc_meta.t

t/uribl.t


.pc/50_sa-learn_fix_empty_list_handling/sa-learn.raw

#!/usr/bin/perl -w -T
# <@LICENSE>
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to you under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at:
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# </@LICENSE>

use strict;
use bytes;

use Errno qw(EBADF);
use Getopt::Long;
use Pod::Usage;
use File::Spec;

use vars qw(
  $spamtest %opt $isspam $forget
  $messagecount $learnedcount $messagelimit
  $progress $total_messages $init_results $start_time
  $synconly $learnprob @targets $bayes_override_path
);

my $PREFIX = '@@PREFIX@@'; # substituted at 'make' time
my $DEF_RULES_DIR = '@@DEF_RULES_DIR@@'; # substituted at 'make' time
my $LOCAL_RULES_DIR = '@@LOCAL_RULES_DIR@@'; # substituted at 'make' time

use lib '@@INSTALLSITELIB@@'; # substituted at 'make' time

BEGIN { # see comments in "spamassassin.raw" for doco
  my @bin = File::Spec->splitpath($0);
  my $bin = ($bin[0] ? File::Spec->catpath(@bin[0..1]) : $bin[1])
            || File::Spec->curdir;

  if (-e $bin.'/lib/Mail/SpamAssassin.pm'
        || !-e '@@INSTALLSITELIB@@/Mail/SpamAssassin.pm' )
  {
    my $searchrelative;
    $searchrelative = 1; # disabled during "make install": REMOVEFORINST
    if ($searchrelative && $bin eq '../' && -e '../blib/lib/Mail/SpamAssassin.pm')
    {
      unshift ( @INC, '../blib/lib' );
    } else {
      foreach ( qw(lib ../lib/site_perl
                ../lib/spamassassin ../share/spamassassin/lib))
      {
        my $dir = File::Spec->catdir( $bin, split ( '/', $_ ) );
        if ( -f File::Spec->catfile( $dir, "Mail", "SpamAssassin.pm" ) )
        { unshift ( @INC, $dir ); last; }
      }
    }
  }
}

use Mail::SpamAssassin;
use Mail::SpamAssassin::ArchiveIterator;
use Mail::SpamAssassin::Message;
use Mail::SpamAssassin::PerMsgLearner;
use Mail::SpamAssassin::Util::Progress;
use Mail::SpamAssassin::Logger;

###########################################################################

$SIG{PIPE} = 'IGNORE';

# used to be CmdLearn::cmd_run() ...

%opt = (
  'force-expire' => 0,
  'use-ignores' => 0,
  'nosync' => 0,
  'cf' => []
);

Getopt::Long::Configure(
  qw(bundling no_getopt_compat
    permute no_auto_abbrev no_ignore_case)
);

GetOptions(
  'forget' => \$forget,
  'ham|nonspam' => sub { $isspam = 0; },
  'spam' => sub { $isspam = 1; },
  'sync' => \$synconly,
  'rebuild' => sub { $synconly = 1; warn "The --rebuild option has been deprecated. Please use --sync instead.\n" },

  'username|u=s' => \$opt{'username'},
  'configpath|config-file|config-dir|c|C=s' => \$opt{'configpath'},
  'prefspath|prefs-file|p=s' => \$opt{'prefspath'},
  'siteconfigpath=s' => \$opt{'siteconfigpath'},
  'cf=s' => \@{$opt{'cf'}},

  'folders|f=s' => \$opt{'folders'},
  'force-expire|expire' => \$opt{'force-expire'},
  'local|L' => \$opt{'local'},
  'no-sync|nosync' => \$opt{'nosync'},
  'showdots' => \$opt{'showdots'},
  'progress' => \$opt{'progress'},
  'use-ignores' => \$opt{'use-ignores'},
  'no-rebuild|norebuild' => sub { $opt{'nosync'} = 1; warn "The --no-rebuild option has been deprecated. Please use --no-sync instead.\n" },

  'learnprob=f' => \$opt{'learnprob'},
  'randseed=i' => \$opt{'randseed'},
  'stopafter=i' => \$opt{'stopafter'},

  'debug|debug-level|D:s' => \$opt{'debug'},
  'help|h|?' => \$opt{'help'},
  'version|V' => \$opt{'version'},

  'dump:s' => \$opt{'dump'},
  'import' => \$opt{'import'},

  'backup' => \$opt{'backup'},
  'clear' => \$opt{'clear'},
  'restore=s' => \$opt{'restore'},

  'dir' => sub { $opt{'old_format'} = 'dir'; },
  'file' => sub { $opt{'old_format'} = 'file'; },
  'mbox' => sub { $opt{'format'} = 'mbox'; },
  'mbx' => sub { $opt{'format'} = 'mbx'; },
  'single' => sub { $opt{'old_format'} = 'single'; },

  'db|dbpath=s' => \$bayes_override_path,
  're|regexp=s' => \$opt{'regexp'},

  '<>' => \&target,
)
  or usage( 0, "Unknown option!" );

if ( defined $opt{'help'} ) {
  usage( 0, "For more information read the manual page" );
}
if ( defined $opt{'version'} ) {
  print "SpamAssassin version " . Mail::SpamAssassin::Version() . "\n";
  exit 0;
}

# set debug areas, if any specified (only useful for command-line tools)
if (defined $opt{'debug'}) {
  $opt{'debug'} ||= 'all';
}

if ( $opt{'force-expire'} ) {
  $synconly = 1;
}

if ($opt{'showdots'} && $opt{'progress'}) {
  print "--showdots and --progress may not be used together, please select just one\n";
  exit 0;
}

if ( !defined $isspam
  && !defined $synconly
  && !defined $forget
  && !defined $opt{'dump'}
  && !defined $opt{'import'}
  && !defined $opt{'clear'}
  && !defined $opt{'backup'}
  && !defined $opt{'restore'}
  && !defined $opt{'folders'} )
{
  usage( 0,
    "Please select either --spam, --ham, --folders, --forget, --sync, --import,\n--dump, --clear, --backup or --restore"
  );
}

# We need to make sure the journal syncs pre-forget...
if ( defined $forget && $opt{'nosync'} ) {
  $opt{'nosync'} = 0;
  warn
    "sa-learn warning: --forget requires read/write access to the database, and is incompatible with --no-sync\n";
}

if ( defined $opt{'old_format'} ) {

  #Format specified in the 2.5x form of --dir, --file, --mbox, --mbx or --single.
  #Convert it to the new behavior:
  if ( $opt{'old_format'} eq 'single' ) {
    push ( @ARGV, '-' );
  }
}

my $post_config = '';

# kluge to support old check_bayes_db operation
# bug 3799: init() will go r/o with the configured DB, and then dbpath needs
# to override. Just access the dbpath version via post_config_text.
if ( defined $bayes_override_path ) {
  # Add a default prefix if the path is a directory
  if ( -d $bayes_override_path ) {
    $bayes_override_path = File::Spec->catfile( $bayes_override_path, 'bayes' );
  }

  $post_config .= "bayes_path $bayes_override_path\n";
}

# These options require bayes_scanner, which requires "use_bayes 1", but
# that's not necessary for these commands.
if (defined $opt{'dump'} || defined $opt{'import'} || defined $opt{'clear'} ||
    defined $opt{'backup'} || defined $opt{'restore'}) {
  $post_config .= "use_bayes 1\n";
}

$post_config .= join("\n", @{$opt{'cf'}})."\n";

# create the tester factory
$spamtest = new Mail::SpamAssassin(
  {
    rules_filename => $opt{'configpath'},
    site_rules_filename => $opt{'siteconfigpath'},
    userprefs_filename => $opt{'prefspath'},
    username => $opt{'username'},
    debug => $opt{'debug'},
    local_tests_only => $opt{'local'},
    dont_copy_prefs => 1,
    PREFIX => $PREFIX,
    DEF_RULES_DIR => $DEF_RULES_DIR,
    LOCAL_RULES_DIR => $LOCAL_RULES_DIR,
    post_config_text => $post_config,
  }
);

$spamtest->init(1);
dbg("sa-learn: spamtest initialized");

# Bug 6228 hack: bridge the transition gap of moving Bayes.pm into a plugin;
# To be resolved more cleanly!!!
if ($spamtest->{bayes_scanner}) {
  foreach my $plugin ( @{ $spamtest->{plugins}->{plugins} } ) {
    if ($plugin->isa('Mail::SpamAssassin::Plugin::Bayes')) {
      # copy plugin's "store" object ref one level up!
      $spamtest->{bayes_scanner}->{store} = $plugin->{store};
    }
  }
}

if (Mail::SpamAssassin::Util::am_running_on_windows()) {
  binmode(STDIN) or die "cannot set binmode on STDIN: $!"; # bug 4363
  binmode(STDOUT) or die "cannot set binmode on STDOUT: $!";
}

if ( defined $opt{'dump'} ) {
  my ( $magic, $toks );

  if ( $opt{'dump'} eq 'all' || $opt{'dump'} eq '' ) { # show us all tokens!
    ( $magic, $toks ) = ( 1, 1 );
  }
  elsif ( $opt{'dump'} eq 'magic' ) { # show us magic tokens only
    ( $magic, $toks ) = ( 1, 0 );
  }
  elsif ( $opt{'dump'} eq 'data' ) { # show us data tokens only
    ( $magic, $toks ) = ( 0, 1 );
  }
  else { # unknown option
    warn "Unknown dump option '" . $opt{'dump'} . "'\n";
    $spamtest->finish_learner();
    exit 1;
  }

  if (!$spamtest->dump_bayes_db( $magic, $toks, $opt{'regexp'}) ) {
    $spamtest->finish_learner();
    die "ERROR: Bayes dump returned an error, please re-run with -D for more information\n";
  }

  $spamtest->finish_learner();
  # make sure we notice any write errors while flushing output buffer
  close STDOUT or die "error closing STDOUT: $!";
  close STDIN or die "error closing STDIN: $!";
  exit 0;
}

if ( defined $opt{'import'} ) {
  my $ret = $spamtest->{bayes_scanner}->{store}->perform_upgrade();
  $spamtest->finish_learner();
  # make sure we notice any write errors while flushing output buffer
  close STDOUT or die "error closing STDOUT: $!";
  close STDIN or die "error closing STDIN: $!";
  exit( !$ret );
}

if (defined $opt{'clear'}) {
  unless ($spamtest->{bayes_scanner}->{store}->clear_database()) {
    $spamtest->finish_learner();
    die "ERROR: Bayes clear returned an error, please re-run with -D for more information\n";
  }

  $spamtest->finish_learner();
  # make sure we notice any write errors while flushing output buffer
  close STDOUT or die "error closing STDOUT: $!";
  close STDIN or die "error closing STDIN: $!";
  exit 0;
}

if (defined $opt{'backup'}) {
  unless ($spamtest->{bayes_scanner}->{store}->backup_database()) {
    $spamtest->finish_learner();
    die "ERROR: Bayes backup returned an error, please re-run with -D for more information\n";
  }

  $spamtest->finish_learner();
  # make sure we notice any write errors while flushing output buffer
  close STDOUT or die "error closing STDOUT: $!";
  close STDIN or die "error closing STDIN: $!";
  exit 0;
}

if (defined $opt{'restore'}) {

  my $filename = $opt{'restore'};

  unless ($filename) {
    $spamtest->finish_learner();
    die "ERROR: You must specify a filename to restore.\n";
  }

  unless ($spamtest->{bayes_scanner}->{store}->restore_database($filename, $opt{'showdots'})) {
    $spamtest->finish_learner();
    die "ERROR: Bayes restore returned an error, please re-run with -D for more information\n";
  }

  $spamtest->finish_learner();
  # make sure we notice any write errors while flushing output buffer
  close STDOUT or die "error closing STDOUT: $!";
  close STDIN or die "error closing STDIN: $!";
  exit 0;
}

if ( !$spamtest->{conf}->{use_bayes} ) {
  warn "ERROR: configuration specifies 'use_bayes 0', sa-learn disabled\n";
  exit 1;
}

$spamtest->init_learner(
  {
    force_expire => $opt{'force-expire'},
    learn_to_journal => $opt{'nosync'},
    wait_for_lock => 1,
    caller_will_untie => 1
  }
);

$spamtest->{bayes_scanner}{use_ignores} = $opt{'use-ignores'};

if ($synconly) {
  $spamtest->rebuild_learner_caches(
    {
      verbose => 1,
      showdots => $opt{'showdots'}
    }
  );
  $spamtest->finish_learner();
  # make sure we notice any write errors while flushing output buffer
  close STDOUT or die "error closing STDOUT: $!";
  close STDIN or die "error closing STDIN: $!";
  exit 0;
}

$messagelimit = $opt{'stopafter'};
$learnprob = $opt{'learnprob'};

if ( defined $opt{'randseed'} ) {
  srand( $opt{'randseed'} );
}

# sync the journal first if we're going to go r/w so we make sure to
# learn everything before doing anything else.

if ( !$opt{nosync} ) {
  $spamtest->rebuild_learner_caches();
}

# what is the result of the run? will end up being the exit code.
my $exit_status = 0;

# run this lot in an eval block, so we can catch die's and clear
# up the dbs.
eval {
  $SIG{HUP} = \&killed;
  $SIG{INT} = \&killed;
  $SIG{TERM} = \&killed;

  if ( $opt{folders} ) {
    open( F, $opt{folders} ) or die "cannot open $opt{folders}: $!";
    for ($!=0; <F>; $!=0) {
      chomp;
      next if /^\s*$/;
      if (/^(?:ham|spam):\w*:/) {
        push ( @targets, $_ );
      }
      else {
        target($_);
      }
    }
    defined $_ || $!==0 or
      $!==EBADF ? dbg("error reading from $opt{folders}: $!")
                : die "error reading from $opt{folders}: $!";
    close(F) or die "error closing $opt{folders}: $!";
  }

  ###########################################################################
  # Deal with the target listing, and STDIN -> tempfile

  my $tempfile; # will be defined if stdin -> tempfile
  push(@targets, @ARGV);
  @targets = ('-') unless @targets;

  for(my $elem = 0; $elem <= $#targets; $elem++) {
    # ArchiveIterator doesn't really like STDIN, so if "-" is specified
    # as a target, make it a temp file instead.
    if ( $targets[$elem] =~ /(?:^|:)-$/ ) {
      if (defined $tempfile) {
        # uh-oh, stdin specified multiple times?
        warn "skipping extra stdin target (".$targets[$elem].")\n";
        splice @targets, $elem, 1;
        $elem--; # go back to this element again
        next;
      }
      else {
        my $handle;
        ( $tempfile, $handle ) = Mail::SpamAssassin::Util::secure_tmpfile();
        binmode $handle or die "cannot set binmode on file $tempfile: $!";

        # avoid slurping the whole file into memory, copy chunk by chunk
        my($inbuf,$nread);
        while ( $nread=sysread(STDIN,$inbuf,16384) )
          { print {$handle} $inbuf or die "error writing to $tempfile: $!" }
        defined $nread or die "error reading from STDIN: $!";
        close $handle or die "error closing $tempfile: $!";

        # re-aim the targets at the tempfile instead of STDIN
        $targets[$elem] =~ s/-$/$tempfile/;
      }
    }

    # make sure the target list is in the normal AI format
    if ($targets[$elem] !~ /^[^:]*:[a-z]+:/) {
      my $item = splice @targets, $elem, 1;
      target($item); # add back to the list
      $elem--; # go back to this element again
      next;
    }
  }

  ###########################################################################

  my $iter = new Mail::SpamAssassin::ArchiveIterator(
    {
      'opt_all' => 0, # skip messages over 250k
      'opt_want_date' => 0,
    }
  );

  $iter->set_functions(\&wanted, \&result);
  $messagecount = 0;
  $learnedcount = 0;

  $init_results = 0;
  $start_time = time;

  # if exit_status isn't already set to non-zero, set it to the reverse of the
  # run result (0 is bad, 1+ is good -- the opposite of exit status codes)
  eval { $exit_status ||= ! $iter->run(@targets); };

  print STDERR "\n" if ($opt{showdots});
  $progress->final() if ($opt{progress} && $progress);

  my $phrase = defined $forget ? "Forgot" : "Learned";
  print "$phrase tokens from $learnedcount message(s) ($messagecount message(s) examined)\n";

  # If we needed to make a tempfile, go delete it.
  if (defined $tempfile) {
    unlink $tempfile or die "cannot unlink temporary file $tempfile: $!";
    undef $tempfile;
  }

  if ($@) { die $@ unless ( $@ =~ /HITLIMIT/ ); }

} or do {
  my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
  $spamtest->finish_learner();
  die $eval_stat;
};

$spamtest->finish_learner();
# make sure we notice any write errors while flushing output buffer
close STDOUT or die "error closing STDOUT: $!";
close STDIN or die "error closing STDIN: $!";
exit $exit_status;

###########################################################################

sub killed {
  $spamtest->finish_learner();
  die "interrupted";
}

sub target {
  my ($target) = @_;

  my $class = ( $isspam ? "spam" : "ham" );
  my $format = ( defined( $opt{'format'} ) ? $opt{'format'} : "detect" );

  push ( @targets, "$class:$format:$target" );
}

###########################################################################

sub init_results {
  $init_results = 1;

  return unless $opt{'progress'};

  $total_messages = $Mail::SpamAssassin::ArchiveIterator::MESSAGES;

  $progress = Mail::SpamAssassin::Util::Progress->new({total => $total_messages,});
}

###########################################################################

sub result {
  my ($class, $result, $time) = @_;

  # don't open results files until we get here to avoid overwriting files
  &init_results if !$init_results;

  $progress->update($messagecount) if ($opt{progress} && $progress);
}

###########################################################################

sub wanted {
  my ( $class, $id, $time, $dataref ) = @_;

  my $spam = $class eq "s" ? 1 : 0;

  if ( defined($learnprob) ) {
    if ( int( rand( 1 / $learnprob ) ) != 0 ) {
      print STDERR '_' if ( $opt{showdots} );
      return 1;
    }
  }

  if ( defined($messagelimit) && $learnedcount > $messagelimit ) {
    $progress->final() if ($opt{progress} && $progress);
    die 'HITLIMIT';
  }

  $messagecount++;
  my $ma = $spamtest->parse($dataref);

  if ( $ma->get_header("X-Spam-Checker-Version") ) {
    my $new_ma = $spamtest->parse($spamtest->remove_spamassassin_markup($ma), 1);
    $ma->finish();
    $ma = $new_ma;
  }

  my $status = $spamtest->learn( $ma, undef, $spam, $forget );
  my $learned = $status->did_learn();

  if ( !defined $learned ) { # undef=learning unavailable
    die "ERROR: the Bayes learn function returned an error, please re-run with -D for more information\n";
  }
  elsif ( $learned == 1 ) { # 1=message was learned. 0=message wasn't learned
    $learnedcount++;
  }

  # Do cleanup ...
  $status->finish();
  undef $status;

  $ma->finish();
  undef $ma;

  print STDERR '.' if ( $opt{showdots} );
  return 1;
}

###########################################################################

sub usage {
  my ( $verbose, $message ) = @_;
  my $ver = Mail::SpamAssassin::Version();
  print "SpamAssassin version $ver\n";
  pod2usage( -verbose => $verbose, -message => $message, -exitval => 64 );
}

# ---------------------------------------------------------------------------

=head1 NAME

sa-learn - train SpamAssassin's Bayesian classifier

=head1 SYNOPSIS

B<sa-learn> [options] [file]...

B<sa-learn> [options] --dump [ all | data | magic ]

 Options:

    --ham                 Learn messages as ham (non-spam)
    --spam                Learn messages as spam
    --forget              Forget a message
    --use-ignores         Use bayes_ignore_from and bayes_ignore_to
    --sync                Synchronize the database and the journal if needed
    --force-expire        Force a database sync and expiry run
    --dbpath <path>       Allows commandline override (in bayes_path form)
                          for where to read the Bayes DB from
    --dump [all|data|magic]  Display the contents of the Bayes database
                          Takes optional argument for what to display
      --regexp <re>       For dump only, specifies which tokens to
                          dump based on a regular expression.
    -f file, --folders=file  Read list of files/directories from file
    --dir                 Ignored; historical compatibility
    --file                Ignored; historical compatibility
    --mbox                Input sources are in mbox format
    --mbx                 Input sources are in mbx format
    --showdots            Show progress using dots
    --progress            Show progress using progress bar
    --no-sync             Skip synchronizing the database and journal
                          after learning
    -L, --local           Operate locally, no network accesses
    --import              Migrate data from older version/non DB_File
                          based databases
    --clear               Wipe out existing database
    --backup              Backup, to STDOUT, existing database
    --restore <filename>  Restore a database from filename
    -u username, --username=username
                          Override username taken from the runtime
                          environment, used with SQL
    -C path, --configpath=path, --config-file=path
                          Path to standard configuration dir
    -p prefs, --prefspath=file, --prefs-file=file
                          Set user preferences file
    --siteconfigpath=path Path for site configs
                          (default: /etc/spamassassin)
    --cf='config line'    Additional line of configuration
    -D, --debug [area=n,...]  Print debugging messages
    -V, --version         Print version
    -h, --help            Print usage message

=head1 DESCRIPTION

Given a typical selection of your incoming mail classified as spam or ham
(non-spam), this tool will feed each mail to SpamAssassin, allowing it
to 'learn' what signs are likely to mean spam, and which are likely to
mean ham.

Simply run this command once for each of your mail folders, and it will
''learn'' from the mail therein.

Note that csh-style I<globbing> in the mail folder names is supported;
in other words, listing a folder name as C<*> will scan every folder
that matches. See C<Mail::SpamAssassin::ArchiveIterator> for more details.

SpamAssassin remembers which mail messages it has learnt already, and will not
re-learn those messages again, unless you use the B<--forget> option. Messages
learnt as spam will have SpamAssassin markup removed, on the fly.

If you make a mistake and scan a mail as ham when it is spam, or vice
versa, simply rerun this command with the correct classification, and the
mistake will be corrected. SpamAssassin will automatically 'forget' the
previous indications.

Users of C<spamd> who wish to perform training remotely, over a network,
should investigate the C<spamc -L> switch.

=head1 OPTIONS

=over 4

=item B<--ham>

Learn the input message(s) as ham. If you have previously learnt any of the
messages as spam, SpamAssassin will forget them first, then re-learn them as
ham. Alternatively, if you have previously learnt them as ham, it'll skip them
this time around. If the messages have already been filtered through
SpamAssassin, the learner will ignore any modifications SpamAssassin may have
made.

=item B<--spam>

Learn the input message(s) as spam. If you have previously learnt any of the
messages as ham, SpamAssassin will forget them first, then re-learn them as
spam. Alternatively, if you have previously learnt them as spam, it'll skip
them this time around. If the messages have already been filtered through
SpamAssassin, the learner will ignore any modifications SpamAssassin may have
made.

=item B<--folders>=I<filename>, B<-f> I<filename>

sa-learn will read in the list of folders from the specified file, one folder
per line in the file. If the folder is prefixed with C<ham:type:> or C<spam:type:>,
sa-learn will learn that folder appropriately, otherwise the folders will be
assumed to be of the type specified by B<--ham> or B<--spam>.

C<type> above is optional, but is the same as the standard for
ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
specified).
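
The handling of one line from a B<--folders> file can be sketched in isolation. The snippet below is only an illustration modelled on the script's folder-reading loop and its C<target> subroutine; the helper name C<folder_line_to_target> and the example paths are assumptions made for this sketch and are not part of sa-learn.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of sa-learn): turn one line of a --folders
# file into a "class:format:path" target string.
sub folder_line_to_target {
  my ($line, $isspam, $format) = @_;
  chomp $line;
  return if $line =~ /^\s*$/;                     # blank lines are skipped
  return $line if $line =~ /^(?:ham|spam):\w*:/;  # already prefixed, keep as-is
  my $class = $isspam ? 'spam' : 'ham';           # fall back to --spam / --ham
  $format = 'detect' unless defined $format;      # no --mbox / --mbx given
  return "$class:$format:$line";
}

# Example paths are made up for illustration.
print folder_line_to_target("spam:mbox:/var/mail/junk\n", 0), "\n";
print folder_line_to_target("/home/user/Maildir/cur\n", 1), "\n";
```

A prefixed line keeps its own class and type regardless of B<--ham> or B<--spam>; only unprefixed lines pick up the command-line classification.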

=item B<--mbox>

sa-learn will read in the file(s) containing the emails to be learned,
and will process them in mbox format (one or more emails per file).

=item B<--mbx>

sa-learn will read in the file(s) containing the emails to be learned,
and will process them in mbx format (one or more emails per file).

=item B<--use-ignores>

Don't learn the message if a from address matches configuration file
item C<bayes_ignore_from> or a to address matches C<bayes_ignore_to>.
The option might be used when learning from a large file of messages
from which the hammy spam messages or spammy ham messages have not
been removed.

=item B<--sync>

Synchronize the journal and databases. Upon successfully syncing the
database with the entries in the journal, the journal file is removed.

=item B<--force-expire>

Forces an expiry attempt, regardless of whether it may be necessary
or not. Note: This doesn't mean any tokens will actually expire.
Please see the EXPIRATION section below.

Note: C<--force-expire> also causes the journal data to be synchronized
into the Bayes databases.

=item B<--forget>

Forget a given message previously learnt.

=item B<--dbpath>

Allows a commandline override of the I<bayes_path> configuration option.
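
In the script itself, a B<--dbpath> value that names an existing directory gets the file prefix C<bayes> appended before it is used as I<bayes_path>. A minimal sketch of that step follows; the helper name C<bayes_path_for> and the example path are assumptions for illustration only.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of sa-learn): derive the bayes_path value
# from a --dbpath argument, as the script does before building its config.
sub bayes_path_for {
  my ($path) = @_;
  # A directory gets the default "bayes" file prefix appended;
  # anything else is taken as a full bayes_path already.
  $path = File::Spec->catfile($path, 'bayes') if -d $path;
  return $path;
}

# The path below is made up; it is returned unchanged unless it is a directory.
print "bayes_path ", bayes_path_for('/no/such/dir/for-sa-learn-example'), "\n";
```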

=item B<--dump> I<option>

Display the contents of the Bayes database. Without an option or with
the I<all> option, all magic tokens and data tokens will be displayed.
I<magic> will only display magic tokens, and I<data> will only display
the data tokens.

Can also use the B<--regexp> I<RE> option to specify which tokens to
display based on a regular expression.
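
The mapping from the B<--dump> argument to what gets displayed can be written as a small lookup. This sketch mirrors the branches in the script; the helper name C<dump_flags> is hypothetical, and the returned pair stands for (show magic tokens, show data tokens).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of sa-learn): map the --dump argument to a
# (magic, data) pair of flags; an empty list means "unknown option".
sub dump_flags {
  my ($opt) = @_;
  return (1, 1) if $opt eq 'all' || $opt eq '';  # no argument behaves like "all"
  return (1, 0) if $opt eq 'magic';              # magic tokens only
  return (0, 1) if $opt eq 'data';               # data tokens only
  return;                                        # sa-learn warns and exits 1 here
}

for my $arg ('', 'all', 'magic', 'data') {
  my ($magic, $toks) = dump_flags($arg);
  print "'$arg': magic=$magic data=$toks\n";
}
```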

=item B<--clear>

Clear an existing Bayes database by removing all traces of the database.

WARNING: This is destructive and should be used with care.

=item B<--backup>

Performs a dump of the Bayes database in machine/human readable format.

The dump will include token and seen data. It is suitable for input back
into the B<--restore> command.

=item B<--restore>=I<filename>

Performs a restore of the Bayes database defined by I<filename>.

WARNING: This is a destructive operation; previous Bayes data will be wiped out.

=item B<-h>, B<--help>

Print help message and exit.

=item B<-u> I<username>, B<--username>=I<username>

If specified, this username will override the username taken from the runtime
environment. You can use this option to specify users in a virtual user
configuration when using SQL as the Bayes backend.

NOTE: This option will not switch to the given I<username>; it will only attempt
to act on behalf of that user. Because of this, you will need proper
permissions to change files owned by I<username>. In the case of SQL
this generally is not a problem.

=item B<-C> I<path>, B<--configpath>=I<path>, B<--config-file>=I<path>

Use the specified path for locating the distributed configuration files.
Ignore the default directories (usually C</usr/share/spamassassin> or similar).

=item B<--siteconfigpath>=I<path>

Use the specified path for locating site-specific configuration files. Ignore
the default directories (usually C</etc/spamassassin> or similar).

=item B<--cf='config line'>

Add additional lines of configuration directly from the command-line, parsed
after the configuration files are read. Multiple B<--cf> arguments can be
used, and each will be considered a separate line of configuration.

=item B<-p> I<prefs>, B<--prefspath>=I<prefs>, B<--prefs-file>=I<prefs>

Read user score preferences from I<prefs> (usually C<$HOME/.spamassassin/user_prefs>).

=item B<--progress>

Prints a progress bar (to STDERR) showing the current progress. In the case
where no valid terminal is found this option will behave very much like the
--showdots option.

=item B<-D> [I<area,...>], B<--debug> [I<area,...>]

Produce debugging output. If no areas are listed, all debugging information is
printed. Diagnostic output can also be enabled for each area individually;
I<area> is the area of the code to instrument. For example, to produce
diagnostic output on bayes, learn, and dns, use:

  spamassassin -D bayes,learn,dns

For more information about which areas (also known as channels) are available,
please see the documentation at:

C<http://wiki.apache.org/spamassassin/DebugChannels>

Higher priority informational messages that are suitable for logging in normal
circumstances are available with an area of "info".

=item B<--no-sync>

Skip the slow synchronization step which normally takes place after
changing database entries. If you plan to learn from many folders in
a batch, or to learn many individual messages one-by-one, it is faster
to use this switch and run C<sa-learn --sync> once all the folders have
been scanned.

Clarification: The state of I<--no-sync> overrides the
I<bayes_learn_to_journal> configuration option. If not specified,
sa-learn will learn to the database directly. If specified, sa-learn
will learn to the journal file.

Note: I<--sync> and I<--no-sync> can be specified on the same commandline,
which is slightly confusing. In this case, the I<--no-sync> option is
ignored since there is no learn operation.

=item B<-L>, B<--local>

Do not perform any network accesses while learning details about the mail
messages. This will speed up the learning process, but may result in a
slightly lower accuracy.

Note that this is currently ignored, as current versions of SpamAssassin will
not perform network access while learning; but future versions may.

=item B<--import>

If you previously used SpamAssassin's Bayesian learner without the C<DB_File>
module installed, it will have created files in other formats, such as
C<GDBM_File>, C<NDBM_File>, or C<SDBM_File>. This switch allows you to migrate
that old data into the C<DB_File> format. It will overwrite any data currently
in the C<DB_File>.

Can also be used with the B<--dbpath> I<path> option to specify the location of
the Bayes files to use.

=back

=head1 MIGRATION

There are now multiple backend storage modules available for storing
users' Bayesian data. As such, you might want to migrate from one
backend to another. Here is a simple procedure for doing so.

Note that if you have individual user databases you will have to
perform a similar procedure for each one of them.

=over 4

=item sa-learn --sync

This will sync any outstanding journal entries.

=item sa-learn --backup > backup.txt

This will save all your Bayes data to a plain text file.

=item sa-learn --clear

This is optional, but good to do to clear out the old database.

=item Repeat!

At this point, if you have multiple databases, you should perform the
procedure above for each of them. (i.e. each user's database needs to
be backed up before continuing.)

=item Switch backends

Once you have backed up all databases you can update your
configuration for the new database backend. This will involve at least
the C<bayes_store_module> config option and may involve some additional
config options depending on what is required by the module. (For
example, you may need to configure an SQL database.)

=item sa-learn --restore backup.txt

Again, you need to do this for every database.

=back

If you are migrating to SQL you can make use of the -u <username>
option in sa-learn to populate each user's database. Otherwise, you
must run sa-learn as the user whose database you are restoring.
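The migration steps could be scripted roughly as follows. This is only a
sketch in Python, assuming C<sa-learn> is on the PATH. The function names
C<backup_database> and C<restore_database>, and the injectable C<run>
parameter, are hypothetical helpers for illustration and are not part of
SpamAssassin. Only flags documented on this page are used.

```python
import subprocess


def backup_database(backup_path, clear_old=False, run=subprocess.run):
    """Sync the journal, dump the Bayes data to a text file, optionally clear."""
    run(["sa-learn", "--sync"], check=True)
    with open(backup_path, "w") as out:
        # Equivalent of: sa-learn --backup > backup.txt
        run(["sa-learn", "--backup"], stdout=out, check=True)
    if clear_old:
        # Optional step: remove the old database once it is backed up.
        run(["sa-learn", "--clear"], check=True)


def restore_database(backup_path, username=None, run=subprocess.run):
    """Restore a backup after the configuration points at the new backend."""
    cmd = ["sa-learn"]
    if username is not None:
        # Act on behalf of a given user, e.g. with an SQL backend.
        cmd += ["-u", username]
    cmd += ["--restore", backup_path]
    run(cmd, check=True)
```

Call C<backup_database> once per database before switching backends, then
C<restore_database> once per database afterwards. Remember that a restore
wipes out the previous Bayes data in the target database.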

=head1 INTRODUCTION TO BAYESIAN FILTERING

(Thanks to Michael Bell for this section!)

For a more lengthy description of how this works, go to
http://www.paulgraham.com/ and see "A Plan for Spam". It's reasonably
readable, even if statistics make me break out in hives.

The short semi-inaccurate version: Given training, a spam heuristics engine
can take the most "spammy" and "hammy" words and apply probabilistic
analysis. Furthermore, once given a basis for the analysis, the engine can
continue to learn iteratively by applying both the non-Bayesian and Bayesian
rulesets together to create evolving "intelligence".
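To make the idea of "spammy" and "hammy" words concrete, here is a
deliberately simplified sketch in Python, in the spirit of "A Plan for Spam".
It is not SpamAssassin's actual computation (which uses Gary Robinson's
combining algorithms, see SEE ALSO). The function names and the 0.01/0.99
clamp are assumptions made for this illustration.

```python
import math


def token_spam_probability(spam_count, ham_count, nspam, nham):
    """Estimate how "spammy" one token is from its training counts."""
    if nspam <= 0 or nham <= 0:
        raise ValueError("need at least one spam and one ham message")
    spam_freq = spam_count / nspam  # fraction of spam containing the token
    ham_freq = ham_count / nham     # fraction of ham containing the token
    if spam_freq + ham_freq == 0:
        return 0.5  # never-seen token: no evidence either way
    p = spam_freq / (spam_freq + ham_freq)
    return min(0.99, max(0.01, p))  # clamp so no single token is decisive


def combine(probabilities):
    """Naive-Bayes style combination of clamped per-token probabilities."""
    spam = math.prod(probabilities)
    ham = math.prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)
```

A token seen in 30 of 100 spam messages and 10 of 100 ham messages scores
about 0.75, and several spammy tokens together push the combined score
close to 1. This also shows why both spam and ham training are needed:
without counts for both classes there is nothing to compare.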

SpamAssassin 2.50 and later supports Bayesian spam analysis, in
the form of the BAYES rules. This is a new feature, quite powerful,
and is disabled until enough messages have been learnt.

The pros of Bayesian spam analysis:

=over 4

=item Can greatly reduce false positives and false negatives.

It learns from your mail, so it is tailored to your unique e-mail flow.

=item Once it starts learning, it can continue to learn from SpamAssassin
and improve over time.

=back

And the cons:

=over 4

=item A decent number of messages are required before results are useful
for ham/spam determination.

=item It's hard to explain why a message is or isn't marked as spam.

For example, a straightforward rule that matches, say, "VIAGRA" is
easy to understand. If it generates a false positive or false negative,
it is fairly easy to understand why.

With Bayesian analysis, it's all probabilities - "because the past says
it is likely as this falls into a probabilistic distribution common to past
spam in your systems". Tell that to your users! Tell that to the client
when he asks "what can I do to change this". (By the way, the answer in
this case is "use whitelisting".)

=item It will take disk space and memory.

The databases it maintains take quite a lot of resources to store and use.

=back

=head1 GETTING STARTED

Still interested? Ok, here are the guidelines for getting this working.

First a high-level overview:

=over 4

=item Build a significant sample of both ham and spam.

I suggest several thousand of each, placed in SPAM and HAM directories or
mailboxes. Yes, you MUST hand-sort this - otherwise the results won't be much
better than SpamAssassin on its own. Verify the spamminess/hamminess of EVERY
message. You're urged to avoid using a publicly available corpus (sample) -
this must be taken from YOUR mail server, if it is to be statistically useful.
Otherwise, the results may be pretty skewed.

=item Use this tool to teach SpamAssassin about these samples, like so:

  sa-learn --spam /path/to/spam/folder
  sa-learn --ham /path/to/ham/folder
  ...

Let SpamAssassin proceed, learning stuff. When it finds ham and spam
it will add the "interesting tokens" to the database.

=item If you need SpamAssassin to forget about specific messages, use
the B<--forget> option.

This can be applied to either ham or spam that has run through the
B<sa-learn> processes. It's a bit of a hammer, really, lowering the
weighting of the specific tokens in that message (only if that message has
been processed before).

=item Learning from single messages uses a command like this:

  sa-learn --ham --no-sync mailmessage

This is handy for binding to a key in your mail user agent. It's very fast, as
all the time-consuming stuff is deferred until you run with the C<--sync>
option.

=item Autolearning is enabled by default

If you don't have a corpus of mail saved to learn, you can let
SpamAssassin automatically learn the mail that you receive. If you are
autolearning from scratch, the amount of mail you receive will determine
how long until the BAYES_* rules are activated.

=back
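When training from several folders in a batch, the steps above can be
combined with C<--no-sync> and a single final C<--sync>. The following
Python sketch only builds the command lines. The helper name
C<training_commands> is hypothetical, and the folder paths in the usage
note are placeholders.

```python
def training_commands(spam_folders, ham_folders):
    """Build sa-learn invocations: learn everything first, sync once at the end."""
    commands = []
    for folder in spam_folders:
        commands.append(["sa-learn", "--spam", "--no-sync", folder])
    for folder in ham_folders:
        commands.append(["sa-learn", "--ham", "--no-sync", folder])
    # One synchronization after all folders have been scanned.
    commands.append(["sa-learn", "--sync"])
    return commands
```

Each returned list can be handed to C<subprocess.run(cmd, check=True)>, for
example for C<training_commands(["/path/to/spam/folder"], ["/path/to/ham/folder"])>.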

=head1 EFFECTIVE TRAINING

Learning filters require training to be effective. If you don't train
them, they won't work. In addition, you need to train them with new
messages regularly to keep them up-to-date, or their data will become
stale and accuracy will suffer.

You need to train with both spam I<and> ham mails. One type of mail
alone will not have any effect.

Note that if your mail folders contain things like forwarded spam,
discussions of spam-catching rules, etc., this will cause trouble. You
should avoid scanning those messages if possible. (An easy way to do this
is to move them aside, into a folder which is not scanned.)

If the messages you are learning from have already been filtered through
SpamAssassin, the learner will compensate for this. In effect, it learns what
each message would look like if you had run C<spamassassin -d> over it in
advance.

Another thing to be aware of is that typically you should aim to train
with at least 1000 messages of spam, and 1000 ham messages, if
possible. More is better, but anything over about 5000 messages does not
improve accuracy significantly in our tests.

Be careful that you train from the same source -- for example, if you train
on old spam, but new ham mail, then the classifier will think that
a mail with an old date stamp is likely to be spam.

It's also worth noting that training with a very small quantity of
ham will produce atrocious results. You should aim to train with at
least as much ham data as spam (or more if possible!).

On an on-going basis, it is best to keep training the filter to make
sure it has fresh data to work from. There are various ways to do
this:

=over 4

=item 1. Supervised learning

This means keeping a copy of all or most of your mail, separated into spam
and ham piles, and periodically re-training using those. It produces
the best results, but requires more work from you, the user.

(An easy way to do this, by the way, is to create a new folder for
'deleted' messages, and instead of deleting them from other folders,
simply move them in there. Then keep all spam in a separate
folder and never delete it. As long as you remember to move misclassified
mails into the correct folder set, it is easy enough to keep up to date.)

=item 2. Unsupervised learning from Bayesian classification

Another way to train is to chain the results of the Bayesian classifier
back into the training, so it reinforces its own decisions. This is only
safe if you then retrain it based on any errors you discover.

SpamAssassin does not support this method, due to experimental results
which strongly indicate that it does not work well, and since Bayes is
only one part of the resulting score presented to the user (while Bayes
may have made the wrong decision about a mail, it may have been overridden
by another system).

=item 3. Unsupervised learning from SpamAssassin rules

Also called 'auto-learning' in SpamAssassin. Based on statistical
analysis of the SpamAssassin success rates, we can automatically train the
Bayesian database with a certain degree of confidence that our training
data is accurate.

It should be supplemented with some supervised training in addition, if
possible.

This is the default, but can be turned off by setting the SpamAssassin
configuration parameter C<bayes_auto_learn> to 0.

=item 4. Mistake-based training

This means training on a small number of mails, then only training on
messages that SpamAssassin classifies incorrectly. This works, but it
takes longer to get it right than a full training session would.

=back

=head1 FILES

B<sa-learn> and the other parts of SpamAssassin's Bayesian learner
use a set of persistent database files to store the learnt tokens, as follows.

=over 4

=item bayes_toks

The database of tokens, containing the tokens learnt, their count of
occurrences in ham and spam, and the timestamp when the token was last
seen in a message.

This database also contains some 'magic' tokens, as follows: the version
number of the database, the number of ham and spam messages learnt, the
number of tokens in the database, and timestamps of: the last journal
sync, the last expiry run, the last expiry token reduction count, the
last expiry timestamp delta, the oldest token timestamp in the database,
and the newest token timestamp in the database.

This is a database file, using C<DB_File>. The database 'version
number' is 0 for databases from 2.5x, 1 for databases from certain 2.6x
development releases, 2 for 2.6x, and 3 for 3.0 and later releases.

=item bayes_seen

A map of Message-Id and some data from headers and body to what that
message was learnt as. This is used so that SpamAssassin can avoid
re-learning a message it has already seen, and so it can reverse the
training if you later decide that message was learnt incorrectly.

This is a database file, using C<DB_File>.

=item bayes_journal

While SpamAssassin is scanning mails, it needs to track which tokens
it uses in its calculations. To avoid the contention of having each
SpamAssassin process attempting to gain write access to the Bayes DB,
the token timestamps are written to a 'journal' file which will later
(either automatically or via C<sa-learn --sync>) be used to synchronize
the Bayes DB.

Also, through the use of C<bayes_learn_to_journal>, or when using the
C<--no-sync> option with sa-learn, the actual learning data will
be placed into the journal for later synchronization. This is typically
useful for high-traffic sites to avoid the same contention as stated
above.

=back
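As a rough mental model of how the token and seen data fit together, here is
an in-memory sketch in Python. It is an illustration only, not the on-disk
C<DB_File> layout. The class names, field names and the C<learn>/C<forget>
methods are assumptions made for this example, and the journal is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TokenRecord:
    spam_count: int = 0
    ham_count: int = 0
    atime: int = 0  # when the token was last seen in a message


@dataclass
class BayesModel:
    toks: dict = field(default_factory=dict)  # token -> TokenRecord
    seen: dict = field(default_factory=dict)  # message id -> "spam" or "ham"
    nspam: int = 0  # 'magic' counters: messages learnt as spam / ham
    nham: int = 0

    def learn(self, msg_id, tokens, is_spam, now):
        if msg_id in self.seen:
            return False  # already learnt, do not count it twice
        for tok in set(tokens):
            rec = self.toks.setdefault(tok, TokenRecord())
            if is_spam:
                rec.spam_count += 1
            else:
                rec.ham_count += 1
            rec.atime = max(rec.atime, now)
        self.seen[msg_id] = "spam" if is_spam else "ham"
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        return True

    def forget(self, msg_id, tokens):
        learnt_as = self.seen.pop(msg_id, None)
        if learnt_as is None:
            return False  # never learnt, nothing to reverse
        for tok in set(tokens):
            rec = self.toks.get(tok)
            if rec is None:
                continue
            if learnt_as == "spam":
                rec.spam_count = max(0, rec.spam_count - 1)
            else:
                rec.ham_count = max(0, rec.ham_count - 1)
            if rec.spam_count == 0 and rec.ham_count == 0:
                del self.toks[tok]  # no evidence left for this token
        if learnt_as == "spam":
            self.nspam -= 1
        else:
            self.nham -= 1
        return True
```

The C<seen> map is what makes both behaviours possible: a second C<learn> of
the same message is ignored, and C<forget> knows which counters to reverse.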

=head1 EXPIRATION

Since SpamAssassin can auto-learn messages, the Bayes database files
could increase perpetually until they fill your disk. To control this,
SpamAssassin performs journal synchronization and bayes expiration
periodically when certain criteria (listed below) are met.

SpamAssassin can sync the journal and expire the DB tokens either
manually or opportunistically. A journal sync is due if I<--sync>
is passed to sa-learn (manual), or if the following is true
(opportunistic):

=over 4

=item - bayes_journal_max_size does not equal 0 (0 means don't sync)

=item - the journal file exists

=back

and either:

=over 4

=item - the journal file has a size greater than bayes_journal_max_size

=back

or

=over 4

=item - a journal sync has previously occurred, and at least 1 day has
passed since that sync

=back
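The opportunistic journal sync criteria can be written as a single predicate.
This Python sketch only restates the conditions listed above. The function
name, the parameter names and the use of C<None> for "no journal file" and
"never synced" are assumptions for illustration, not SpamAssassin code.

```python
DAY = 24 * 3600  # seconds


def journal_sync_due(max_size, journal_size, last_sync, now):
    """Return True if an opportunistic journal sync is due.

    journal_size is None when the journal file does not exist;
    last_sync is None when no sync has ever occurred.
    """
    if max_size == 0:
        return False  # bayes_journal_max_size of 0 means don't sync
    if journal_size is None:
        return False  # nothing to sync
    if journal_size > max_size:
        return True
    return last_sync is not None and now - last_sync >= DAY
```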

Expiry is due if I<--force-expire> is passed to sa-learn (manual),
or if all of the following are true (opportunistic):

=over 4

=item - the last expire was attempted at least 12hrs ago

=item - bayes_auto_expire does not equal 0

=item - the number of tokens in the DB is > 100,000

=item - the number of tokens in the DB is > bayes_expiry_max_db_size

=item - there is at least a 12 hr difference between the oldest and newest token atimes

=back
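The opportunistic expiry criteria above translate into one boolean
expression. This is a sketch in Python with hypothetical function and
parameter names. Timestamps are assumed to be in seconds.

```python
HOUR = 3600  # seconds


def expiry_due(now, last_expire_attempt, auto_expire, num_tokens,
               max_db_size, oldest_atime, newest_atime):
    """Return True only if every opportunistic expiry condition holds."""
    return (
        now - last_expire_attempt >= 12 * HOUR
        and auto_expire != 0
        and num_tokens > 100_000
        and num_tokens > max_db_size
        and newest_atime - oldest_atime >= 12 * HOUR
    )
```

Note that with the default C<bayes_expiry_max_db_size> of 150,000 the
100,000 token check is already implied. It only matters when the
configured maximum is set lower than that.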

=head2 EXPIRE LOGIC

If either the manual or opportunistic method causes an expire run
to start, here is the logic that is used:

=over 4

=item - figure out how many tokens to keep. Take the larger of
either bayes_expiry_max_db_size * 75% or 100,000 tokens. Therefore, the goal
reduction is number of tokens - number of tokens to keep.

=item - if the reduction number is < 1000 tokens, abort (not worth the effort).

=item - if an expire has been done before, guesstimate the new
atime delta based on the old atime delta. (new_atime_delta =
old_atime_delta * old_reduction_count / goal)

=item - if no expire has been done before, or the last expire looks
"weird", do an estimation pass. The definition of "weird" is:

=over 8

=item - last expire over 30 days ago

=item - last atime delta was < 12 hrs

=item - last reduction count was < 1000 tokens

=item - estimated new atime delta is < 12 hrs

=item - the difference between the last reduction count and the goal reduction count is > 50%

=back

=back
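The decision steps above can be followed with concrete numbers in a small
Python sketch. The names C<LastExpire> and C<plan_expiry> are hypothetical.
The text does not say what the "difference > 50%" is measured against, so
the sketch assumes it is relative to the goal reduction count.

```python
from dataclasses import dataclass

HOUR = 3600
DAY = 24 * HOUR
MIN_KEEP = 100_000      # never plan to keep fewer tokens than this
MIN_REDUCTION = 1000    # smaller reductions are not worth the effort


@dataclass
class LastExpire:
    age: int              # seconds since the last expire
    atime_delta: int      # atime delta used by the last expire, in seconds
    reduction_count: int  # tokens removed by the last expire


def plan_expiry(num_tokens, max_db_size, last=None):
    """Return (action, goal_reduction, new_atime_delta)."""
    keep = max(int(max_db_size * 0.75), MIN_KEEP)
    goal = num_tokens - keep
    if goal < MIN_REDUCTION:
        return ("abort", goal, None)
    if last is None:
        return ("estimate", goal, None)  # no previous expire to go by
    new_delta = last.atime_delta * last.reduction_count / goal
    weird = (
        last.age > 30 * DAY
        or last.atime_delta < 12 * HOUR
        or last.reduction_count < MIN_REDUCTION
        or new_delta < 12 * HOUR
        # Assumption: "difference > 50%" is measured relative to the goal.
        or abs(last.reduction_count - goal) / goal > 0.5
    )
    if weird:
        return ("estimate", goal, None)
    return ("expire", goal, new_delta)
```

With 200,000 tokens and the default maximum of 150,000, the number to keep
is 112,500 and the goal reduction is 87,500 tokens.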

=head2 ESTIMATION PASS LOGIC

Go through each of the DB's tokens. Starting at 12hrs, calculate
whether or not the token would be expired (based on the difference
between the token's atime and the db's newest token atime) and keep
the count. Work out from 12hrs exponentially by powers of 2, i.e.
12hrs * 1, 12hrs * 2, 12hrs * 4, 12hrs * 8, and so on, up to 12hrs
* 512 (6144hrs, or 256 days).

The larger the delta, the smaller the number of tokens that will
be expired. Conversely, the number of tokens goes up as the delta
gets smaller. So starting at the largest atime delta, figure out
which delta will expire the most tokens without going above the
goal expiration count. Use this to choose the atime delta to use,
unless one of the following occurs:

=over 8

=item - the largest atime delta (smallest reduction count) would expire
too many tokens. This means the learned tokens are mostly old and
new tokens need to be learned before an expire can occur.

=item - all of the atime choices result in 0 tokens being removed.
This means the tokens are all newer than 12 hours and new tokens need
to be learned before an expire can occur.

=item - the number of tokens that would be removed is < 1000. The
benefit isn't worth the effort. More tokens need to be learned.

=back

If the expire run gets past this point, it will continue to the end.
A new DB is created since the majority of DB libraries don't shrink the
DB file when tokens are removed. So we do the "create new, migrate old
to new, remove old, rename new" shuffle.
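The estimation pass can be sketched in Python as follows. This illustrates
the description above under stated assumptions and is not the actual
SpamAssassin implementation. The function name is hypothetical, token atimes
are assumed to be plain timestamps in seconds, and a token is assumed to
expire when its age relative to the newest token exceeds the delta.

```python
HOUR = 3600
BASE_DELTA = 12 * HOUR
MIN_REDUCTION = 1000


def estimation_pass(atimes, goal_reduction):
    """Pick an atime delta in seconds, or None if the expire should not run."""
    if not atimes:
        return None
    newest = max(atimes)
    deltas = [BASE_DELTA * 2 ** k for k in range(10)]  # 12h * 1 .. 12h * 512
    # For each candidate delta, count the tokens that would be expired.
    counts = {d: sum(1 for t in atimes if newest - t > d) for d in deltas}
    largest = deltas[-1]
    if counts[largest] > goal_reduction:
        return None  # even the largest delta would expire too many tokens
    if all(c == 0 for c in counts.values()):
        return None  # every token is newer than 12 hours
    chosen = largest
    for d in reversed(deltas):  # walk from the largest delta downwards
        if counts[d] <= goal_reduction:
            chosen = d  # expires more tokens, still within the goal
        else:
            break
    if counts[chosen] < MIN_REDUCTION:
        return None  # not worth the effort
    return chosen
```

Because the counts can only grow as the delta shrinks, the loop can stop at
the first delta that exceeds the goal.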

=head2 EXPIRY RELATED CONFIGURATION SETTINGS

=over 4

=item C<bayes_auto_expire> is used to specify whether or not SpamAssassin
ought to opportunistically attempt to expire the Bayes database.
The default is 1 (yes).

=item C<bayes_expiry_max_db_size> specifies both the auto-expire token
count point, as well as the resulting number of tokens after expiry
as described above. The default value is 150,000, which is roughly
equivalent to a 6 MB database file if you're using DB_File.

=item C<bayes_journal_max_size> specifies how large the Bayes
journal will grow before it is opportunistically synced. The
default value is 102400.

=back

=head1 INSTALLATION

The B<sa-learn> command is part of the B<Mail::SpamAssassin> Perl module.
Install this as a normal Perl module, using C<perl -MCPAN -e shell>,
or by hand.

=head1 SEE ALSO

spamassassin(1)
spamc(1)
Mail::SpamAssassin(3)
Mail::SpamAssassin::ArchiveIterator(3)

E<lt>http://www.paulgraham.com/E<gt>
Paul Graham's "A Plan For Spam" paper

E<lt>http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.htmlE<gt>
Gary Robinson's f(x) and combining algorithms, as used in SpamAssassin

E<lt>http://www.bgl.nu/~glouis/bogofilter/E<gt>
'Training on error' page. A discussion of various Bayes training regimes,
including 'train on error' and unsupervised training.

=head1 PREREQUISITES

C<Mail::SpamAssassin>

=head1 AUTHORS

The SpamAssassin(tm) Project E<lt>http://spamassassin.apache.org/E<gt>

=cut
