How to Train SpamAssassin
SpamAssassin won’t do much if it hasn’t been trained. While it does come with a few plugins enabled for DKIM, SPF, RBL, and content checks, SpamAssassin is limited unless you train its Bayesian filter. The Bayesian filter compares new messages against content learned from known spam and ham emails to determine the likelihood that a message is spam.
Bayes’ theorem, named after 18th-century British mathematician Thomas Bayes, is a mathematical formula for determining conditional probability. Conditional probability is the likelihood of an event occurring given that another event has already occurred.
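As a rough illustration of how this applies to email filtering, the theorem can be written in terms of a single token found in a message:

P(spam | token) = P(token | spam) × P(spam) / P(token)

SpamAssassin learns a probability like this for each token it has seen and combines them to score a whole message; these per-token probabilities show up again later when we dump the Bayes database.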
Training SpamAssassin against your own data is preferred, but it’s only effective if you have a significant amount of spam and ham available. There are a few online databases available to initially feed into SpamAssassin’s Bayesian database. Once the initial training is done, it’s recommended to routinely train SpamAssassin: spam is always changing, and the more training you do with your own data, the more accurate the filter becomes.
Using SA-Learn for Training
The tool to train SpamAssassin is sa-learn. In default usage, it will take a directory of spam or ham emails and add their tokens to the database. A token is a word or short character sequence commonly found in spam or ham.
It’s important to run sa-learn as the same user who starts spamc inside of your mail content filter.
You can either manually run sa-learn or, preferably, add it to a cron job to routinely update the database. The utility sa-learn will ignore emails that have already been processed to prevent adding extra weight to certain tokens.
The commands below learn spam and ham, respectively, from a folder containing emails. This is the common method when using the Maildir format.
$ sa-learn --spam /path/to/spam/folder
$ sa-learn --ham /path/to/ham/folder
It’s also possible to teach SpamAssassin from a single email or using mbox or mbx formats. Read about additional options on the man page.
$ sa-learn --spam /path/to/spam.email
$ sa-learn --mbox /var/mail/user
$ sa-learn --mbx /var/mbx/mail/test
SpamAssassin’s learning utility can handle wildcards inside of the path. This is helpful to quickly update the Bayes database for many users. You may also use curly braces to match one of several possible folder names in the path.
$ sa-learn --spam /var/vmail/*/Maildir/Spam/{cur,new}
$ sa-learn --ham /var/vmail/*/Maildir/cur
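Whichever form you use, remember the earlier note about users: the training has to run as the user who starts spamc. As a sketch, on a Debian system where the filter runs as debian-spamd (adjust the user name and paths for your setup, and make sure that user can read the mail folders), that might look like this:

$ sudo -u debian-spamd sa-learn --spam /path/to/spam/folder
$ sudo -u debian-spamd sa-learn --ham /path/to/ham/folder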
Initial Training Data
There are a few sources of spam and ham emails available to download online. There are also SpamAssassin backups available that you can restore to get started. Using public spam data is helpful to get started but may not be specific to your use case. It’s important to keep training SpamAssassin with incoming emails you receive.
ArtInvoice.hu Spam Archive
This is my preferred data source for training as it has an initial database you can restore. Every day the service archives newly received spam for you to train with. It appears that since 2020 most of their archives don’t have new messages.
Untroubled.org Spam Archive
An archive of spam received since 1998. Training with emails that old won’t be of much help, but they are still active and update their archives daily. Definitely a good source to train with.
Old SpamAssassin Data
The old public corpus data from SpamAssassin is still available, but it’s from 2002 to 2005. I would avoid training with this data since spam has evolved.
https://spamassassin.apache.org/old/publiccorpus/
Viewing Trained Data
You can view the trained spam and ham data by using the sa-learn --dump [all|data|magic] command. It’s not possible to see the tokens in the database though, as they have been hashed.
You’ll be able to see the number of spam and ham emails that have been added to the database, along with the number of tokens, when the journal was last synced, and the token expiry settings.
$ sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 1441389 0 non-token data: nspam
0.000 0 516849 0 non-token data: nham
0.000 0 166775 0 non-token data: ntokens
0.000 0 1574031977 0 non-token data: oldest atime
0.000 0 1607720754 0 non-token data: newest atime
0.000 0 1607720758 0 non-token data: last journal sync atime
0.000 0 1607720761 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
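The atime values in this output are Unix timestamps. To read one as a normal date, you can convert it with GNU date; for example, the newest atime from the dump above (the exact output format depends on your date version, locale, and timezone):

$ date -u -d @1607720754
Fri Dec 11 21:05:54 UTC 2020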
You can also view encoded token data using a similar call. The output is formatted into five whitespace-delimited fields. The fields, in order from left to right, are as follows:
- The probability that the token is spam
- Number of spam emails with the token
- Number of ham emails with the token
- The time the token was last accessed while training
- The encoded version of the token
$ sa-learn --dump data
0.995 3 0 1575289038 008a1fb253
0.008 0 2 1607626751 03b82a68a9
0.993 2 0 1576254819 076aeef7fa
0.999 12 0 1574718455 0919d38b9c
0.987 1 0 1575615165 09cdb989a8
0.987 1 0 1574931501 0bedcc3ea2
0.016 0 1 1575120730 0e9e73e4fb
0.987 1 0 1576312660 10ae0462e8
1.000 277 0 1576750116 10f08309d0
0.008 0 2 1575368277 11e21177a3
Trained Data Storage
Depending on your SpamAssassin setup, Bayes data can be stored in Berkeley DB, DBM, MySQL, PostgreSQL, SDBM, or Redis. Configuring other storage methods is outside the scope of this tutorial.
With default settings, trained data is stored inside the DBM database files bayes_journal, bayes_seen, and bayes_toks. These files can be found in the user’s home directory inside the folder .spamassassin. The sa-learn tool handles all of these files for you.
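As a quick sanity check, you can list that folder for the user doing the training. Assuming the default DBM backend, a typical listing looks something like this (you may also see other files, such as user_prefs):

$ ls ~/.spamassassin/
bayes_journal  bayes_seen  bayes_toks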
Backing Up and Restoring the Database
Backup and restoration of SpamAssassin’s Bayes database can also be performed using the sa-learn command. When running a multi-user email server, you’ll have to back up each user’s Bayes database unless you’re using a server-wide database.
$ sa-learn --backup > /var/backup/spamassassin.bak
$ sa-learn --restore /var/backup/spamassassin.bak
The backup will contain the learned tokens and the seen-message data in a single file.
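For a multi-user server with per-user Bayes files, one approach is to point sa-learn at each database explicitly with the --dbpath option. The script below is only a sketch: it assumes per-user databases under /var/vmail/<user>/.spamassassin and a backup directory of /var/backup, so adjust the paths for your layout.

#!/usr/bin/env bash
# Back up each user's Bayes database into its own file
for dir in /var/vmail/*; do
    user="$(basename "${dir}")"
    sa-learn --dbpath "${dir}/.spamassassin/bayes" --backup > "/var/backup/${user}-bayes.bak"
done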
Cron Jobs and Scripting
When using sa-learn inside of a script, you can increase performance by using the --sync and --no-sync options. The learning utility is slow when it has to sync data to the Bayesian database. If you’re performing many sa-learn commands, it’s better to write to the journal and then sync with the database at the end.

Below is an example script in which every call includes the --no-sync option, and at the end we sync the journal and the database.
#!/usr/bin/env bash
# Learn from each mailbox's spam and ham folders without syncing yet
for folder in /var/mail/*; do
    sa-learn --no-sync --spam "${folder}/spam"
    sa-learn --no-sync --ham "${folder}/ham"
done
# Sync the journal with the Bayes database once at the end
sa-learn --sync
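To run a script like this on a schedule, add a line to the crontab of the user who runs spamc (for example with crontab -e as that user). The script path and time below are only placeholders:

# Run the Bayes training script every night at 02:30
30 2 * * * /usr/local/bin/train-spamassassin.sh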
Troubleshooting
Below are common issues you may come across. I’ve fallen for a few of them myself in the past.
Spam is still getting through!
Make sure you trained the correct databases. When you run sa-learn, it trains the database belonging to the user who runs it. Your MTA likely calls spamc as a dedicated user, such as debian-spamd. You will need to run sa-learn as the same user as spamc.
This can be verified by checking the home directory of each user for a .spamassassin folder. You may find that you trained data for the root user by mistake. As a note, debian-spamd has a home directory of /var/lib/spamassassin.
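A quick way to confirm which database actually has data is to dump the Bayes counters as each user and compare. Assuming the debian-spamd setup described above:

$ sudo -u debian-spamd sa-learn --dump magic
$ sa-learn --dump magic

If nspam and nham are zero for debian-spamd but large for the account you normally log in as, you’ve been training the wrong database.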