How to Train SpamAssassin
SpamAssassin won’t do much if it hasn’t been trained. While it does come with a few plugins enabled for DKIM, SPF, RBL, and content checks, SpamAssassin is limited unless you train its Bayesian filter. The Bayesian filter compares new messages against content learned from known spam and ham emails to determine the likelihood that a message is spam.
Bayes’ theorem, named after 18th-century British mathematician Thomas Bayes, is a mathematical formula for determining conditional probability. Conditional probability is the likelihood of an event occurring given that another event has already occurred.
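As a rough illustration of how this applies to email filtering, the theorem can be written in terms of a single token found in a message:

P(spam | token) = P(token | spam) × P(spam) / P(token)

SpamAssassin learns a probability like this for each token it has seen and combines them to score a whole message; these per-token probabilities show up again later when we dump the Bayes database.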
Training SpamAssassin against your own data is preferred, but it’s only effective if you have a significant amount of spam and ham available. There are a few online databases available to initially feed into SpamAssassin’s Bayesian database. Once the initial training is done, it’s recommended to routinely train SpamAssassin: spam is always changing, and the more training you do with your own data, the more accurate the filter becomes.
Using SA-Learn for Training
The tool to train SpamAssassin is sa-learn. In default usage, it will take a directory of spam or ham emails and add their tokens to the database. A token is a word or short character sequence commonly found in spam or ham.
It’s important to run sa-learn as the same user who starts spamc inside of your mail content filter.
You can either manually run sa-learn or, preferably, add it to a cron job to routinely update the database. The utility sa-learn will ignore emails that have already been processed to prevent adding extra weight to certain tokens.
The commands below learn spam and ham, respectively, from a folder containing emails. This is the common method when using the Maildir format.
$ sa-learn --spam /path/to/spam/folder
$ sa-learn --ham /path/to/ham/folder
It’s also possible to teach SpamAssassin from a single email or using mbox or mbx formats. Read about additional options on the man page.
$ sa-learn --spam /path/to/spam.email
$ sa-learn --mbox /var/mail/user
$ sa-learn --mbx /var/mbx/mail/test
SpamAssassin’s learning utility can handle wildcards inside of the path. This is helpful to quickly update the Bayes database for many users. You may also use curly braces to match one of several possible folder names in the path.
$ sa-learn --spam /var/vmail/*/Maildir/Spam/{cur,new}
$ sa-learn --ham /var/vmail/*/Maildir/cur
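Whichever form you use, remember the earlier note about users: the training has to run as the user who starts spamc. As a sketch, on a Debian system where the filter runs as debian-spamd (adjust the user name and paths for your setup, and make sure that user can read the mail folders), that might look like this:

$ sudo -u debian-spamd sa-learn --spam /path/to/spam/folder
$ sudo -u debian-spamd sa-learn --ham /path/to/ham/folder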
Initial Training Data
There are a few sources of spam and ham emails available to download online. There are also SpamAssassin backups available that you can restore to get started. Using public spam data is helpful to get started but may not be specific to your use case. It’s important to keep training SpamAssassin with incoming emails you receive.
ArtInvoice.hu Spam Archive
This is my preferred data source for training as it has an initial database you can restore. Every day the service archives newly received spam for you to train with. It appears that since 2020 most of their archives don’t have new messages.
Untroubled.org Spam Archive
An archive of spam received since 1998. Training with emails that old won’t be of much help, but they are still active and update their archives daily. Definitely a good source to train with.
Old SpamAssassin Data
The old public corpus data from SpamAssassin is still available, but it’s from 2002 to 2005. I would avoid training with this data since spam has evolved.
https://spamassassin.apache.org/old/publiccorpus/
Viewing Trained Data
You can view the trained spam and ham data by using the sa-learn --dump [all|data|magic] command. It’s not possible to see the tokens in the database though, as they have been hashed.
You’ll be able to see the number of spam and ham emails that have been added to the database, along with the number of tokens, when the journal was last synced, and the token expiry settings.
$ sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 1441389 0 non-token data: nspam
0.000 0 516849 0 non-token data: nham
0.000 0 166775 0 non-token data: ntokens
0.000 0 1574031977 0 non-token data: oldest atime
0.000 0 1607720754 0 non-token data: newest atime
0.000 0 1607720758 0 non-token data: last journal sync atime
0.000 0 1607720761 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
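The atime values in this output are Unix timestamps. To read one as a normal date, you can convert it with GNU date; for example, the newest atime from the dump above (the exact output format depends on your date version, locale, and timezone):

$ date -u -d @1607720754
Fri Dec 11 21:05:54 UTC 2020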
You can also view encoded token data using a similar call. The output is formatted into five whitespace-delimited fields. The fields, in order from left to right, are as follows:
- The probability that the token is spam
- Number of spam emails with the token
- Number of ham emails with the token
- The time the token was last accessed while training
- The encoded version of the token
$ sa-learn --dump data
0.995 3 0 1575289038 008a1fb253
0.008 0 2 1607626751 03b82a68a9
0.993 2 0 1576254819 076aeef7fa
0.999 12 0 1574718455 0919d38b9c
0.987 1 0 1575615165 09cdb989a8
0.987 1 0 1574931501 0bedcc3ea2
0.016 0 1 1575120730 0e9e73e4fb
0.987 1 0 1576312660 10ae0462e8
1.000 277 0 1576750116 10f08309d0
0.008 0 2 1575368277 11e21177a3
Trained Data Storage
Depending on your SpamAssassin setup, Bayes data can be stored in Berkeley DB, DBM, MySQL, PostgreSQL, SDBM, or Redis. Configuring other storage methods is outside the scope of this tutorial.
With default settings, trained data is stored inside the DBM database files bayes_journal, bayes_seen, and bayes_toks. These files can be found in the user’s home directory inside the folder .spamassassin. The sa-learn tool handles all of these files for you.
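As a quick sanity check, you can list that folder for the user doing the training. Assuming the default DBM backend, a typical listing looks something like this (you may also see other files, such as user_prefs):

$ ls ~/.spamassassin/
bayes_journal  bayes_seen  bayes_toks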
Backing Up and Restoring the Database
Backup and restoration of SpamAssassin’s Bayes database can also be performed using the sa-learn command. When running a multi-user email server, you’ll have to back up each user’s Bayes database unless you’re using a server-wide database.
$ sa-learn --backup > /var/backup/spamassassin.bak
$ sa-learn --restore /var/backup/spamassassin.bak
The backup will contain the learned tokens and the seen-message data in a single file.
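For a multi-user server with per-user Bayes files, one approach is to point sa-learn at each database explicitly with the --dbpath option. The script below is only a sketch: it assumes per-user databases under /var/vmail/<user>/.spamassassin and a backup directory of /var/backup, so adjust the paths for your layout.

#!/usr/bin/env bash
# Back up each user's Bayes database into its own file
for dir in /var/vmail/*; do
    user="$(basename "${dir}")"
    sa-learn --dbpath "${dir}/.spamassassin/bayes" --backup > "/var/backup/${user}-bayes.bak"
done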
Cron Jobs and Scripting
When using sa-learn inside of a script, you can increase performance by using the --sync and --no-sync options. The learning utility is slow when it has to sync data to the Bayesian database. If you’re performing many sa-learn commands, it’s better to write to the journal and then sync with the database at the end.

Below is an example script in which every call includes the --no-sync option, and at the end we sync the journal and the database.
#!/usr/bin/env bash
# Learn from each mailbox's spam and ham folders without syncing yet
for folder in /var/mail/*; do
    sa-learn --no-sync --spam "${folder}/spam"
    sa-learn --no-sync --ham "${folder}/ham"
done
# Sync the journal with the Bayes database once at the end
sa-learn --sync
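To run a script like this on a schedule, add a line to the crontab of the user who runs spamc (for example with crontab -e as that user). The script path and time below are only placeholders:

# Run the Bayes training script every night at 02:30
30 2 * * * /usr/local/bin/train-spamassassin.sh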
Troubleshooting
Below are common issues you may come across. I’ve fallen for a few of them myself in the past.
Spam is still getting through!
Make sure you trained the correct databases. When you run sa-learn, it trains the database belonging to the user who runs it. Your MTA likely calls spamc as a dedicated user, such as debian-spamd. You will need to run sa-learn as the same user as spamc.
This can be verified by checking the home directory of each user for a .spamassassin folder. You may find that you trained data for the root user by mistake. As a note, debian-spamd has a home directory of /var/lib/spamassassin.
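A quick way to confirm which database actually has data is to dump the Bayes counters as each user and compare. Assuming the debian-spamd setup described above:

$ sudo -u debian-spamd sa-learn --dump magic
$ sa-learn --dump magic

If nspam and nham are zero for debian-spamd but large for the account you normally log in as, you’ve been training the wrong database.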