Koozali.org: home of the SME Server

Obsolete Releases => SME Server 7.x => Topic started by: hmuhammad on December 29, 2011, 05:25:51 PM

Title: configuring spamassassin to autolearn versus RBL's to fight spam
Post by: hmuhammad on December 29, 2011, 05:25:51 PM

Because users were complaining that email from customers using yahoo.com, verizon.net and others was being rejected, the SME Server default RBLs were changed.

Per http://wiki.contribs.org/Email#Real-time_Blackhole_List_.28RBL.29, which says:
"Many will argue what's best; some say the SME defaults are too aggressive and affect some popular free webmail accounts, but most would agree that you can set stable, conservative and non-aggressive settings by" changing to:
config setprop qpsmtpd RBLList zen.spamhaus.org:whois.rfc-ignorant.org:dnsbl.njabl.org
signal-event email-update
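
For context on what each entry in RBLList does: a DNSBL check reverses the octets of the connecting IPv4 address and looks the result up under each listed zone. The sketch below only builds those query names. The function name dnsbl_queries is made up for illustration and is not part of SME Server or qpsmtpd.

```shell
# Hypothetical helper: print the DNS name a DNSBL check would look up
# for an IPv4 address against each zone in a colon-separated RBLList value.
dnsbl_queries() {
    ip=$1
    zones=$2
    # Reverse the octets: 192.0.2.10 -> 10.2.0.192
    reversed=$(printf '%s\n' "$ip" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }')
    # Split the list on ":" and emit one query name per non-empty zone
    printf '%s\n' "$zones" | tr ':' '\n' | while IFS= read -r zone; do
        if [ -n "$zone" ]; then
            printf '%s.%s\n' "$reversed" "$zone"
        fi
    done
}

dnsbl_queries 127.0.0.2 "zen.spamhaus.org:whois.rfc-ignorant.org:dnsbl.njabl.org"
```

Passing one of the printed names to host or dig shows whether the address is listed: an answer in 127.0.0.0/8 means listed, NXDOMAIN means not listed. 127.0.0.2 is the conventional test entry that DNSBLs are expected to list.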

Now the site is receiving too much junk mail in users' inboxes, so we're exploring spamassassin autolearning.

Does anyone have experience or insight into configuring spamassassin autolearning, particularly BayesAutoLearnThresholdSpam and BayesAutoLearnThresholdNonspam?

Any comments on the following two approaches to setting these two thresholds?

Approach #1
From... http://wiki.contribs.org/Email#Bayesian_Autolearning

Bayesian Autolearning

The default SME settings do not include bayesian filtering in spamassassin to allow spamassassin to learn from received email and improve over time.
The following command will enable the bayesian learning filter and set thresholds for the bayesian filter.

config setprop spamassassin UseBayes 1
config setprop spamassassin BayesAutoLearnThresholdSpam 4.00
config setprop spamassassin BayesAutoLearnThresholdNonspam 0.10
expand-template /etc/mail/spamassassin/local.cf
sa-learn --sync --dbpath /var/spool/spamd/.spamassassin -u spamd
chown spamd.spamd /var/spool/spamd/.spamassassin/bayes_*
chown spamd.spamd /var/spool/spamd/.spamassassin/bayes.mutex
chmod 640 /var/spool/spamd/.spamassassin/bayes_*
config setprop spamassassin status enabled
config setprop spamassassin RejectLevel 12
config setprop spamassassin TagLevel 4
config setprop spamassassin Sensitivity custom
signal-event email-update

These commands will:

enable spamassassin
configure spamassassin to reject any email with a score above 12
tag spam scored between 4 and 12 in the email header
enable bayesian filter
'autolearn' as SPAM any email with a score above 4.00
'autolearn' as HAM any email with a score below 0.10
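
To confirm the thresholds actually reached SpamAssassin after expand-template, the generated file can be inspected. This is a minimal sketch, assuming the SME template writes the standard SpamAssassin directives (use_bayes, bayes_auto_learn, bayes_auto_learn_threshold_spam, bayes_auto_learn_threshold_nonspam) into local.cf. show_bayes_settings is a hypothetical helper name.

```shell
# Hypothetical helper: list the active (uncommented) Bayes-related directives
# in a SpamAssassin config file. Defaults to the path used in the commands above.
show_bayes_settings() {
    file=${1:-/etc/mail/spamassassin/local.cf}
    grep -E '^[[:space:]]*(use_bayes|bayes_auto_learn)' "$file"
}

# Only run against the default path if it exists on this machine.
if [ -r /etc/mail/spamassassin/local.cf ]; then
    show_bayes_settings
fi
```

If nothing is printed, the directives are either missing or commented out in the expanded file.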

Approach #2
From... http://www.maiamailguard.com/maia/wiki/sa-autolearn

SpamAssassin offers an "auto-learn" mechanism for training your Bayes database automatically, so long as the mail being scanned scores conservatively enough. You can enable this feature and define these conservative thresholds in your local.cf file:

bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam -5.0
bayes_auto_learn_threshold_spam 15.0

The bayes_auto_learn_threshold_nonspam setting defines the cutoff level for SpamAssassin to auto-learn a non-spam item. As long as the item scores at or below this threshold, it will be learned automatically as non-spam. Since there aren't as many rules designed to identify non-spam as there are for identifying spam, this threshold usually doesn't need to be far below 0; a value of -5 is plenty in most cases.

The bayes_auto_learn_threshold_spam setting works the same way for spam, except that in this case it applies to items that score at or above this threshold. A value of 15 or so is conservative enough in most cases, though you can also examine the system-wide statistics for your site to find out what the highest-scoring false-positive was, and use that as a starting point.

Mail that scores anywhere between those two thresholds will not automatically be learned by the Bayes engine, so they will need to be confirmed as spam or non-spam by human beings using Maia's web interface if they are to be used for learning purposes at all.
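
The threshold rule quoted above can be sketched as a small shell function. This is a simplification for comparing the two approaches, not SpamAssassin's actual code: as far as I know, the real autolearn logic scores the message without the Bayes rules themselves and also requires a minimum number of points from both header and body rules before learning as spam. autolearn_decision is a made-up name.

```shell
# Simplified sketch of the threshold decision described above.
# Usage: autolearn_decision SCORE NONSPAM_THRESHOLD SPAM_THRESHOLD
autolearn_decision() {
    awk -v s="$1" -v ham="$2" -v spam="$3" 'BEGIN {
        if (s + 0 <= ham + 0)       print "ham"    # at or below the nonspam threshold
        else if (s + 0 >= spam + 0) print "spam"   # at or above the spam threshold
        else                        print "none"   # in between: not learned
    }'
}

# Approach #2 thresholds (-5.0 / 15.0)
autolearn_decision -6.2 -5.0 15.0   # ham
autolearn_decision  7.5 -5.0 15.0   # none
autolearn_decision 18.1 -5.0 15.0   # spam

# The same 7.5 message under the Approach #1 thresholds (0.10 / 4.00)
autolearn_decision  7.5 0.10 4.00   # spam
```

The last call shows the practical difference: a message scoring 7.5 is ignored under the conservative thresholds but learned as spam under the wiki's values.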

Thanks,
Hasan

Update:

After executing...

# Match only the score field (hits= or score=), so the required=4.0 field is not counted
find /home/e-smith/files/users/SOMEUSER/Maildir/.junkmail/cur/ -type f -print0 | xargs -0 egrep -il 'X-Spam-Status:.*(hits|score)=4\.0' | xargs egrep -h '^Subject:' | less
...and...
find /home/e-smith/files/users/SOMEUSER/Maildir/cur/ -type f -print0 | xargs -0 egrep -il 'X-Spam-Status:.*(hits|score)=4\.0' | xargs egrep -h '^Subject:' | less
...and likewise for scores of 3.0, 2.0 and 1.0

...and discovering many emails rated between 1.0 & 4.0 which seem to be spam,
...and similarly discovering many emails rated over 4.0 which seem to be ham,

...then, it would seem to be prudent to use something like...

config setprop spamassassin BayesAutoLearnThresholdSpam 12.00
config setprop spamassassin BayesAutoLearnThresholdNonspam -5.0
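
To pick thresholds from real data rather than from spot checks, the scores in a folder can be tallied in one pass. This is a rough sketch, assuming the X-Spam-Status header carries the score as hits=N.N or score=N.N on its first line. score_histogram is a hypothetical helper, and buckets are the integer part of the score.

```shell
# Hypothetical helper: count messages per integer score bucket in one Maildir folder.
# Output: one "BUCKET COUNT" line per bucket, sorted by bucket.
score_histogram() {
    find "$1" -type f -print0 |
        xargs -0 grep -h -i -m 1 '^X-Spam-Status:' |
        awk 'match($0, /(score|hits)=-?[0-9]+(\.[0-9]+)?/) {
                 value = substr($0, RSTART, RLENGTH)   # e.g. "hits=4.7"
                 sub(/^[a-z]+=/, "", value)            # keep only the number
                 bucket = int(value) + 0               # integer part; + 0 avoids "-0"
                 count[bucket]++
             }
             END { for (b in count) printf "%d %d\n", b, count[b] }' |
        sort -n
}

# Example (path taken from the commands above):
# score_histogram /home/e-smith/files/users/SOMEUSER/Maildir/.junkmail/cur/
```

Running it on both the junkmail folder and the inbox shows where the two score distributions overlap, which is the range the autolearn thresholds should stay outside of.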

Title: Re: configuring spamassassin to autolearn versus RBL's to fight spam
Post by: mmccarn on December 29, 2011, 10:48:07 PM

The rpm from michaelw described here sets up bayesian autolearning automatically:
http://www.sonoracomm.com/support/19-inet-support/49-spam-filter-configuration-for-sme-7

You can do the same thing from the command shell using the commands from the wiki:
http://wiki.contribs.org/Email#Bayesian_Autolearning

Here's an outline of how I set up my SME servers:
http://forums.contribs.org/index.php/topic,33824.msg145697.html#msg145697