Koozali.org: home of the SME Server

[ANNOUNCE] Yet another SPAM learning script

kevinb
« on: September 26, 2009, 05:54:01 PM »
This script is designed for our environment (SOGo web mail with Thunderbird email clients; neither Horde nor Outlook is used). It teaches both spam and ham to Spamassassin.

Please feel free to add any comments!

Requirements:
  • Users are instructed that any email they wish to save must be moved to the "Saved" folder or any folder under the "Saved" folder in Thunderbird or SOGo.
  • In Thunderbird, the default profile that is pushed out to the users has the Junk folder set to "junkmail", and emails flagged as Junk are marked as "read".
  • Users should use the "Junk"/"Not Junk" buttons in Thunderbird to flag spam. This way both Thunderbird and Spamassassin are learning.

Once a day the script will:
  • Search the "junkmail" folder for read emails less than one day old and feed them to "sa-learn" as "spam".
  • Search the "Saved" folder and all folders under it for read emails less than one day old and feed them to "sa-learn" as "ham".
  • Search the "Trash" folder for any emails older than 30 days and delete them.
  • Email the log file to the admin.
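The age tests in the list above rely on find's day-based comparisons. The following is a small sketch of how "0" and "+30" behave, assuming GNU find and GNU touch; it uses -mtime as a stand-in for the script's -ctime, because ctime cannot be set by hand, and the file names are made up for the demonstration.

```shell
# Demonstrate find's day-based age tests on throwaway files.
# -mtime stands in for -ctime here because ctime cannot be set manually.
demo=$(mktemp -d)
touch "$demo/today.eml"
touch -d '2 days ago' "$demo/two_days.eml"
touch -d '40 days ago' "$demo/forty_days.eml"

# Age 0: less than 24 hours old (the window the script feeds to sa-learn).
recent=$(find "$demo" -type f -mtime 0 -printf '%f\n')

# Age +30: more than 30 whole days old (the window the script deletes from Trash).
stale=$(find "$demo" -type f -mtime +30 -printf '%f\n')

echo "recent: $recent"
echo "stale:  $stale"
rm -rf "$demo"
```

The two-day-old file matches neither test, which is why a message that sits untouched for more than a day is skipped by the learning steps.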

In a shell:

Code:
nano -w learnspam.sh
Note that you must replace <your server hostname> with your server's hostname ("server" for "server.sme.org") in the following code.

Code:
#!/bin/bash
# Replace <your server hostname> below with your server's hostname before use.
users=/home/e-smith/files/users
(
date
echo ''
echo ''
# Loop over the user directories with a glob instead of parsing ls output.
for userpath in "$users"/*
do
        userdir=${userpath##*/}
        maildir="$userpath/Maildir"
        echo "$userdir"

        echo "Find email in the junkmail folder that is less than 24 hours old and feed it to sa-learn"
        if test -d "$maildir/.junkmail/cur"; then
                find "$maildir/.junkmail/cur" -iname '*.<your server hostname>*' -type f -ctime 0 -exec sa-learn --spam --no-sync '{}' \;
        else
                echo "    No junkmail folder"
        fi

        echo "Find email in the Saved folder and sub-folders that is less than 24 hours old and feed it to sa-learn"
        if test -d "$maildir/.Saved"; then
                find "$maildir" -path '*/.Saved*cur*.<your server hostname>*' -type f -ctime 0 -exec sa-learn --ham --no-sync '{}' \;
        else
                echo "    No Saved mail folders"
        fi

        echo "Find email in the Trash folder that is more than 30 days old and delete it"
        if test -d "$maildir/.Trash"; then
                find "$maildir/.Trash" -type f -ctime +30 -exec rm -vf '{}' \;
        else
                echo "    No Trash folder"
        fi

        echo
done

# Write the learned tokens to the Bayes database once, after all users are done.
sa-learn --sync
date
) 1>/var/log/teach_spam.log 2>&1

sleep 10
mail -s "Learn SPAM" admin </var/log/teach_spam.log

Code:
chmod +x learnspam.sh
Configure the script to run once a day, preferably just after midnight. I have it set up to run as a pre-command to an AFFA backup.

Known issues and comments:

  • Emails that Spamassassin itself catches as spam and moves to the "junkmail" folder will be taught to Spamassassin as spam again. This is not a problem, since Spamassassin will recognize these emails and not learn new tokens from them, but it does use more resources. A workaround is to leave the default setting in Thunderbird so that it drops spam into the "Junk" folder, have the script learn only from the "Junk" folder, and then move those files to the "junkmail" folder. The downside is that users will have to look for false positives in two folders.
  • Email that is not flagged as "read" is ignored. This can be changed by having the script look in the actual listed folder and not the "cur" sub-folder or have it look in both the "cur" and "new" sub-folders.
  • Emails that are over a day old in the "Inbox" before they are processed by the client or user may be ignored. I am not sure if the file system flags a file as "changed" after a move.
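The open question in the last point (whether a move counts as a "change" for -ctime) can be checked directly on the server. The following is a minimal sketch, assuming GNU stat; the function name is made up for illustration. POSIX leaves it to the implementation whether a rename updates ctime, so the result is worth confirming on the actual mail filesystem rather than assuming it.

```shell
# Report whether mv updates a file's ctime (status-change time)
# on the filesystem that holds the temporary directory.
ctime_after_move() {
    dir=$(mktemp -d)
    touch "$dir/msg"
    before=$(stat -c %Z "$dir/msg")      # ctime in seconds since the epoch (GNU stat)
    sleep 2                              # make sure the clock moves on
    mv "$dir/msg" "$dir/msg.moved"
    after=$(stat -c %Z "$dir/msg.moved")
    rm -rf "$dir"
    if [ "$after" -gt "$before" ]; then
        echo "changed"
    else
        echo "unchanged"
    fi
}

ctime_after_move
```

By default mktemp uses /tmp, which may be a different filesystem from /home/e-smith; setting TMPDIR to a directory on the mail volume before running the function tests the filesystem that matters.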

I hope some of you find this useful.

Kevin

cactus
« Reply #1 on: September 27, 2009, 10:10:18 AM »
Quote from kevinb: "Configure the script to run once a day, preferably just after midnight. I have it set up to run as a pre-command to an AFFA backup."
IMHO that might have the undesired drawback that if the script fails, the pre-backup event fails and no backup is made. Why not just configure a separate cron job for it?

kevinb
« Reply #2 on: September 27, 2009, 03:34:50 PM »
Good point cactus,

I did not test this. I do not know what happens to the AFFA job if the pre-command fails. Does it time out and continue?
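How AFFA reacts to a failing pre-command is not settled in this thread. If it does abort on a non-zero exit status, one defensive option is a small wrapper that reports the failure but always returns success. This is only a sketch: run_nonfatal is a made-up helper name, and /root/learnspam.sh in the usage note is an assumed install path.

```shell
# run_nonfatal CMD [ARGS...]
# Runs the command, reports a failure on stderr, and always returns 0
# so that a caller treating non-zero as fatal is not stopped by it.
run_nonfatal() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "warning: '$*' exited with status $status" >&2
    fi
    return 0
}

# Demonstration with a command that always fails.
run_nonfatal false
echo "wrapper exit status: $?"
```

In a pre-command script this would be used as "run_nonfatal /root/learnspam.sh". Note that the exit status of learnspam.sh as posted is that of its final mail command, not of the sa-learn runs.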

I run it before the backup so that the "find" command looks back 24 hours from a consistent starting point and, if the method of moving files from the Junk folder to the junkmail folder is used, you do not risk moving files during the backup.

A separate cron may be the most advisable method.
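For the separate cron job, a sketch of what the entry could look like follows. The script path /root/learnspam.sh and the 00:05 run time are assumptions, and cron_entry is just a helper made up for this example. Whether a file under /etc/cron.d or a custom template fragment is the right place on a given SME Server release should be checked first, since /etc/crontab there is generated from templates.

```shell
# Assumed location of the script; adjust to wherever learnspam.sh was saved.
SCRIPT=/root/learnspam.sh

# Print a cron.d-style line that runs the given script daily at 00:05 as root.
cron_entry() {
    printf '5 0 * * * root %s\n' "$1"
}

cron_entry "$SCRIPT"
```

Redirecting the output to a file such as /etc/cron.d/learnspam would install the job on systems that read that directory.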

Knuddi
« Reply #3 on: September 29, 2009, 09:47:10 PM »
Did you consider using SpamAssassin Coach for this purpose? As far as I can see, all that is needed is for the spamd process to accept connections from hosts other than localhost. This way your users can train it themselves at whatever pace they like.

http://sourceforge.net/projects/soc2006spamd/

I have not tried it myself, but maybe it's time...

/Jesper

kevinb
« Reply #4 on: September 30, 2009, 01:21:19 AM »
I did not know that project existed ... thanks Jesper.

There is not much there for docs.

Does it only "coach" SA when the user decides whether it's spam or not? Or does it also learn from what TB flags as spam?

BTW ... the above script can learn every spam email if you have TB use the default Junk folder, sa-learn every email in the Junk folder, then move the emails to junkmail. Every ham email can be taught if you use a diff command. But my thought was we'll get enough taught with the simpler method presented above.
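The learn-then-move variant could be sketched roughly as below. This is an illustration under assumptions, not the script from the first post: the Maildir folder name ".Junk", the helper name learn_and_move, and the overridable LEARN_CMD variable are all made up for the example. Messages are moved only if the learner succeeds, and they stay in the same sub-folder ("cur" or "new") on the destination side.

```shell
# Learner command; override with LEARN_CMD=true for a dry run without sa-learn.
LEARN_CMD=${LEARN_CMD:-"sa-learn --spam --no-sync"}

# learn_and_move SRC_FOLDER DEST_FOLDER
# Feeds each message in SRC_FOLDER/cur and SRC_FOLDER/new to the learner and,
# on success, moves it to the matching sub-folder of DEST_FOLDER.
learn_and_move() {
    src=$1
    dest=$2
    for sub in cur new; do
        [ -d "$dest/$sub" ] || continue        # destination must already exist
        for msg in "$src/$sub"/*; do
            [ -f "$msg" ] || continue          # skip an unmatched glob
            # Word splitting of LEARN_CMD is intentional here.
            if $LEARN_CMD "$msg"; then
                mv "$msg" "$dest/$sub/"
            fi
        done
    done
}
```

Inside the per-user loop it would be called along the lines of learn_and_move "$maildir/.Junk" "$maildir/.junkmail", followed by the usual "sa-learn --sync" at the end of the run.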

Thanks for the feedback!

Knuddi
« Reply #5 on: September 30, 2009, 08:17:21 AM »
As far as I can tell from reading, it performs sa-learn through SA's socket interface based on user feedback. So if the user classifies an email as spam and presses that button, it will learn that email as spam (and remove it). As I see it, it doesn't run through the mails in the Inbox or junkmail folder.
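As a rough illustration of training over spamd's socket (this is the stock spamc client, not SpamAssassin Coach itself): spamc can send a message for learning with -L, provided spamd was started with --allow-tell and accepts connections from the client's address (its -i and -A options). How those spamd options are set on an SME Server is not covered here. The helper name tell_spamd, the DRY_RUN switch, and the message path are made up for this sketch; "server.sme.org" is the example hostname from the first post.

```shell
# tell_spamd HOST CLASS FILE
# CLASS is "spam" or "ham". With DRY_RUN=1 the command is only printed.
tell_spamd() {
    host=$1
    class=$2
    file=$3
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "spamc -d $host -L $class < $file"
    else
        spamc -d "$host" -L "$class" < "$file"
    fi
}

# Dry run: show the command that would be sent for one message.
DRY_RUN=1 tell_spamd server.sme.org spam /tmp/example.eml
```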

/Jesper