Summary

As soon as we moved our spam filtering from a milter-based setup to an MDA-based setup using maildrop/mailfilter calling spamc, we started having odd timeout problems that caused mail to be bounced back to the sender. Why? We had been using maildrop/mailfilter for years without any problems, and the only changes in our setup were to deactivate the milter in postfix on the MX servers and to add the spamc processing to the global mailfilter config for the users on the back-end storage hosts. So what is causing this problem? This document details the setup involved and the tests devised to try to identify where the problem originates.

Description of problem

People were contacting us to tell us that they were receiving bounced messages when they tried to send to users. These bounced messages looked like this:

Command died with status 100: "/usr/local/bin/maildrop_wrapper". Command output: maildrop: Timeout quota exceeded.

The maildrop_wrapper is a very simple script that calls maildrop with the message information passed as arguments and then, depending on the result code, exits with a 100 or not. When we deliver to a user who is over quota, maildrop returns a 77 (and in some cases a 75). Postfix treats this as EX_TEMPFAIL, puts the message in the deferred queue, retries over and over for 4 days, and eventually returns it to the original sender. However, we do not want these messages sitting in our deferred queue, so we return a 100 in these cases; Postfix then sees the delivery as a permanent failure and bounces the message back to the original sender immediately. If we didn't do this we would fill our queues with mail for people over quota, given the large number of people we handle mail for. This wrapper looks like this:

codetitle. /usr/local/bin/maildrop_wrapper

#!/bin/sh

# Map maildrop's temporary-failure codes (77 and 75, returned for
# over-quota users) to a permanent 100; pass other codes through.
# Note: [ ] is used rather than [[ ]], which is a bashism that
# /bin/sh (dash on Debian) does not support.
/usr/bin/maildrop "$1" "$2" "$3" "$4" "$5"
rc=$?
if [ "$rc" = 77 ] || [ "$rc" = 75 ]; then
 exit 100
fi
exit $rc
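To sanity-check this exit-code mapping without doing a real delivery, the wrapper's logic can be exercised against a stub that stands in for the maildrop binary (this harness is purely hypothetical, not part of our setup):

```shell
#!/bin/sh
# Stand-in for /usr/bin/maildrop that exits with whatever code we ask for.
stubdir=$(mktemp -d)
cat > "$stubdir/maildrop" <<'EOF'
#!/bin/sh
exit "$MOCK_RC"
EOF
chmod +x "$stubdir/maildrop"

# Same mapping the wrapper performs: 77/75 (tempfail) become 100.
map_rc() {
    MOCK_RC=$1 "$stubdir/maildrop"
    rc=$?
    if [ "$rc" = 77 ] || [ "$rc" = 75 ]; then
        rc=100
    fi
    echo "$rc"
}

map_rc 0    # normal delivery stays 0
map_rc 77   # over quota is mapped to 100
map_rc 75   # other tempfail is mapped to 100
```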

Typically, when a user is over quota, the sender of the email receives the same bounce message as above, except the word "Timeout" is not included. The word "Timeout" is added by maildrop. It comes from maildrop/globaltimer.C and is the result of maildrop's maximum timer being reached (a compile-time value, --enable-global-timeout, which defaults to 300 seconds):

codetitle. maildrop/globaltimer.C

static const char msg[]="maildrop: Timeout quota exceeded.\n";

void GlobalTimer::handler()
{
        ExitTrap::onexit();
        write(2, msg, sizeof(msg)-1);
        _exit(EX_TEMPFAIL);
}

So what is happening is this: maildrop tries to deliver the message to the user. This delivery runs the message through the global /etc/maildroprc, which pipes it through spamc to determine whether it is spam, and then decides what to do with it.

codetitle. /etc/maildroprc

exception {
  if ( $SIZE < 250000 )
  {
    `logger -t maildrop -pmail.info "Checking for spam: $FR $TO"`
    xfilter "/usr/bin/spamc -d spamd2-pn -t 100 -u $LOGNAME"
  }
  if (/^X-Spam-Flag: YES/)
  {
    exception {
      `logger -t maildrop -pmail.info "Found spam: $FR $TO ($SUB)"`
      to "$DEFAULT/.Spam/"
    }
  }
  else   
  {
    `logger -t maildrop -pmail.info "Delivering: $FR $TO $DEFAULT"`
  }
}
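The control flow above amounts to a size gate followed by a header check. As a rough illustration only (maildrop's filtering language is what actually runs), the routing decision looks like this in shell:

```shell
#!/bin/sh
# route_msg: read a message on stdin and print where the maildroprc
# above would deliver it. The real config first pipes messages under
# 250000 bytes through spamc (xfilter); that step is elided here.
route_msg() {
    if grep -q '^X-Spam-Flag: YES'; then
        echo "Spam folder"
    else
        echo "inbox"
    fi
}

printf 'X-Spam-Flag: YES\n\nbody\n' | route_msg    # Spam folder
printf 'Subject: hello\n\nbody\n' | route_msg      # inbox
```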

What seemed to be happening is that the message would be sent to maildrop, which would try to scan it for spam; somewhere in this delivery phase it would take longer than 300 seconds, triggering maildrop's global timer and exiting with EX_TEMPFAIL. The wrapper would then turn that into a permanent error 100, and the message would be bounced back to the sender.

At first I was convinced the problem had to do with the bayes DB: it was using MyISAM tables, which are slow due to table-level locking, and because we handle a lot of mail the tables were getting locked. So we switched to InnoDB and things seemed a little faster, but the problem was still happening. On further examination, I became convinced that the problem had to do with improper table locking during SA auto-expire periods, and as a result I dug deep into the SA SQL and submitted a bug to enhance the query so it can use indexes.

Even after all this I was getting reports from people who received bounce messages containing the default maildrop timeout error. We run spamc with -t 100 and expected this to mean that if the message wasn't returned from spamd after 100 seconds, we would simply accept it without any spam scanning. However, things seemed to be taking far longer than 100 seconds (3x as long, enough to hit the maildrop timeout), so our theory was that -t wasn't working properly.

Because of these incorrect bounces we were failing to deliver legitimate email, so we turned off spamassassin and began digging deeper to determine what was causing this.

As a result, I’ve formulated and executed the following tests.

We need to run spamc in a number of failure situations to see if the -t (timeout) option is being honored; simulating the potential problem scenarios is the only way to determine where the problem lies.

test setup

We will use the unused spamd3 vserver (located on heron). We will stop puppet from running so that it won't revert any changes we make to the spam configs (this means shutting down the daemon, but also removing /etc/cron.hourly/puppet). We will test by injecting messages from cormorant over the private network (-d spamd3-pn). Because these messages come from our internal network, we have to turn off the rule short-circuiting that bypasses scanning for pretty much everything coming from our own machines (comment out the loadplugin Mail::SpamAssassin::Plugin::Shortcircuit line in /etc/spamassassin/v320.pre and restart spamd).

Because two separate systems are involved, and some of the tests need a second command run immediately after the first is typed in the other window, switching windows introduces a little bit of delay overhead. This can be eliminated by using cssh (clusterssh) to connect to both machines, preparing each statement on each machine, and then hitting 'enter' in cssh to execute both commands at the same time.

A typical test would look like this:

time /usr/bin/spamc -d spamd3-pn -t 10 -u micah < /usr/share/doc/spamassassin/examples/sample-spam.txt

spam timeout tests

spamd isn’t running

Result: regardless of whether -t 10 or -t 2 is passed to spamc, the message is returned unprocessed in 0m3.034s. The -t timeout is never reached because the connection is refused immediately; the ~3 seconds appears to come from spamc's internal connection retries.

spamd is being shutdown

Result: this test could only be run with cssh, because spamassassin stops so fast that I couldn't otherwise issue the spamc test in the other window in time. The results are the same as 'spamd isn't running'.

spamd is being started

Results:

If the message is sent via spamc before spamd has had enough time to begin listening, the results are the same as 'spamd isn't running': the message returns in 3 seconds.

If the daemon has managed to get its socket up and listening but hasn't finished preparing the rules, then spamc will return in the amount of time specified by -t, unless startup finishes and the processing completes before the -t timeout is reached. For example, if I use -t 20, start up spamd, and send the spamc request immediately after it has had a half-second to set up its socket, the message is returned in 12 seconds, scanned for spam. If I use -t 10 in the same scenario, the timeout is reached before spamassassin is able to finish starting up, so the message is returned in 10 seconds without being scanned.

spamd can’t spawn more processes (overloaded)

For this test we set the spamd flags --max-children=1 (default 5) and --max-conn-per-child=1 (default 200), then send multiple spamc requests at the same time.
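On our Debian systems spamd's flags end up in the init configuration; a sketch of this overload setup, assuming the stock /etc/default/spamassassin mechanism:

```
# /etc/default/spamassassin (assumed location for spamd's flags)
OPTIONS="--max-children=1 --max-conn-per-child=1"
```

followed by an /etc/init.d/spamassassin restart to pick up the change.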

Results:

Both spamc requests were sent at the same time; one returned in 1 second, the second a second later. I then set up 12 individual cormorant windows over cssh and sent the same spamc request from each with -t 15. These were the timing results:

real    0m1.267s
real    0m2.567s
real    0m3.986s
real    0m5.178s
real    0m6.461s
real    0m7.914s
real    0m9.090s
real    0m10.361s
real    0m11.738s
real    0m13.377s
real    0m15.033s -- returned un-scanned
real    0m15.026s -- returned un-scanned

spamd logs look like this (these logs are from the two-request test, not the twelve-request one):

Sep 28 09:25:48 spamd3 spamd[9643]: spamd: connection from 10.0.1.15 [10.0.1.15] at port 33558 
Sep 28 09:25:48 spamd3 spamd[9643]: spamd: connection from 10.0.1.15 [10.0.1.15] at port 33558 
Sep 28 09:25:48 spamd3 spamd[9643]: spamd: processing message <GTUBE1.1010101@example.net> for micah:65534 
Sep 28 09:25:49 spamd3 spamd[9643]: spamd: identified spam (1003.6/8.0) for micah:65534 in 1.1 seconds, 799 bytes. 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: child states: B 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: server reached --max-children setting, consider raising it 
Sep 28 09:25:49 spamd3 spamd[9641]: spamd: handled cleanup of child pid 9643 due to SIGCHLD 
Sep 28 09:25:49 spamd3 spamd[9641]: spamd: server successfully spawned child process, pid 9652 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: child states: B 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: server reached --max-children setting, consider raising it 
Sep 28 09:25:49 spamd3 spamd[9652]: spamd: connection from 10.0.1.15 [10.0.1.15] at port 33559 
Sep 28 09:25:49 spamd3 spamd[9652]: spamd: processing message <GTUBE1.1010101@example.net> for micah:65534 
Sep 28 09:25:50 spamd3 spamd[9652]: spamd: identified spam (1003.6/8.0) for micah:65534 in 1.1 seconds, 799 bytes. 
Sep 28 09:25:50 spamd3 spamd[9641]: prefork: child states: B 
Sep 28 09:25:50 spamd3 spamd[9641]: prefork: server reached --max-children setting, consider raising it 
Sep 28 09:25:50 spamd3 spamd[9641]: spamd: handled cleanup of child pid 9652 due to SIGCHLD 
Sep 28 09:25:50 spamd3 spamd[9641]: spamd: server successfully spawned child process, pid 9654 
Sep 28 09:25:50 spamd3 spamd[9641]: prefork: child states: I 

According to the spamd man page, --max-children works like this:

Incoming connections can still occur if all of the children are busy, however those connections will be queued waiting for a free child . . . Please note that there is a OS specific maximum of connections that can be queued (Try “perl -MSocket -e’print SOMAXCONN’” to find this maximum).

This number is 128, but it can be tweaked by echoing a new value to /proc/sys/net/core/somaxconn. Since it's likely this limit will be hit, we need to test the scenario where it is maxed out, so we set it to 5 and retry the tests:

echo "5" > /proc/sys/net/core/somaxconn

I then issued 15 simultaneous connections with the -t value set to 15. Each connection took one second longer than the previous, as before, and those that exceeded the -t value were returned unscanned, as before. What puzzles me is that all of the connections behaved exactly as they did when SOMAXCONN was set to 128. Since every connection arrives at the same time, I would expect the first one to be accepted by spamd, 5 connections to be queued, and the 6th and later connections to fail somehow; but they don't, it's the same as before… odd.

And then I found this bug report, which indicates that Socket::SOMAXCONN() returns 128 no matter what value is set in /proc! Reading it a little closer, I realize that if I set the kernel value lower, perl will still think it is 128, but the kernel will act according to what the variable is actually set to.
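This is easy to see directly: the constant perl compiled in and the live kernel value can be compared side by side (assuming a Linux /proc and a perl with the standard Socket module):

```shell
# What perl thinks the listen-backlog limit is (a compile-time constant):
perl -MSocket -e 'print SOMAXCONN, "\n"'
# What the kernel is actually enforcing right now:
cat /proc/sys/net/core/somaxconn
```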

So maybe my understanding of somaxconn isn't right. Reading more about it: it controls the number of connections that can be in the state where the socket is open on the remote end but the server hasn't accept()ed the connection yet. Once the connection is accepted, it no longer counts against the limit.

Since spamassassin accepts connections promptly, it will be hard to reach this limit unless we somehow generate a serious flood of requests (which is exactly what happens on our mail system), so we need to write some kind of program that can do this.

So I wrote this:

#!/bin/sh

# Fire off 3000 concurrent spamc requests, appending each timing
# result to /tmp/blah.
for number in `seq 1 3000`
do
        ( time /usr/bin/spamc -d spamd3-pn -t 15 -u micah < /usr/share/doc/spamassassin/examples/sample-spam.txt ) >> /tmp/blah 2>&1 &
done

which spawns 3000 spamc's to scan against the spamd, which is still set to --max-children=1 and --max-conn-per-child=1. I run this, and the results show:

% grep 'X-Spam-Flag: YES' /tmp/blah |wc
     58     116     986
% grep Message-ID /tmp/blah |wc
   3000    6000  123000
% grep real /tmp/blah |wc
   3000    6000   44993

This shows that only 58 of them came back with scan results, but all 3000 returned a message.
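To get more than a pass/fail count out of /tmp/blah, the `real` lines can be reduced to a quick summary. A hypothetical helper (it assumes the `NmN.NNNs` format that `time` writes):

```shell
#!/bin/sh
# summarize: print count, average, and maximum of the `real NmN.NNNs`
# timings found in a log file.
summarize() {
    grep real "$1" \
      | sed 's/real[[:space:]]*//; s/m/ /; s/s$//' \
      | awk '{ t = $1 * 60 + $2; sum += t; n++; if (t > max) max = t }
             END { if (n) printf "n=%d avg=%.2fs max=%.2fs\n", n, sum / n, max }'
}

# Example: summarize /tmp/blah
```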

database tests

To do these tests we need to set up a DB in a non-production environment so we can tweak settings without impacting the running database server. To better mimic the production environment, the DB server should be on its own server, spamd on its own server, and spamc coming from a third. To do this, I set up a vserver called "spamd_test" on loon (root@loon# vserver spamd_test build -m debootstrap --context 40 --hostname spamd_test --interface eth2:204.13.164.88 --interface eth3:10.0.2.1 -- -d etch), then installed the same version of mysql-server that our DB server has (5.0.45-1~bpo.1 from backports.org; as of this writing this is the most current stable release of mysql). I then brought over the my.cnf from our DB environment, along with the bayes.sql and nest.sql dumps, and loaded them into this DB. It turned out the bayes.sql dump created MyISAM tables instead of what we have on the primary DB (InnoDB), so I converted each one with "ALTER TABLE <tablename> ENGINE=InnoDB;", which takes a while.
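The conversion has to be run once per table, and the statements can be generated rather than typed. The table names below are assumed from the stock SpamAssassin 3.x bayes SQL schema, so verify them against the actual dump:

```shell
#!/bin/sh
# Emit one ALTER statement per bayes table; pipe the output into
# `mysql bayes` to perform the conversion.
for t in bayes_expire bayes_global_vars bayes_seen bayes_token bayes_vars; do
    echo "ALTER TABLE $t ENGINE=InnoDB;"
done
```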

Once these are loaded, I had to mess with the mysql user table to allow the user to connect from the new server to the DB, and I also changed sql.cf and local.cf mysql settings to use the new server.

We have two databases being queried: one holds the per-user spam settings, and one is the bayes database. We can reliably assume that the DB is contacted to query the per-user settings on every spamc invocation, but how can we reliably test against the bayes DB? There are two bayes scenarios. First, auto-learning will classify a message as spam/ham when it meets particular criteria, and when it does so it inserts tokens into the bayes_token table; I am not sure we can force that in our tests. The second bayes scenario is checking the bayes probability of a message; this is a read-only test, and I think it is run whenever the bayes DB is available, so we can rely on it happening every time.

maximum number of connections to DB reached

In mysql there is a variable that caps the total number of simultaneous client connections; once it is reached, strange things happen. What if we artificially set this low and then try to exceed it?

Setup: change /etc/mysql/my.cnf to have

max_connections = 5

and restart mysql, then flood spamd with thousands of requests in the same way as in the "spamd can't spawn more processes" test.

The first test uses the same --max-conn-per-child and --max-children settings as the "spamd can't spawn more processes" test.

Results:

root@cormorant:~# !1724
grep Message-ID /tmp/blah |wc
   3000    6000  123000
root@cormorant:~# !1718
grep 'X-Spam-Flag: YES' /tmp/blah |wc
      8      16     136

Not many were returned scanned by spamassassin, but at least all messages were returned. The longest time to return was 1 minute and 10 seconds (even though the spamc timeout was set to 15 seconds).

Strangely, the logs on the spamd side still show results being returned, yet our logfile isn't growing. It seems a lot of these requests were queued up on the spamd side; they get processed, but spamc has already closed the socket, so although spamd continues processing, it can't return the results. In fact, on the spamd side we see these logs:

Oct  1 10:08:13 spamd3 spamd[27386]: spamd: timeout: (30 second socket timeout reading input from client) at /usr/sbin/spamd line 2000. 
Oct  1 10:08:13 spamd3 spamd[27322]: prefork: child states: III 
Oct  1 10:08:13 spamd3 spamd[27322]: prefork: child states: IIK 
Oct  1 10:08:13 spamd3 spamd[27322]: spamd: handled cleanup of child pid 27386 due to SIGCHLD 

The second test I performed was to increase the maximum children to 50 and then re-run the flood:

root@cormorant:~# grep 'X-Spam-Flag: YES' /tmp/blah |wc
     32      64     544
root@cormorant:~# grep Message-ID /tmp/blah |wc
   3000    6000  123000

The longest time to return was 59 seconds.

The spamd logs have this:

Oct  1 10:04:50 spamd3 spamd[27344]: config: failed to load user (micah) scores from SQL database: config: SQL error: Too many connections 
Oct  1 10:04:50 spamd3 spamd[27344]: spamd: service unavailable: Error fetching user preferences via SQL at /usr/sbin/spamd line 1993, <GEN51> line 2. 

what if the DB is available but really slow (high load, tables locked)

I’m not sure how to setup a test where the load is really high on the DB server.

To artificially lock the tables we can do a:

FLUSH TABLES WITH READ LOCK

This takes a global read lock, which blocks writes to all tables while still allowing reads; but how do we keep reads from happening too?
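One detail worth remembering (a property of MySQL locking, not something tested above): the lock only lasts as long as the session that took it, so it has to be held from an open mysql prompt rather than a one-shot `mysql -e` invocation:

```
mysql> FLUSH TABLES WITH READ LOCK;
-- run the spamc tests from the other window while this session stays open
mysql> UNLOCK TABLES;
```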

Results: 15 seconds elapse and then the message is returned without being scanned. On the database side, spamd's attempt to update the bayes_token information blocks waiting for the read lock to be released:

mysql> show processlist\g
+------+---------+-----------------+-------+---------+------+--------
| Id   | User    | Host            | db    | Command | Time | State                           | Info                                                                                                 |
+------+---------+-----------------+-------+---------+------+--------
| 1604 | root    | localhost       | nest  | Query   |    0 | NULL                            | show processlist                                                                                     | 
| 1606 | spamass | 10.0.1.40:57210 | bayes | Query   |    8 | Waiting for release of readlock | UPDATE bayes_token SET atime = '1058995800' WHERE id = '4' AND token IN ('e??<?','???#N','G9??M','?
+------+---------+-----------------+-------+---------+------+--------

Just for kicks, I left the read lock on this test database and checked some time later to see if the process was still hanging around; it turned out it was:

| 1606 | spamass | 10.0.1.40:57210 | bayes | Query   | 23852 | Waiting for release of readlock | UPDATE bayes_token SET atime = '1058995800' WHERE id = '4' AND token IN ('e??<?','???#N','G9??M','?

That's 23852 seconds… and it actually waited longer than that, because I let it sit. Eventually it timed out, but that seems like an awfully long time, especially since the spamc process that instigated it died off ages ago. It might be good to figure out which mysql variable sets the maximum time to wait for a lock like this, and ratchet that down a little.

per-user prefs/bayes DB is unreachable (rejected/drop packets)

Results:

DB connection goes away in the middle of doing things

Results:

DNS resolving is broken

We set this one up by changing the resolver on the spamd server to point to somewhere that doesn’t respond:

# cat /etc/resolv.conf
nameserver 127.0.0.2
# host google.com
;; connection timed out; no servers could be reached
# /etc/init.d/spamassassin restart

Results: the spamd process fails to contact the database because it can't look up the database host name we have configured; the message is returned unscanned immediately, before any timeout can be reached.

The logs show this:

Sep 28 11:41:19 spamd3 spamd[3326]: spamd: connection from 10.0.1.15 [10.0.1.15] at port 47354 
Sep 28 11:41:19 spamd3 spamd[3326]: config: failed to load user (micah) scores from SQL database: config: SQL error: Unknown MySQL server host 'dbw-pn' (2) 
Sep 28 11:41:19 spamd3 spamd[3326]: spamd: service unavailable: Error fetching user preferences via SQL at /usr/sbin/spamd line 1993, <GEN14> line 2. 

So let's change the DB host to use the IP and redo the test: we change user_scores_dsn in /etc/spamassassin/sql.cf and bayes_sql_dsn in /etc/spamassassin/local.cf to use the IP address, then restart spamassassin.

Results: Message is returned scanned, logs show a failure to connect to razor:

Sep 28 11:48:04 spamd3 spamd[4012]: razor2: razor2 check failed: Invalid argument razor2: razor2 had unknown error during get_server_info at /usr/share/perl5/Mail/SpamAssassin/Plugin/Razor2.pm line 188. at /usr/share/perl5/Mail/SpamAssassin/Plugin/Razor2.pm line 326. 

What if we set the resolver to have a secondary nameserver that works, so it looks like:

nameserver 127.0.0.2
nameserver 4.2.2.2

We restart spamassassin to pick these up (it seems to cache the nameserver information, so it needs a restart).

Results: the message is returned, scanned properly for spam, but it takes 1 second longer, since spamd has to wait for the first resolver to time out before it tries the second.

high-fructose lasers are cutting through the blast doors

Results: ascii art dolphins save the day!!!

timeout flags

Spamd has a couple timeout flags that might be useful for our situation:

       
       --timeout-tcp=number
           This option specifies the number of seconds to wait for headers
           from a client (spamc) before closing the connection.  The minimum
           value is 1, the default value is 30, and a value of 0 will disable
           socket timeouts completely.

       --timeout-child=number
           This option specifies the number of seconds to wait for a spamd
           child to process or check a message.  The minimum value is 1,
           the default value is 300, and a value of 0 will disable child
           timeouts completely.

Spamc has the following timeout option:

       -t timeout, --timeout=timeout
           Set the timeout for spamc-to-spamd communications (default: 600, 0
           disables).  If spamd takes longer than this many seconds to reply
           to a message, spamc will abort the connection and treat this as a
           failure to connect; in other words the message will be returned
           unprocessed.