Spam timeout tests

What is causing the timeouts? What are some tests we can run to figure this out?

Summary

As soon as we moved our spam filtering from a milter-based setup to an MDA-based setup using maildrop/mailfilter calling spamc, we started having odd timeout problems that caused mail to be bounced back to the sender. Why? We have been using maildrop/mailfilter for years without any problems, and the only change in our setup was to deactivate the milter in postfix on the MX servers and add the spamc processing to the global mailfilter config for users on the back-end storage hosts. So what is causing this problem? This document details the setup involved, and the tests devised to try to identify where the problem originates.

Description of problem

People were contacting us to tell us that they were receiving bounced messages when they tried to send to users. These bounced messages looked like this:

Command died with status 100: "/usr/local/bin/maildrop_wrapper". Command output: maildrop: Timeout quota exceeded.

The maildrop_wrapper is a very simple script which calls maildrop with the message information passed as arguments and then, depending on the result code, exits with a 100 or not. When we deliver to a user who is over quota, maildrop returns a 77 (and in some cases a 75); postfix considers this an EX_TEMPFAIL, puts the message in the deferred queue, retries over and over for 4 days, and then eventually returns it to the original sender. However, because we do not want these messages queued up in our deferral queue, we want to return a 100 in these cases, so that postfix sees the delivery as a permanent failure and the message is bounced back to the original sender right away. If we don't do this we fill our queues with mail for people over quota, due to the large number of people we are handling mail for. The wrapper looks like this:

codetitle. /usr/local/bin/maildrop_wrapper

#!/bin/sh

/usr/bin/maildrop $1 $2 $3 $4 $5
if [[ "$?" == "77" || "$?" == "75" ]]; then
 exit 100
fi

Typically when a user is over quota the sender of the email will receive the same bounce error message as above, except the word "Timeout" is not included. The word Timeout is put in there by maildrop. It comes from maildrop/globaltimer.C and is a result of maildrop's maximum timer being reached (this is a compile-time value, --enable-global-timeout, with the default set to 300 seconds):

codetitle. maildrop/globaltimer.C

static const char msg[]="maildrop: Timeout quota exceeded.\n";

void GlobalTimer::handler()
{
        ExitTrap::onexit();
        write(2, msg, sizeof(msg)-1);
        _exit(EX_TEMPFAIL);
}

So what is happening is that maildrop tries to deliver the message to the user. It does this by sending the message through the global /etc/maildroprc, which passes the message to spamc to determine whether it is spam and then decides what to do with it.

codetitle. /etc/maildroprc

exception {
  if ( $SIZE < 250000 )
  {
    `logger -t maildrop -pmail.info "Checking for spam: $FR $TO"`
    xfilter "/usr/bin/spamc -d spamd2-pn -t 100 -u $LOGNAME"
  }
  if (/^X-Spam-Flag: YES/)
  {
    exception {
      `logger -t maildrop -pmail.info "Found spam: $FR $TO ($SUB)"`
       to "$DEFAULT/.Spam/"
    }
  }
  else   
  {
    `logger -t maildrop -pmail.info "Delivering: $FR $TO $DEFAULT"`
  }
}

What seemed to be happening is that the message would be sent to maildrop, maildrop would try to scan it for spam, and somewhere in this delivery phase it would take longer than 300 seconds, triggering maildrop's global timer. Maildrop would then exit with an EX_TEMPFAIL, which the wrapper turns into a permanent error (100), and the message would be bounced back to the sender.

At first I was convinced the problem had to do with the bayes DB: it was using MyISAM tables, which are slow due to table-level locking, and since we handle a lot of mail the tables were getting locked. So we switched to InnoDB and things seemed a little faster, but the problem was still happening. On further examination I became convinced that the problem had to do with improper table locking during SA auto-expire periods, and as a result I dug deep into the SA SQL and submitted a bug to enhance the query so it can use indexes.

Even after all this I was getting reports from people who received bounced messages containing the default maildrop timeout error. We run spamc with -t 100 and expected that this meant that if the message wasn't returned from spamd after 100 seconds, we would simply accept the message without any spam scanning. However, things seemed to be taking far longer than 100 seconds (3x as long, enough to hit the maildrop timeout), so our theory was that -t wasn't working properly.

Because of these incorrect bounces we were not delivering legitimate email, so we turned off spamassassin and began digging deeper to try to determine what was causing this.

As a result, I’ve formulated and executed the following tests.

We need to run spamc in a number of failure situations to see if the -t (timeout) option is being honored; simulating the potential problem scenarios is the only way we can determine where the problem is.

test setup

We will use the unused spamd3 vserver (located on heron). We will stop puppet from running so that it won't revert any changes we make to the spam configs (this means shutting down the daemon, but also removing /etc/cron.hourly/puppet). We will test by injecting messages from cormorant over the private network (-d spamd3-pn). Because these messages are coming from our internal network, we have to turn off the rule short-circuiting that bypasses pretty much everything coming from our own machines (comment out the loadplugin Mail::SpamAssassin::Plugin::Shortcircuit line in /etc/spamassassin/v320.pre and restart spamd).
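Roughly, the setup steps on spamd3 look like this (a sketch only; the puppet init script name and the sed expression are assumptions, and any equivalent way of commenting out that loadplugin line works):

# keep puppet from reverting our changes to the spam configs
/etc/init.d/puppet stop
rm /etc/cron.hourly/puppet

# disable the short-circuit rules for mail from our own machines
sed -i 's|^loadplugin Mail::SpamAssassin::Plugin::Shortcircuit|#&|' /etc/spamassassin/v320.pre
/etc/init.d/spamassassin restart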

Because there are two separate systems involved, and some of the tests require a command to be run in one window immediately after a command is typed in the other, there is a little bit of delay overhead from switching windows. This can be eliminated by using cssh (clusterssh) to connect to both machines, preparing each statement in each window, and then hitting 'enter' in cssh to execute both commands at the same time.

A typical test would look like this:

time /usr/bin/spamc -d spamd3-pn -t 10 -u micah < /usr/share/doc/spamassassin/examples/sample-spam.txt

spam timeout tests

spamd isn’t running

Result: regardless of whether -t 10 or -t 2 is passed to spamc, the message is returned unprocessed in 0m3.034s.

spamd is being shut down

Result: This test could only be run with cssh because spamassassin stops so fast that I can't issue the spamc test in the other window in time.

Results: the same as ‘spamd isn’t running’

spamd is being started

Results:

If the message is sent over spamc before spamd has had enough time to begin listening, the results are the same as 'spamd isn't running' and it returns in 3 seconds.

If the daemon has managed to get the socket up and listening but hasn't finished preparing the rules, then spamc will return in the amount of time specified by -t, unless startup finishes and the processing can happen before the -t timeout is reached. For example, if I use -t 20, start up spamd, and send the spamc request immediately after it has had a half-second to set up the socket, the message is returned, scanned for spam, in 12 seconds. If I use -t 10 in the same scenario, the timeout is reached before spamassassin is able to start up, so the message is returned in 10 seconds without being scanned.

spamd can’t spawn more processes (overloaded)

For this test we set the spamd flags --max-children=1 (default 5) and --max-conn-per-child=1 (default 200), then send multiple spamc requests at the same time.
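For reference, a sketch of how these flags were set, assuming they are passed via the OPTIONS line in /etc/default/spamassassin as on a stock Debian install (any flags already in use there are omitted):

# /etc/default/spamassassin
OPTIONS="--max-children=1 --max-conn-per-child=1"

After editing, restart spamd (/etc/init.d/spamassassin restart) so the new limits take effect.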

Results:

Both spamc requests were sent at the same time; one returned in 1 second, the other a second later. I then set up 12 individual cormorant windows over cssh and sent the same spamc request with -t 15 from each; these were the timing results:

real    0m1.267s
real    0m2.567s
real    0m3.986s
real    0m5.178s
real    0m6.461s
real    0m7.914s
real    0m9.090s
real    0m10.361s
real    0m11.738s
real    0m13.377s
real    0m15.033s -- returned un-scanned
real    0m15.026s -- returned un-scanned

spamd logs look like this (these logs are from the two-request test, not the 12-connection one):

Sep 28 09:25:48 spamd3 spamd[9643]: spamd: connection from 10.0.1.15 [10.0.1.15] at port 33558 
Sep 28 09:25:48 spamd3 spamd[9643]: spamd: processing message <GTUBE1.1010101@example.net> for micah:65534 
Sep 28 09:25:49 spamd3 spamd[9643]: spamd: identified spam (1003.6/8.0) for micah:65534 in 1.1 seconds, 799 bytes. 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: child states: B 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: server reached --max-children setting, consider raising it 
Sep 28 09:25:49 spamd3 spamd[9641]: spamd: handled cleanup of child pid 9643 due to SIGCHLD 
Sep 28 09:25:49 spamd3 spamd[9641]: spamd: server successfully spawned child process, pid 9652 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: child states: B 
Sep 28 09:25:49 spamd3 spamd[9641]: prefork: server reached --max-children setting, consider raising it 
Sep 28 09:25:49 spamd3 spamd[9652]: spamd: connection from 10.0.1.15 [10.0.1.15] at port 33559 
Sep 28 09:25:49 spamd3 spamd[9652]: spamd: processing message <GTUBE1.1010101@example.net> for micah:65534 
Sep 28 09:25:50 spamd3 spamd[9652]: spamd: identified spam (1003.6/8.0) for micah:65534 in 1.1 seconds, 799 bytes. 
Sep 28 09:25:50 spamd3 spamd[9641]: prefork: child states: B 
Sep 28 09:25:50 spamd3 spamd[9641]: prefork: server reached --max-children setting, consider raising it 
Sep 28 09:25:50 spamd3 spamd[9641]: spamd: handled cleanup of child pid 9652 due to SIGCHLD 
Sep 28 09:25:50 spamd3 spamd[9641]: spamd: server successfully spawned child process, pid 9654 
Sep 28 09:25:50 spamd3 spamd[9641]: prefork: child states: I 

According to the spamd man page, --max-children works like this:

Incoming connections can still occur if all of the children are busy, however those connections will be queued waiting for a free child . . . Please note that there is a OS specific maximum of connections that can be queued (Try “perl -MSocket -e’print SOMAXCONN’” to find this maximum).

This number is 128, but it can be tweaked by echoing a new value to /proc/sys/net/core/somaxconn. Since it's likely this limit will be hit, we need to test the scenario where it is maxed out, so we will set it to a low value and retry the tests:

echo "5" > /proc/sys/net/core/somaxconn

I then issued 15 simultaneous connections with the -t value set to 15. Each connection took one second longer than the previous, as before, and those that took longer than the -t value were returned unscanned, as before. What puzzles me, however, is that all of the connections behaved exactly as they did when SOMAXCONN was set to 128. Since each connection comes in at exactly the same time, I would expect the first one to be accepted by spamd, 5 connections to be queued up, and then the 6th and later connections to not be queued and something to happen to them, but nothing does; it's the same as before. Odd.

And then I found this bug report, which indicates that Socket::SOMAXCONN() returns 128 no matter what value is set in /proc! But reading it a little closer makes me realize that if I set the kernel value lower, perl will still think it is 128, but the kernel will act according to whatever the variable is actually set to.
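A quick way to see both halves of that (an illustrative check; the values shown are what the bug report and the echo above would lead us to expect):

perl -MSocket -e 'print SOMAXCONN, "\n"'   # still prints 128
cat /proc/sys/net/core/somaxconn           # shows the low value we echoed in above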

So maybe our understanding of somaxconn isn't right. Reading more about it: it controls the number of connections that can be in the state where the socket is open on the remote end but you haven't accept()ed the connection yet. Once you accept(), a connection doesn't count against it anymore.

Since spamassassin accepts connections promptly, it will be hard to reach this limit unless we generate a serious flood of requests, which is what happens with our real mail system, so we need to write some kind of program that can do this.

So I wrote this:

#!/bin/sh

# flood spamd with 3000 concurrent spamc requests, appending each request's
# timing and scan output to /tmp/blah
for number in `seq 1 3000`
do
        ( time /usr/bin/spamc -d spamd3-pn -t 15 -u micah < /usr/share/doc/spamassassin/examples/sample-spam.txt ) >> /tmp/blah 2>&1 &
done

which spawns 3000 spamc processes against the spamd, which is still set to --max-children=1 and --max-conn-per-child=1. I run this and the results show:

% grep 'X-Spam-Flag: YES' /tmp/blah |wc
     58     116     986
% grep Message-ID /tmp/blah |wc
   3000    6000  123000
% grep real /tmp/blah |wc
   3000    6000   44993

This shows that only 58 of them were returned with scanned results, but all 3000 returned a message.

database tests

To do these tests we need to set up a DB in a non-production environment so we can tweak these settings without impacting the current running database server. To better mimic the production environment, the DB server should be on its own server, spamd on its own server, and spamc coming from a third. To do this, I set up a vserver called "spamd_test" on loon (root@loon# vserver spamd_test build -m debootstrap --context 40 --hostname spamd_test --interface eth2:204.13.164.88 --interface eth3:10.0.2.1 -- -d etch), then installed the same version of mysql-server that our DB server has (5.0.45-1~bpo.1 from backports.org, which as of this writing is the most current stable release of mysql). I then brought over the my.cnf from our DB environment, plus the bayes.sql and nest.sql dumps, and loaded them into this DB. It turns out the bayes.sql dump created MyISAM tables instead of what we have on the primary DB (InnoDB), so I converted them with "ALTER TABLE <table> ENGINE=InnoDB;", which takes a while.
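A sketch of that conversion, run as a user that can reach the bayes database (the table names below are the ones from the stock SpamAssassin bayes MySQL schema and may need adjusting to match the dump; bayes_token in particular takes a while):

for t in bayes_expire bayes_global_vars bayes_seen bayes_token bayes_vars; do
    mysql bayes -e "ALTER TABLE $t ENGINE=InnoDB;"
done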

Once these were loaded, I had to adjust the mysql user table to allow the spamassassin user to connect to the DB from the new server, and I also changed the mysql settings in sql.cf and local.cf to point at the new server.
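Something along these lines (a sketch only; the actual change may have been direct edits to the mysql.user table, and the client IP and password here are placeholders; the IP matches the one that later shows up in the processlist output):

# let the spamassassin user connect from the spamd test host
mysql -e "GRANT ALL ON bayes.* TO 'spamass'@'10.0.1.40' IDENTIFIED BY 'changeme';"
mysql -e "GRANT ALL ON nest.* TO 'spamass'@'10.0.1.40' IDENTIFIED BY 'changeme';"
mysql -e "FLUSH PRIVILEGES;"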

We have two databases that are being queried: one holds the per-user spam settings, and one is the bayes database. We can reliably assume that the DB is contacted to query the per-user settings on every spamc invocation, but how can we reliably test against the bayes DB? There are two bayes scenarios. Auto-learning will classify a message as spam/ham depending on particular criteria, and when it decides to do so it inserts tokens into the bayes_token table, but I am not sure we can force that in our tests. The second scenario is checking the bayes probability of a message; this is read-only, and I think it is run whenever the bayes DB is available, so we can rely on it happening every time.

maximum number of connections to DB reached

In mysql there is a variable that sets the maximum number of simultaneous connections; after that is reached, strange things happen. What if we artificially set this low and then try to exceed it?

Setup: change /etc/mysql/my.cnf to have

max_connections = 5
and restart mysql, then flood spamd with thousands of requests in the same way as in the "spamd can't spawn more processes" test.

The first test is with the same max-conn and max-children settings from the "spamd can't spawn more processes" test.

Results:

root@cormorant:~# !1724
grep Message-ID /tmp/blah |wc
   3000    6000  123000
root@cormorant:~# !1718
grep 'X-Spam-Flag: YES' /tmp/blah |wc
      8      16     136

Not many were returned by spamassassin as scanned, but at least all messages were returned. The longest time to return was 1 minute and ten seconds (even though the spamc timeout was set to 15 seconds).

Strangely, I still see in the spamd logs that results are being returned, but our logfile isn't growing. It seems a lot of these requests were queued up on the spamd side and do get processed, but the socket has already been closed by spamc, so although processing continues on the spamd side, the results can't be returned. In fact, on the spamd side we see these logs:

Oct  1 10:08:13 spamd3 spamd[27386]: spamd: timeout: (30 second socket timeout reading input from client) at /usr/sbin/spamd line 2000. 
Oct  1 10:08:13 spamd3 spamd[27322]: prefork: child states: III 
Oct  1 10:08:13 spamd3 spamd[27322]: prefork: child states: IIK 
Oct  1 10:08:13 spamd3 spamd[27322]: spamd: handled cleanup of child pid 27386 due to SIGCHLD 

The second test I performed was to increase the maximum children to 50 and then re-run the flood:

root@cormorant:~# grep 'X-Spam-Flag: YES' /tmp/blah |wc
     32      64     544
root@cormorant:~# grep Message-ID /tmp/blah |wc
   3000    6000  123000

The longest time to return was 59 seconds.

The spamd logs have this:

Oct  1 10:04:50 spamd3 spamd[27344]: config: failed to load user (micah) scores from SQL database: config: SQL error: Too many connections 
Oct  1 10:04:50 spamd3 spamd[27344]: spamd: service unavailable: Error fetching user preferences via SQL at /usr/sbin/spamd line 1993, <GEN51> line 2. 

what if the DB is available but really slow (high load, tables locked)

I’m not sure how to setup a test where the load is really high on the DB server.

To artificially lock the tables we can do a:

FLUSH TABLES WITH READ LOCK

This will lock the tables and keep writes from happening, but how do we keep reads from happening?
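FLUSH TABLES WITH READ LOCK only blocks writers; one possible way to block readers as well (a sketch, not something actually run as part of these tests) is to hold a WRITE lock on a table from a throwaway session:

# hold a write lock on bayes_token for ten minutes; reads and writes from
# other sessions against that table will block until the lock is released
mysql bayes -e "LOCK TABLES bayes_token WRITE; SELECT SLEEP(600);"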

Results: 15 seconds elapse and then the message is returned without being scanned; on the database side the query hangs because it is trying to update the bayes_token information while the tables are locked:

mysql> show processlist\g
+------+---------+-----------------+-------+---------+------+--------
| Id   | User    | Host            | db    | Command | Time | State                           | Info                                                                                                 |
+------+---------+-----------------+-------+---------+------+--------
| 1604 | root    | localhost       | nest  | Query   |    0 | NULL                            | show processlist                                                                                     | 
| 1606 | spamass | 10.0.1.40:57210 | bayes | Query   |    8 | Waiting for release of readlock | UPDATE bayes_token SET atime = '1058995800' WHERE id = '4' AND token IN ('e??<?','???#N','G9??M','?
+------+---------+-----------------+-------+---------+------+--------

Just for kicks, I left the read lock in place on this test database and looked some time later to see if the process was still hanging around; it turned out it was:

| 1606 | spamass | 10.0.1.40:57210 | bayes | Query   | 23852 | Waiting for release of readlock | UPDATE bayes_token SET atime = '1058995800' WHERE id = '4' AND token IN ('e??<?','???#N','G9??M','?

That's 23852 seconds, and it actually waited longer than that because I let it sit. Eventually it timed out, but that seems like an awfully long time, especially since the spamc process that instigated the query died off ages ago. It might be good to figure out which mysql variable sets the maximum time to wait for a read lock, and ratchet that down a little.

per-user prefs/bayes DB is unreachable (rejected/dropped packets)

Results:

DB connection goes away in the middle of doing things

Results:

DNS resolving is broken

We set this one up by changing the resolver on the spamd server to point to somewhere that doesn’t respond:

# cat /etc/resolv.conf
nameserver 127.0.0.2
# host google.com
;; connection timed out; no servers could be reached
# /etc/init.d/spamassassin restart

Results: the spamd process fails to contact the database because it can't look up the database host name we have configured; the message is returned unscanned immediately, before any timeout can be reached.

The logs show this:

Sep 28 11:41:19 spamd3 spamd[3326]: spamd: connection from 10.0.1.15 [10.0.1.15] at port 47354 
Sep 28 11:41:19 spamd3 spamd[3326]: config: failed to load user (micah) scores from SQL database: config: SQL error: Unknown MySQL server host 'dbw-pn' (2) 
Sep 28 11:41:19 spamd3 spamd[3326]: spamd: service unavailable: Error fetching user preferences via SQL at /usr/sbin/spamd line 1993, <GEN14> line 2. 

So let's change the DB host to use the IP and redo the test: we change user_scores_dsn in /etc/spamassassin/sql.cf and bayes_sql_dsn in /etc/spamassassin/local.cf to use the IP, then restart spamassassin.
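The changed lines end up looking something like this (the IP below is a placeholder for the DB server's address, and the database names assume the per-user settings live in nest and the bayes data in bayes):

# /etc/spamassassin/sql.cf
user_scores_dsn  DBI:mysql:nest:10.0.1.100

# /etc/spamassassin/local.cf
bayes_sql_dsn    DBI:mysql:bayes:10.0.1.100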

Results: Message is returned scanned, logs show a failure to connect to razor:

Sep 28 11:48:04 spamd3 spamd[4012]: razor2: razor2 check failed: Invalid argument razor2: razor2 had unknown error during get_server_info at /usr/share/perl5/Mail/SpamAssassin/Plugin/Razor2.pm line 188. at /usr/share/perl5/Mail/SpamAssassin/Plugin/Razor2.pm line 326. 

What if we set the resolver to have a secondary nameserver that works, so it looks like:

nameserver 127.0.0.2
nameserver 4.2.2.2

We restart spamassassin to pick these up (it seems to cache the nameserver information, so it needs to be restarted).

Results: the message is returned, scanned properly for spam, but it takes 1 second longer because spamd has to wait for the first resolver to time out before it tries the second.

high-fructose lasers are cutting through the blast doors

Results: ascii art dolphins save the day!!!

timeout flags

Spamd has a couple of timeout flags that might be useful for our situation:

       
       --timeout-tcp=number
           This option specifies the number of seconds to wait for headers
           from a client (spamc) before closing the connection.  The minimum
           value is 1, the default value is 30, and a value of 0 will disable
           socket timeouts completely.

       --timeout-child=number
           This option specifies the number of seconds to wait for a spamd
           child to process or check a message.  The minimum value is 1,
           the default value is 300, and a value of 0 will disable child
           timeouts completely.

Spamc has the following timeout option:

       -t timeout, --timeout=timeout
           Set the timeout for spamc-to-spamd communications (default: 600, 0
           disables).  If spamd takes longer than this many seconds to reply
           to a message, spamc will abort the connection and treat this as a
           failure to connect; in other words the message will be returned
           unprocessed.
 

man spamc has this:

--connect-retries=retries
           Retry connecting to spamd retries times.  The default is 3 times.

Perhaps spamc -t 100 results in a 300-second timeout because the connection is retried three times (3 x 100 seconds).

 
 

At the moment we are running with a maildrop-mysql built with:

--enable-global-timeout=1200
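
(For context, the global timer is a configure-time setting, so changing it means rebuilding maildrop, roughly like this; the real package build passes many more options than shown:)

./configure --enable-global-timeout=1200
make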

Pietro also suggested the following modification to the maildrop_wrapper, which should cause any message that hits the Timeout notification to get the right failure code:

#!/usr/bin/perl

exit if (`/usr/bin/maildrop @ARGV` =~ /timeout/is);
exit(100) if ($? == 75 || $? == 77);

If the problem is that the user is over quota, we want to return 100 to postfix; if it is a timeout, we want to return 75.

This doesn't solve the question of why we are getting timeouts in the first place, but it should at least keep us from returning these incorrect timeout messages to people.

Someone on the SA list suggested that this could be caused by a particular message (likely one with a strange encoding), and that capturing a message that triggers it and looking at it would be useful. The problem is that when spamd gets stuck we get timeouts for all kinds of unrelated residual emails, and figuring out which one originally caused it is the hard part.

 
 

shouldn’t that be?:

exit(75) if (`/usr/bin/maildrop @ARGV` =~ /timeout/is);

 
 

yeah, you are right… put that in place yesterday thanks

 
 

I think we should change this to be more careful about executing shell scripts on our mailserver that handle potentially hostile external input. If it weren't for address verification in postfix, the wrapper could potentially do something bad if someone managed to email `rm -rf /var/spool/mail`@yourdomain.com, for example.

I don't know much about perl, but I think this would be a way to modify the one we have now so that @ARGV is sanitized before we use it:

#!/usr/bin/perl

use String::ShellQuote;

my $safe_args = shell_quote(@ARGV);
exit(75) if (`/usr/bin/maildrop $safe_args` =~ /timeout/is);
exit(100) if ($? == 75 || $? == 77);

 
 

I’ve put this new version of the wrapper into place

 
 

I have a theory about our wrapper, and I’m not sure about it yet…. but our wrapper basically does this:

1. sanitize the arguments
2. send the message to maildrop and look for the string "timeout" (case-insensitive) in its output; if it is found, exit with a 75 (to requeue the message)
3. if the result of the maildrop operation is an exit code 75 or 77, then we exit with an exit code 100 to bounce the message and not fill our deferred queue

What happens to the message if the result code of the maildrop operation is anything other than 0, 75, or 77? Say the exit code of trying to run that maildrop command is 255, or whatever?

If the result is 255, it won't exit with a 75 and it won't exit with a 100. I did some tests and found that the exit code of a perl script is whatever you pass to exit(), or 0 by default!
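A trivial way to see this from the shell (not part of the original tests): a perl one-liner that runs a failing command via backticks but never calls exit() still exits 0:

perl -e '`/bin/false`;'   # the backtick command fails...
echo $?                   # ...but this prints 0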

This means that any error other than 75 or 77 when sending messages to maildrop will end up returning a 0 exit code, which postfix will consider a successful delivery! I think this is where our problem is.

I think we need to use IPC::System::Simple to do this right.

 
 

Actually, IPC::System::Simple isn't in debian, and it seems like we can do without it, judging from perldoc -f system:

The return value is the exit status of the program as returned by the "wait" call.  To get the actual exit value, shift right by eight.

Which means that I think if we put the following at the bottom of the wrapper, we will be set:

exit($? >> 8)

 
 

Well, I tried that, and we were most definitely not set. This resulted in all the messages that were not getting spam scanned being deferred. These are probably due to the spam server being overwhelmed and spitting out:

Sep 29 14:40:14 albatross spamc[4691]: connect to spamd on 0.0.0.0 failed, retrying (#1 of 3): Interrupted system call

That error probably has a specific exit code associated with it; when postfix sees that exit code, it decides that the pipe command to the maildrop wrapper failed, and so it sticks the message into the deferred queue.

Why doesn't the message just get delivered when the spam scanning fails? I thought we had exception blocks around our maildroprc so that this is what would happen?

 
 

Well, we are now running with this:

#!/usr/bin/perl
#
# THIS FILE IS MANAGED BY PUPPET
#
# spamc produces:
#   EX_USAGE        64  command line usage error
#   EX_DATAERR      65  data format error
#   EX_NOINPUT      66  cannot open input
#   EX_NOUSER       67  addressee unknown
#   EX_NOHOST       68  host name unknown
#   EX_UNAVAILABLE  69  service unavailable
#   EX_SOFTWARE     70  internal software error
#   EX_OSERR        71  system error (e.g., can’t fork)
#   EX_OSFILE       72  critical OS file missing
#   EX_CANTCREAT    73  can’t create (user) output file
#   EX_IOERR        74  input/output error
#   EX_TEMPFAIL     75  temp failure; user is invited to retry
#   EX_PROTOCOL     76  remote error in protocol
#   EX_NOPERM       77  permission denied
#   EX_CONFIG       78  configuration error
# 

# maildrop produces:
#   75 when quota is exceeded, with this message:
#      maildrop: quota exceeded
#   75 when its connection to spamc times out, with this message:
#      maildrop: Timeout quota exceeded
#
#   77 is permission denied; we aren't sure, but it might be produced when the user is over quota?
#   0 on successful delivery
#
# how postfix responds:
#   100 => permanent failure, create bounce message
#   75  => temp failure, requeue
#   0   => success, postfix assumes the message was delivered ok.

use String::ShellQuote;
my $safe_args = shell_quote(@ARGV);
$error_msg = `/usr/bin/maildrop $safe_args 2>&1`;
$code = $? >> 8;
`/usr/bin/logger -t wrapper -pmail.info "WRAPPER: maildrop exit = $code; safe_args = $safe_args; error_msg = $error_msg"`;

if ($code == 0) {
    `/usr/bin/logger -t wrapper -pmail.info "WRAPPER: exit = $code; safe_args = $safe_args; error_msg = $error_msg"`;
    exit(0);    # success
} elsif ($error_msg =~ /timeout/is) {
    `/usr/bin/logger -t wrapper -pmail.info "WRAPPER: exit = $code; safe_args = $safe_args; error_msg = $error_msg"`;
    exit(75);   # temp failure
} else {
    `/usr/bin/logger -t wrapper -pmail.info "WRAPPER: exit = $code; safe_args = $safe_args; error_msg = $error_msg"`;
    exit(100);  # perm failure
}

Right now it's somewhat strange, because people who are over quota are generating bounce messages that don't say they are over quota:

<zygut@riseup.net>: Command died with status 100:
    "/usr/local/bin/maildrop_wrapper"

I think I solved this by adding a print STDERR $error_msg; right above the exit(100). This way the proper $error_msg will be passed back to postfix so that it can be included in the bounce message.

 
   

It was getting late last night when we were discussing this. I remember you guys saying something about the spamc -x option. I was trying to understand the spamc options by reading http://search.cpan.org/dist/Mail-SpamAssassin/spamc/spamc.pod#OPTIONS

It seems -d and -x are incompatible, and perhaps we could use -E instead: if spamc exits 1 the message is spam, and if it exits 0 we just deliver normally.
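
For what it's worth, a rough sketch of what the -E idea might look like from a shell (untested; the file names are placeholders, and per the man page -E makes spamc exit 1 when the message is spam and 0 when it is not):

# hypothetical check using spamc's exit code instead of the X-Spam-Flag header
/usr/bin/spamc -d spamd2-pn -t 100 -u "$LOGNAME" -E < message.txt > scanned.txt
if [ $? -eq 1 ]; then
    echo "message is spam"
else
    echo "message is ham (or the scan failed and the message passed through)"
fi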