Postfix crashing under load

Devdas Bhagat
The last error messages I get are these:
Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7998]: warning: problem talking to service private/scache: Connection timed out
Sep  8 13:54:37 jaundiced-outlook postfix/smtp[20375]: warning: problem talking to service private/scache: Connection timed out
Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7960]: warning: problem talking to service private/scache: Connection timed out
Sep  8 13:54:37 jaundiced-outlook postfix/smtp[17618]: warning: problem talking to service private/scache: Connection timed out
<snip about 600 similar lines about this problem>
Sep  8 14:10:56 jaundiced-outlook postfix/master[11125]: fatal: watchdog timeout
Sep  8 14:10:56 jaundiced-outlook postfix/qmgr[13568]: fatal: watchdog timeout


postconf -n is:
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
data_directory = /var/lib/postfix
debug_peer_level = 2
html_directory = /usr/share/doc/postfix-2.5.2-documentation/html
inet_interfaces = all
mail_owner = postfix
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
max_use = 100000
maximal_backoff_time = 900s
minimal_backoff_time = 600s
mydestination = $myhostname, localhost.$mydomain, localhost
newaliases_path = /usr/bin/newaliases.postfix
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.5.2-documentation/readme
relay_destination_concurrency_limit = 1000
relay_domains = regexp:/etc/postfix/relay
relay_recipient_maps = regexp:/etc/postfix/relay
relayhost = [redacted-trap]
sample_directory = /usr/share/doc/postfix-2.5.2/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
smtpd_recipient_restrictions = check_policy_service inet:[127.0.0.1]:2025
                                check_sender_access hash:/etc/postfix/sender_access
                                check_client_access hash:/etc/postfix/aol_server_rejects
                                check_client_access hash:/etc/postfix/dnswl_rejects
                                check_client_access hash:/etc/postfix/whitelisted_clients
                                check_recipient_access hash:/etc/postfix/recipient_access
                                reject_invalid_hostname
                                reject_unknown_hostname
                                reject_rbl_client cbl.abuseat.org
                                reject_rbl_client dnsbl.sorbs.net
                                reject_rbl_client aspews.ext.sorbs.net
                                reject_unauth_destination
unknown_hostname_reject_code = 550
unknown_local_recipient_reject_code = 550


This is a heavily loaded server. Suggestions on cause(s) and fixes?

Devdas Bhagat

Re: Postfix crashing under load

Brian Evans - Postfix List
Devdas Bhagat wrote:

> The last error messages I get are these:
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7998]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[20375]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7960]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[17618]: warning: problem talking to service private/scache: Connection timed out
> <snip about 600 similar lines about this problem>
> Sep  8 14:10:56 jaundiced-outlook postfix/master[11125]: fatal: watchdog timeout
> Sep  8 14:10:56 jaundiced-outlook postfix/qmgr[13568]: fatal: watchdog timeout
>
>
> postconf -n is:
>  
[...]
> relay_domains = regexp:/etc/postfix/relay
> relay_recipient_maps = regexp:/etc/postfix/relay
>  

This looks potentially bad to me, but without knowing what is in that
/etc/postfix/relay map, it's hard to judge.
> relayhost = [redacted-trap]
>  

> smtpd_recipient_restrictions = check_policy_service inet:[127.0.0.1]:2025
>               check_sender_access hash:/etc/postfix/sender_access
>           check_client_access hash:/etc/postfix/aol_server_rejects
>                check_client_access hash:/etc/postfix/dnswl_rejects
> check_client_access hash:/etc/postfix/whitelisted_clients
> check_recipient_access hash:/etc/postfix/recipient_access
> reject_invalid_hostname
> reject_unknown_hostname
> reject_rbl_client cbl.abuseat.org
> reject_rbl_client dnsbl.sorbs.net
> reject_rbl_client aspews.ext.sorbs.net
> reject_unauth_destination
>  

This is a potential open relay.
If check_sender_access or check_recipient_access returns an OK, then it
is.  They should return permit_auth_destination for the simple fact that
they are easily forged.  Easy fix: move reject_unauth_destination to the
first position.
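A sketch of that reordering, built only from the restrictions in the posted postconf -n output. Whether it suits this host depends on what the access maps and the policy service return, which the thread does not show, so treat it as an illustration rather than a tested configuration:

```
# /etc/postfix/main.cf (illustrative reordering, untested on the poster's host)
smtpd_recipient_restrictions =
    reject_unauth_destination
    check_policy_service inet:[127.0.0.1]:2025
    check_sender_access hash:/etc/postfix/sender_access
    check_client_access hash:/etc/postfix/aol_server_rejects
    check_client_access hash:/etc/postfix/dnswl_rejects
    check_client_access hash:/etc/postfix/whitelisted_clients
    check_recipient_access hash:/etc/postfix/recipient_access
    reject_invalid_hostname
    reject_unknown_hostname
    reject_rbl_client cbl.abuseat.org
    reject_rbl_client dnsbl.sorbs.net
    reject_rbl_client aspews.ext.sorbs.net
```

With reject_unauth_destination first, an OK from a later access map can no longer authorize relaying to a destination the server does not handle.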

Employ and enforce SASL for untrusted networks.
> This is a heavily loaded server. Suggestions on cause(s) and fixes?
>
>  
Rethink your "relay" service or post more on what is in the maps discussed.

Spammers can eat you alive if you let them.

Brian

Re: Postfix crashing under load

Victor Duchovni
In reply to this post by Devdas Bhagat
On Mon, Sep 08, 2008 at 10:35:40PM +0530, Devdas Bhagat wrote:

> The last error messages I get are these:
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7998]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[20375]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7960]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[17618]: warning: problem talking to service private/scache: Connection timed out
> <snip about 600 similar lines about this problem>

Master daemon freezes and is unable to spawn any new processes.

> Sep  8 14:10:56 jaundiced-outlook postfix/master[11125]: fatal: watchdog timeout

After a 1000s delay, master bails out, so the problem started 16 minutes
and 40 seconds before 14:10:56, i.e. at 13:54:16.

> Sep  8 14:10:56 jaundiced-outlook postfix/qmgr[13568]: fatal: watchdog timeout

The queue manager was also frozen. What happened at ~13:54?
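
The watchdog arithmetic can be checked with a few lines of C. The helper names to_seconds and print_clock are hypothetical; the 1000 s delay and the 14:10:56 timestamp come from the log lines quoted in this thread.

```c
#include <stdio.h>

/* Seconds since midnight for a wall-clock time (same-day arithmetic only). */
long to_seconds(int h, int m, int s)
{
    return h * 3600L + m * 60L + s;
}

/* Print a seconds-since-midnight value as HH:MM:SS. */
void print_clock(long t)
{
    printf("%02ld:%02ld:%02ld\n", t / 3600, (t / 60) % 60, t % 60);
}
```

Calling print_clock(to_seconds(14, 10, 56) - 1000) prints 13:54:16, which is 16 minutes and 40 seconds before the fatal watchdog lines.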

The master received no events for 1000 seconds. Do you have a 60-second
wakeup timer for pickup in master.cf? Or a 300s timer for qmgr?
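
For reference, these timers live in the wakeup column of master.cf. A stock Postfix 2.5 master.cf typically carries entries like the ones below; this is an assumption about the default file, not the poster's actual configuration, and the chroot column varies by distribution:

```
# service type  private unpriv  chroot  wakeup  maxproc command + args
pickup    fifo  n       -       n       60      1       pickup
qmgr      fifo  n       -       n       300     1       qmgr
```

With these in place the master writes a trigger to pickup every 60 seconds and to qmgr every 300 seconds, so it should never go 1000 seconds without an event.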

Perhaps the O/S incorrectly reports a full qmgr FIFO as being ready,
and then master blocks trying to write a wakeup trigger (one byte)
to the FIFO? But that still leaves the question as to why qmgr is
frozen open...

This looks like an O/S resource issue...

--
        Viktor.

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the "Reply-To" header.

To unsubscribe from the postfix-users list, visit
http://www.postfix.org/lists.html or click the link below:
<mailto:[hidden email]?body=unsubscribe%20postfix-users>

If my response solves your problem, the best way to thank me is to not
send an "it worked, thanks" follow-up. If you must respond, please put
"It worked, thanks" in the "Subject" so I can delete these quickly.

Re: Postfix crashing under load

Wietse Venema
In reply to this post by Devdas Bhagat
Devdas Bhagat:
> The last error messages I get are these:
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7998]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[20375]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7960]: warning: problem talking to service private/scache: Connection timed out
> Sep  8 13:54:37 jaundiced-outlook postfix/smtp[17618]: warning: problem talking to service private/scache: Connection timed out
> <snip about 600 similar lines about this problem>
> Sep  8 14:10:56 jaundiced-outlook postfix/master[11125]: fatal: watchdog timeout
> Sep  8 14:10:56 jaundiced-outlook postfix/qmgr[13568]: fatal: watchdog timeout

I think that the kernel is running out of steam.

Try reducing the concurrency.
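
One way to act on this, assuming the relay transport is the one doing the deliveries: the posted configuration sets relay_destination_concurrency_limit = 1000, and the fragment below drops it back to 20, which is the stock default of default_destination_concurrency_limit. The value is illustrative only and does not come from this thread.

```
# /etc/postfix/main.cf (illustrative value, not tested on the poster's host)
relay_destination_concurrency_limit = 20
```

After editing main.cf, "postfix reload" makes the running daemons pick up the change.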

The master daemon triggers qmgr and pickup regularly. That "trigger"
write is non-blocking with a timeout of 1, so it cannot block the
master daemon. Except of course when the kernel is messed up.
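
To see what a non-blocking write is expected to do when the receiving end is full, here is a small C sketch. It uses an anonymous pipe as a stand-in for the qmgr FIFO, and errno_on_full_pipe is a hypothetical name, not a Postfix function. On a healthy kernel the refused write fails with EAGAIN instead of blocking.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Fill a non-blocking pipe until the kernel refuses more data.
 * Returns the errno of the refused write (EAGAIN is expected),
 * or -1 if the pipe could not be set up.
 */
int errno_on_full_pipe(void)
{
    int fds[2];
    int flags;
    int err = 0;
    char chunk[512] = {0};      /* small enough to be written atomically */

    if (pipe(fds) < 0)
        return -1;

    /* Put the write end into non-blocking mode. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Nobody reads, so the pipe eventually fills up. */
    for (;;) {
        if (write(fds[1], chunk, sizeof(chunk)) < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, try again */
            err = errno;        /* the write that the kernel refused */
            break;
        }
    }

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

The function never blocks because the descriptor is non-blocking; the loop ends as soon as one write is refused.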

        Wietse

Re: Postfix crashing under load

Wietse Venema
Wietse Venema:

> Devdas Bhagat:
> > The last error messages I get are these:
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7998]: warning: problem talking to service private/scache: Connection timed out
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[20375]: warning: problem talking to service private/scache: Connection timed out
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7960]: warning: problem talking to service private/scache: Connection timed out
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[17618]: warning: problem talking to service private/scache: Connection timed out
> > <snip about 600 similar lines about this problem>
> > Sep  8 14:10:56 jaundiced-outlook postfix/master[11125]: fatal: watchdog timeout
> > Sep  8 14:10:56 jaundiced-outlook postfix/qmgr[13568]: fatal: watchdog timeout
>
> I think that the kernel is running out of steam.
>
> Try reducing the concurrency.
>
> The master daemon triggers qmgr and pickup regularly. That "trigger"
> write is non-blocking with a timeout of 1, so it cannot block the
> master daemon. Except of course when the kernel is messed up.

Hmm, except that write_buf() will retry the write() after an EAGAIN
error. So to be really smart, write_buf() should watch the clock and
break the loop when the time expires.

        Wietse
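
A minimal sketch of the clock-watching idea described above, assuming poll() and a whole-second deadline. This is not the Postfix write_buf() source; write_with_deadline and its ETIMEDOUT convention are made up for illustration.

```c
#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not Postfix code: write len bytes to fd, giving up
 * once `timeout` seconds (must be positive) have elapsed in total.
 * Returns the number of bytes written, or -1 with errno set
 * (ETIMEDOUT when the deadline passes).
 */
ssize_t write_with_deadline(int fd, const char *buf, size_t len, int timeout)
{
    const char *start = buf;
    time_t deadline = time(NULL) + timeout;

    while (len > 0) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        time_t now = time(NULL);
        ssize_t count;
        int ready;

        /* Watch the clock on every pass, including after EAGAIN. */
        if (now >= deadline) {
            errno = ETIMEDOUT;
            return -1;
        }

        /* Wait for writability, but only for the time that is left. */
        ready = poll(&pfd, 1, (int) (deadline - now) * 1000);
        if (ready < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }

        count = write(fd, buf, len);
        if (count < 0) {
            if (errno == EAGAIN || errno == EINTR)
                continue;       /* retry, the deadline check bounds the loop */
            return -1;
        }
        buf += count;
        len -= count;
    }
    return buf - start;
}
```

Because the clock is checked at the top of every pass, a kernel that keeps reporting the descriptor as writable while write() returns EAGAIN can no longer trap the caller forever; the loop spins until the deadline and then gives up.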

Re: Postfix crashing under load

Victor Duchovni
On Mon, Sep 08, 2008 at 03:31:29PM -0400, Wietse Venema wrote:

> > The master daemon triggers qmgr and pickup regularly. That "trigger"
> > write is non-blocking with a timeout of 1, so it cannot block the
> > master daemon. Except of course when the kernel is messed up.
>
> Hmm, except that write_buf() will retry the write() after an EAGAIN
> error. So to be really smart, write_buf() should watch the clock and
> break the loop when the time expires.

Somewhat related question, if one removes all wakeup timers from
master.cf, will master(8) croak with a watchdog timer if the remaining
services are idle long enough?

--
        Viktor.


Re: Postfix crashing under load

Wietse Venema
In reply to this post by Wietse Venema
Wietse Venema:

> Wietse Venema:
> > Devdas Bhagat:
> > > The last error messages I get are these:
> > > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7998]: warning: problem talking to service private/scache: Connection timed out
> > > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[20375]: warning: problem talking to service private/scache: Connection timed out
> > > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7960]: warning: problem talking to service private/scache: Connection timed out
> > > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[17618]: warning: problem talking to service private/scache: Connection timed out
> > > <snip about 600 similar lines about this problem>
> > > Sep  8 14:10:56 jaundiced-outlook postfix/master[11125]: fatal: watchdog timeout
> > > Sep  8 14:10:56 jaundiced-outlook postfix/qmgr[13568]: fatal: watchdog timeout
> >
> > I think that the kernel is running out of steam.
> >
> > Try reducing the concurrency.
> >
> > The master daemon triggers qmgr and pickup regularly. That "trigger"
> > write is non-blocking with a timeout of 1, so it cannot block the
> > master daemon. Except of course when the kernel is messed up.
>
> Hmm, except that write_buf() will retry the write() after an EAGAIN
> error. So to be really smart, write_buf() should watch the clock and
> break the loop when the time expires.

If this is the problem, the workaround would be to break the
loop after EAGAIN. That would keep the master from timing out.

You'd still have a deadlocked qmgr for 1000s, though.

        Wietse

ssize_t write_buf(int fd, const char *buf, ssize_t len, int timeout)
{
    const char *start = buf;
    ssize_t count;

    while (len > 0) {
        if (timeout > 0 && write_wait(fd, timeout) < 0)
            return (-1);
        if ((count = write(fd, buf, len)) < 0) {
#if 0
            if (errno == EAGAIN && timeout > 0)
                continue;
#endif
            if (errno == EINTR)
                continue;
            return (-1);
        }


Re: Postfix crashing under load

Wietse Venema
In reply to this post by Victor Duchovni
Victor Duchovni:

> On Mon, Sep 08, 2008 at 03:31:29PM -0400, Wietse Venema wrote:
>
> > > The master daemon triggers qmgr and pickup regularly. That "trigger"
> > > write is non-blocking with a timeout of 1, so it cannot block the
> > > master daemon. Except of course when the kernel is messed up.
> >
> > Hmm, except that write_buf() will retry the write() after an EAGAIN
> > error. So to be really smart, write_buf() should watch the clock and
> > break the loop when the time expires.
>
> Somewhat related question, if one removes all wakeup timers from
> master.cf, will master(8) croak with a watchdog timer if the remaining
> services are idle long enough?

I don't care. Postfix without queue manager makes no sense,
and having no wakeup on the queue manager is insane.

        Wietse

Re: Postfix crashing under load

Devdas Bhagat
In reply to this post by Brian Evans - Postfix List
On Mon, Sep 08, 2008 at 01:23:53PM -0400, Brian Evans - Postfix List wrote:

> > relay_recipient_maps = regexp:/etc/postfix/relay
> >  
>
> This looks potentially bad to me, but without knowing what is in that
> /etc/postfix/relay map, it's hard to judge.
> > relayhost = [redacted-trap]
> >  
>
> > smtpd_recipient_restrictions = check_policy_service inet:[127.0.0.1]:2025
> >               check_sender_access hash:/etc/postfix/sender_access
> >           check_client_access hash:/etc/postfix/aol_server_rejects
> >                check_client_access hash:/etc/postfix/dnswl_rejects
> > check_client_access hash:/etc/postfix/whitelisted_clients
> > check_recipient_access hash:/etc/postfix/recipient_access
> > reject_invalid_hostname
> > reject_unknown_hostname
> > reject_rbl_client cbl.abuseat.org
> > reject_rbl_client dnsbl.sorbs.net
> > reject_rbl_client aspews.ext.sorbs.net
> > reject_unauth_destination
> >  
>
> This is a potential open relay.

Nah, it's sending mail to exactly the correct servers. There's a reason
for this host to have a relayhost setting, and for me to redact it.

Look at the name of the relayhost :P

> If check_sender_access or check_recipient_access returns an OK, then it
> is.  They should return permit_auth_destination for the simple fact that
> they are easily forged.  Easy fix: move reject_unauth_destination to the
> first position.

That would just increase the amount of mail the relayhost needs to process
for no appreciable benefit.

Devdas Bhagat

Re: Postfix crashing under load

Devdas Bhagat
In reply to this post by Wietse Venema
On Mon, Sep 08, 2008 at 03:27:31PM -0400, Wietse Venema wrote:

> Devdas Bhagat:
> > The last error messages I get are these:
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7998]: warning: problem talking to service private/scache: Connection timed out
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[20375]: warning: problem talking to service private/scache: Connection timed out
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[7960]: warning: problem talking to service private/scache: Connection timed out
> > Sep  8 13:54:37 jaundiced-outlook postfix/smtp[17618]: warning: problem talking to service private/scache: Connection timed out
> > <snip about 600 similar lines about this problem>
> > Sep  8 14:10:56 jaundiced-outlook postfix/master[11125]: fatal: watchdog timeout
> > Sep  8 14:10:56 jaundiced-outlook postfix/qmgr[13568]: fatal: watchdog timeout
>
> I think that the kernel is running out of steam.
>
> Try reducing the concurrency.
>
> The master daemon triggers qmgr and pickup regularly. That "trigger"
> write is non-blocking with a timeout of 1, so it cannot block the
> master daemon. Except of course when the kernel is messed up.

Hmm, this is
Linux 2.6.9-67.0.1.EL #1 Fri Nov 30 11:41:37 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
on a RHEL 4 box.

I'll lower the concurrency and see if the system stabilizes.

Devdas Bhagat