Monitoring amount of smtpd processes

Monitoring amount of smtpd processes

Peer Heinlein



Hi,

we're monitoring the number of active smtpd processes to make sure that
we do not reach the maxproc limit from master.cf.

If a client disconnects very early, the smtpd is still "unused" and
remains in server memory, waiting for the next connection.

If a server was flooded with a short peak of new connections, a server
could have $process_limit instances remaining ready-to-run in memory.

In those situations we're seeing false positives in our monitoring.

I can't see a way to detect those "waiting" smtpd processes so that we
could count them differently in the process list. AFAIK there's no way
(except counting the number of open connections with lsof/netstat).

What about the idea that Postfix flags those unused processes by
renaming them in the output of "ps"?

Dovecot has a "verbose proctitle" option where pop3/imap processes are
renamed in the process list so that they're showing the logged in user,
the state of TLS, the client IP and the last IMAP-command.

It would also be great to have something like this in Postfix, showing
some information about the connection:

smtpd [unused/virgin]
or
smtpd [<sasl_username>, <tls yes|no>, <client-ip>, <smtp_command>]

This would be useful for analysis and for getting a quick overview of
what's going on on busy servers.

Peer


--
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-42
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG: HRB 93818 B / Amtsgericht
Berlin-Charlottenburg,
Managing director: Peer Heinlein -- Registered office: Berlin


Re: Monitoring amount of smtpd processes

Stefan Bauer-2
We simply monitor established TCP sessions on the SMTP port. If the client goes away, the TCP session does as well:

lsof -i tcp:25 | grep ESTABLISHED | wc -l
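
One caveat with the command above: `lsof -i tcp:25` matches sockets where either end uses port 25, so on a host that also delivers mail outbound it counts the Postfix smtp(8) client connections as well. A minimal sketch that counts only inbound sessions, assuming a Linux host with the iproute2 `ss` tool; the function name `count_inbound_smtp` is made up for illustration:

```shell
# Count established TCP sessions whose LOCAL port is the SMTP port.
# Assumes iproute2 "ss": -H drops the header, -t selects TCP,
# -n skips name lookups. One output line corresponds to one session.
count_inbound_smtp() {
    port=${1:-25}
    ss -Htn state established "( sport = :$port )" | awk 'END { print NR }'
}
```

Calling `count_inbound_smtp` prints a single number; pass another port, for example `count_inbound_smtp 587`, to watch the submission service instead.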


Re: Monitoring amount of smtpd processes

Wietse Venema
In reply to this post by Peer Heinlein
Peer Heinlein:
>
> Hi,
>
> we're monitoring the number of active smtpd processes to make sure that
> we do not reach the maxproc limit from master.cf.
>
> If a client disconnects very early, the smtpd is still "unused" and
> remains in server memory, waiting for the next connection.

The Postfix behavior has nothing to do with the duration of an SMTP
session. It is determined by the max_idle setting in main.cf.

> If a server was flooded with a short peak of new connections, a server
> could have $process_limit instances remaining ready-to-run in memory.

You would see the same with a sustained peak one minute long.
It does not depend on the length of SMTP sessions.

> In those situations we're seeing false positives in our monitoring.

Please fix your monitoring!

Here is an idea: you could reduce the max_idle setting to 10s.
Systems are a bit faster now than they were 20 years ago.
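
For reference, the change suggested above is a single line in main.cf. A sketch, assuming the usual /etc/postfix/main.cf location (the path may differ on your system); the stock default for max_idle is 100s:

```
# /etc/postfix/main.cf (location is an assumption)
# Idle daemon processes wait this long for the next request before
# they exit. Default: 100s.
max_idle = 10s
```

Run `postfix reload` afterwards so that the running daemons pick up the new value.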

> I can't see a way to detect those "waiting" smtpd processes so that we
> could count them differently in the process list. AFAIK there's no way
> (except counting the number of open connections with lsof/netstat).

Yes, that would be an idea.

Another idea is to have the master write per-service idle/busy
status to a memory-mapped file.

> What about the idea that Postfix flags those unused processes by
> renaming them in the output of "ps"?

That is very system-dependent. Writing over argv[] is what people
did in 1982, but such code was making assumptions that were never
supported by any promise.

FreeBSD: has setproctitle(), similar to NetBSD and OpenBSD.

    Conclusion: this is a stable API that is safe to use.

Solaris: not supported. Some code tinkers with argv[], but the
    result differs between /usr/bin/ps and /usr/ucb/ps, and the
    change is visible only to root, or to processes that run with
    the same UID as the process that changes its argv[].

    Conclusion: not worth the trouble.

Linux: kludges galore. Some code involves not one, but two approaches:
    modify argv[], and set no more than 16 bytes with prctl(PR_SET_NAME).
    One approach affects ps -a output, the other affects ps -ax
    output. That's because one approach affects /proc/pid/this, and
    the other affects /proc/pid/that.

    Conclusion: prctl(PR_SET_NAME) is safe to use. I would not
    distribute Postfix's own version of the stinking pile of garbage
    that mucks directly with argv[].

> Dovecot has a "verbose proctitle" option where pop3/imap processes are
> renamed in the process list so that they're showing the logged in user,
> the state of TLS, the client IP and the last IMAP-command.

I would accept an implementation that uses well-defined APIs only:
that is setproctitle() on systems that are known to have it, and
prctl(PR_SET_NAME) on recent enough Linux systems. If a future Linux
version has setproctitle(), then I would use that instead. Until
then you'd have to use the 'right' ps command or read the 'right'
/proc file.
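
To make the 'right' /proc file concrete: on Linux, prctl(PR_SET_NAME) changes /proc/PID/comm (what `ps -o comm` prints), while overwriting argv[] changes /proc/PID/cmdline (what `ps -o args` prints). A minimal, Linux-only sketch for looking at both; the function name `proc_names` is hypothetical:

```shell
# Print both names that the Linux kernel exposes for one process.
proc_names() {
    pid=$1
    # comm: short name, the field that prctl(PR_SET_NAME) sets
    printf 'comm:    %s\n' "$(cat "/proc/$pid/comm")"
    # cmdline: the NUL-separated argv[] area, shown here with spaces
    printf 'cmdline: %s\n' "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
}

# Example: inspect the current shell.
proc_names "$$"
```

A monitoring script that wants to see a name set through prctl(PR_SET_NAME) would therefore have to read the comm field, not the full command line.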

        Wietse

Re: Monitoring amount of smtpd processes

Shawn Heisey-2
In reply to this post by Peer Heinlein
On 10/20/2018 7:24 AM, Peer Heinlein wrote:

> we're monitoring the number of active smtpd processes to make sure that
> we do not reach the maxproc limit from master.cf.
>
> If a client disconnects very early, the smtpd is still "unused" and
> remains in server memory, waiting for the next connection.
>
> If a server was flooded with a short peak of new connections, a server
> could have $process_limit instances remaining ready-to-run in memory.
>
> In those situations we're seeing false positives in our monitoring.

The number I found most useful to indicate something was going wrong is
the number of messages in the queue.  For the servers I manage, normally
that number would be single digit, maybe get to two digits on occasion.

When something gets broken, the number of messages in the queue tends to
balloon.  There are two primary causes I've seen for a large queue:  1)
A particularly massive email storm, either spam or internally generated
messages.  2) Delivery problems. There are lots of things that can cause
delivery problems.  The most common problem I ran into was one of the
webservers deciding that it needed to send thousands of messages. 
Waiting for those to clear out on their own so normal mail can make it
through could take DAYS.

I would typically get notified about a problem with email after an hour
or two where no messages were getting through, which is why I eventually
added a monitor for the queue size, so I could know about the problem
BEFORE it was noticed by high-profile people at the company.  With that,
I could fix the problem quickly and find the right developer to chew out
for sending thousands of messages.

For a particularly busy server, you probably would want to set the queue
size alarm threshold at a fairly large number (at least 1000), but for
one that's not very busy, more than about 100 is probably enough of a
reason to investigate and see if there's a problem.  Calculating the
total size of the message queue would be as simple as looking at the
contents of some of the directories in /var/spool/postfix.  You could
potentially run the 'mailq' command and parse its output, but I have
seen that take a REALLY long time to finish, so counting files in the
spool directories is probably better.
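
Counting files in the spool directories could be sketched as follows. This assumes the standard Postfix queue layout, in which the incoming, active and deferred directories hold one file per message, and it needs enough privileges to read them; the function name `count_queue_files` is made up for illustration:

```shell
# Count queue files below a Postfix spool directory.
# Only incoming, active and deferred are counted; deferred is hashed
# into subdirectories, which "find" descends into on its own.
count_queue_files() {
    spool=${1:-/var/spool/postfix}
    for q in incoming active deferred; do
        # Skip queue directories that do not exist on this host.
        [ -d "$spool/$q" ] && find "$spool/$q" -type f
    done | awk 'END { print NR }'
}
```

A monitoring check would then compare the printed number against a threshold such as the 100 or 1000 mentioned above.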

Thanks,
Shawn


Re: Monitoring amount of smtpd processes

Jan P. Kessler

>> we're monitoring the number of active smtpd processes to make sure that
>> we do not reach the maxproc limit from master.cf.
>>
>
> The number I found most useful to indicate something was going wrong
> is the number of messages in the queue.  For the servers I manage,
> normally that number would be single digit, maybe get to two digits on
> occasion.

The topic here is the number of smtpD processes (which serve *incoming*
smtp connections). When the number is set too low, you won't get the
messages in your queue.

To put it plainly: unless you are able to monitor the queues of all
systems that want to send email to you, this is not an option for solving
the described problem. If you are able to do that, I'd be very interested
in the code ;)

Cheers, Jan


Re: Monitoring amount of smtpd processes

Peer Heinlein
In reply to this post by Wietse Venema
On 20.10.2018 at 19:06, Wietse Venema wrote:

Hi,

>> If a client disconnects very early, the smtpd is still "unused" and
>> remains in server memory, waiting for the next connection.
>
> The Postfix behavior has nothing to do with the duration of an SMTP
> session. It is determined by the max_idle setting in main.cf.

max_idle was the option I was looking for. Thank you.

I always grepped for something like timeout/daemon/time and I never
found max_idle. :-)

> You would see the same with a sustained peak of one minute long.
> It does not depend on the length of SMTP sessions.

Yes and No.

If a client connects to smtpd and then drops the connection because
only STARTTLS or AUTH is offered, we are left with those remaining smtpd
processes -- which makes the server look busy while it isn't.

If there's really a long peak then the server IS busy and I WANT to have
an alarm.

>> In those situations we're seeing false positives in our monitoring.
> Please fix your monitoring!

Yes, I'm doing that -- that's why I'm asking for help (thanks for
max_idle) or for some additional changes that enable better monitoring.


>     Conclusion: prctl(PR_SET_NAME) is safe to use. I would not
>     distribute Postfix's own version of the stinking pile of garbage
>     that mucks directly with argv[].

:-)

I don't understand much about the differences and the different ways
to implement that. I'm not a coder, just an admin.

I only know that this works really well with Dovecot for me, and it's
very helpful for getting a quick overview of what's going on and who is
eating up your resources.

So if this could be implemented some day... I'd appreciate that.

Peer



Re: Monitoring amount of smtpd processes

Viktor Dukhovni


> On Oct 21, 2018, at 5:14 PM, Peer Heinlein <[hidden email]> wrote:
>
> If a client connects to smtpd and then drops the connection because
> only STARTTLS or AUTH is offered, we are left with those remaining smtpd
> processes -- which makes the server look busy while it isn't.
>
> If there's really a long peak then the server IS busy and I WANT to have
> an alarm.

You could look for "smtpd" processes with "-o stress=yes" on their
command line.  These are spawned by master(8) when the process limit
has been hit.

        http://www.postfix.org/STRESS_README.html#adapt
        http://www.postfix.org/STRESS_README.html#feature

The document does not mention one detail you may care to know:

        /*
         * When all servers for a public internet service are busy, we start
         * creating server processes with "-o stress=yes" on the command
         * line, and keep creating such processes until the process count is
         * below the limit for at least 1000 seconds. [...]
         */

So it takes ~16 minutes without hitting the limit before the stress setting
is "relaxed".
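
A check along these lines might look like the sketch below. It assumes that the stress option shows up as `stress=yes` in the args column of `ps` and that the process name is `smtpd`, with or without a leading path; the function name `count_stressed_smtpd` is hypothetical:

```shell
# Count smtpd processes that master(8) started with "-o stress=yes".
# Assumption: "ps -eo args" prints the full command line, with the
# program name (possibly with a path) as the first field.
count_stressed_smtpd() {
    ps -eo args | awk '
        ($1 == "smtpd" || $1 ~ /\/smtpd$/) && /stress=yes/ { n++ }
        END { print n + 0 }'
}
```

Any result above zero means the process limit was hit within roughly the last 1000 seconds, which is the sustained-load case that deserves an alarm rather than the idle-process case.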

On a modern server you can reasonably run around 1000 smtpd(8) processes,
and postscreen(8) should help to keep the typical process count lower than
it would be otherwise.

--
        Viktor.


Re: Monitoring amount of smtpd processes

Ralf Hildebrandt-2
In reply to this post by Peer Heinlein
> It would also be great to have something like this in Postfix, showing
> some information about the connection:
>
> smtpd [unused/virgin]
> or
> smtpd [<sasl_username>, <tls yes|no>, <client-ip>, <smtp_command>]
>
> Could be great for analysis and to get a quick overview about what's
> going on on busy servers.

That's a nice idea on systems where this kind of change is possible!

--
[*] sys4 AG

https://sys4.de, +49 (89) 30 90 46 64
Schleißheimer Straße 26/MG, 80333 München
                                           
Registered office: München, Amtsgericht München: HRB 199263
Executive board: Patrick Ben Koetter, Marc Schiffbauer
Chairman of the supervisory board: Florian Kirstein

Re: Monitoring amount of smtpd processes

Ralf Hildebrandt-2
In reply to this post by Peer Heinlein
> max_idle was the option I was looking for. Thank you.
>
> I always grepped for something like timeout/daemon/time and I never
> found max_idle. :-)

Lowered here as well...
