Postfix benchmark: bug or performance regression ?

classic Classic list List threaded Threaded
13 messages Options
JRC
Reply | Threaded
Open this post in threaded view
|

Postfix benchmark: bug or performance regression ?

JRC
Hi all!

We used to have postfix 2.6.11 in our systems, which was then updated with no
problems to 3.3.2.
However, during a benchmark, we realized 3.3.2 was 5 times slower than the
version before. Even after disabling all mail filters the slowdown was still
the same.

The benchmark consists of a large amount of emails being fetched and entering
postfix via the smtpd daemon and then handed over to cyrus via lmtp and socket
to be delivered mostly to the same recipient.

At first, we thought maybe the concurrency limits have changed, or maybe now
there are less processes running, but all the configuration variables
regarding this have not changed.

The benchmark uses 30000 emails, but generating 20 emails is sufficient to see
the bug.

After trials and errors and running out of ideas, I decided to bisect the code
to see what made this change.

It happens that version 2.11.1 behaves as slow as 3.3.2, and version 2.11.0 as
fast as 2.6.11. All with the same hardware, same configuration, same load.

With version >= 2.11.1, the first emails are received and processed somewhat
instantly, after some time it seems that lmtp is not processing anything and
just waiting while all emails are fetched, so afterwards it starts delivering
to the recipients at the rate of about 2 per second. (See excerpt of the log
file)

Excerpt of maillog version > 2.11.1:

Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 7079468:
from=<[hidden email]>, size=356, nrcpt=1 (queue active)
Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent
(250 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694410-1>)
Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 718F46A:
client=localdomain.com[127.0.0.1]
Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6CEFF61: removed
Mar 27 14:46:50 localdomain postfix/cleanup[24748]: 718F46A: message-
id=<[hidden email]>
Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 718F46A:
from=<[hidden email]>, size=356, nrcpt=1 (queue active)
Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 7273461:
client=localdomain.com[127.0.0.1]
Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent
(250 2.1.5 Ok SESSIONID=<localdomain.com-24753-1553694410-1>)
Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed
...
...
Mar 27 14:46:50 localdomain postfix/cleanup[24748]: 764A06A: message-
id=<[hidden email]>
Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 764A06A:
from=<[hidden email]>, size=356, nrcpt=1 (queue active)
Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 778DC61:
client=localdomain.com[127.0.0.1]
Mar 27 14:46:50 localdomain postfix/cleanup[24758]: 778DC61: message-
id=<[hidden email]>
Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 778DC61:
from=<[hidden email]>, size=356, nrcpt=1 (queue active)
Mar 27 14:46:50 localdomain postfix/smtpd[24746]: 786A96B:
client=localdomain.com[127.0.0.1]
Mar 27 14:46:50 localdomain postfix/cleanup[24758]: 786A96B: message-
id=<[hidden email]>
...
...
Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5
Ok SESSIONID=<localdomain.com-24757-1553694412-1>)
Mar 27 14:46:52 localdomain postfix/qmgr[24339]: 764A06A: removed
Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250
2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694414-1>)
Mar 27 14:46:54 localdomain postfix/qmgr[24339]: 778DC61: removed
...
...
Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250
2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694418-1>)
Mar 27 14:46:58 localdomain postfix/qmgr[24339]: 7CFDE72: removed



With version <= 2.11.0, the messages are delivered as soon as they arrive.

Excerpt of maillog version <= 2.11.0:

...
...
Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AA6CC67: removed
Mar 27 15:00:35 localdomain postfix/smtpd[26446]: AC80663:
client=localdomain.com[127.0.0.1]
Mar 27 15:00:35 localdomain postfix/cleanup[26456]: AC80663: message-
id=<[hidden email]>
Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AC80663:
from=<[hidden email]>, size=356, nrcpt=1 (queue active)
Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AB58261:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5,
status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24769-1553695235-7>)
Mar 27 15:00:35 localdomain postfix/smtpd[26446]: AD4EB67:
client=localdomain.com[127.0.0.1]
Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AB58261: removed
Mar 27 15:00:35 localdomain postfix/cleanup[26456]: AD4EB67: message-
id=<[hidden email]>
Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AD4EB67:
from=<[hidden email]>, size=356, nrcpt=1 (queue active)
Mar 27 15:00:35 localdomain postfix/smtpd[26446]: disconnect from
localdomain.com[127.0.0.1]
Mar 27 15:00:35 localdomain postfix/lmtp[26453]: AC80663:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5,
status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-26455-1553695235-5>)
Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AC80663: removed
Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AD4EB67:
to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
imap/socket/lmtp], delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250
2.1.5 Ok SESSIONID=<localdomain.com-26452-1553695235-7>)
Mar 27 15:00:35 localdomain postfix/qmgr[26443]: AD4EB67: removed



Going deeper with bisecting the code, I found out that the following line
created this behaviour change:

diff --git a/postfix/src/smtp/smtp.h b/postfix/src/smtp/smtp.h
index 336a4f47..5437088e 100644
--- a/postfix/src/smtp/smtp.h
+++ b/postfix/src/smtp/smtp.h
@@ -195,7 +195,7 @@ typedef struct SMTP_STATE {
        STR((state)->iterator->request_nexthop)[0] = 0; \
     }
 
-#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop) !=
0)
+#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop)[0]
!= 0)
 
 
  /*

 
This changes the result of the conditions in a few places throughout the smtp
code. Also, "request_nexthop" can only acquire a value different than 0 during
the method "smtp_connect_inet", which probably shouldn't run during a unix
socket connection.

Questions:

(1) Does it mean that the lmtp connections via socket are never cached or
reused?
(2) Is this expected behaviour? Is this a bug?



Other configs and outputs:


Excerpt from main.cf: (only the configs that I believe are pertinent)

mailbox_transport = lmtp:unix:/var/imap/socket/lmtp
lmtp_destination_recipient_limit = 1
local_destination_concurrency_limit = 2


Output from scache after 20 emails:

version <= 2.11.0:

postfix/scache[26740]: statistics: start interval Mar 27 15:20:50
postfix/scache[26740]: statistics: domain lookup hits=7 miss=8 success=46%
postfix/scache[26740]: statistics: max simultaneous domains=1 addresses=1
connection=2


version >= 2.11.1:

postfix/scache[29823]: statistics: start interval Mar 27 15:24:59
postfix/scache[29823]: statistics: domain lookup hits=0 miss=17 success=0%
postfix/scache[29823]: statistics: max simultaneous domains=0 addresses=1
connection=4


Some help to make sense of all of it would be really high appreciated.

Julie.


Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Viktor Dukhovni
On Wed, Mar 27, 2019 at 03:36:28PM +0100, Juliana Rodrigueiro wrote:

> However, during a benchmark, we realized 3.3.2 was 5 times slower than the
> version before.

This is misleading.  Postfix is not 5 times slower, your benchmark
appears to be measuring the LMTP delivery rate to a single sink
mailbox, and this artificial work-load may be slower under some
adverse conditions (below...).

> With version >= 2.11.1, the first emails are received and processed somewhat
> instantly,

How many?  Are those connections closed or kept open in scache?

> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 7079468:
> from=<[hidden email]>, size=356, nrcpt=1 (queue active)
> Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent
> (250 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694410-1>)

> Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent
> (250 2.1.5 Ok SESSIONID=<localdomain.com-24753-1553694410-1>)

> Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5
> Ok SESSIONID=<localdomain.com-24757-1553694412-1>)

This message waited 2s in the active queue, before it was its turn
to go, this suggests some other deliveries were still in progress...

> Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250

This delivery took 2 seconds to complete a connection, why?

> Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250

This delivery spent 6s waiting in the active queue and 2s to complete
a connection.  Why are connections to the LMTP server so expensive?
Is your syslog service logging synchronously?  Perhaps logging of
connection establishment (in the LMTP server) is causing congestion
at LMTP connect?  What is the LMTP server doing in connect and LHLO
processing?

> Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AB58261:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5,
> status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-24769-1553695235-7>)

This shows LMTP connection re-use, but that should not make much
difference with unix-domain sockets.  Local connections are supposed
to be fast.

> Mar 27 15:00:35 localdomain postfix/lmtp[26453]: AC80663:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5,
> status=sent (250 2.1.5 Ok SESSIONID=<localdomain.com-26455-1553695235-5>)

Again re-use, but the next connection is fresh:

> Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AD4EB67:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250
> 2.1.5 Ok SESSIONID=<localdomain.com-26452-1553695235-7>)

> Going deeper with bisecting the code, I found out that the following line
> created this behaviour change:
>
> -#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop) !=
> 0)
> +#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop)[0] != 0)

That's a necessary bugfix, without it the test is always true.

> Questions:
>
> (1) Does it mean that the lmtp connections via socket are never cached or reused?

There is no caching by nexthop domain, since there is no nexthop
domain for unix-domain LMTP sockets.  All there is is a nexthop
socket address.  So session caching, if any, would naturally be
just by endpoint address.  However, in 2.11 the smtp_connect_local()
function also changed:

    https://github.com/vdukhovni/postfix/commit/072fa2d1f7beaf41756c30a69ef96cb611461cec#diff-24650cb98191c311688117b61626fffbR506

to use smtp_reuse_nexthop() rather than smtp_reuse_addr(), so that
SASL authenticated connections might be re-used, but this looks
like a mistake, since we're therefore creating cached sessions that
will never be re-used, but that tie up LMTP server resources,
possibly leading to subsequent congestion under load.

> (2) Is this expected behaviour? Is this a bug?

There's likely a bug.  We should either simulate a synthetic nexthop
($myhostname?) for unix-domain destinations, and then do nexthop
reuse (and perhaps do no caching by endpoint address for unix-domain
destinations), or go back to using re-use by endpoint address.
Currently there's a mismatch, and we're keeping connections alive
that can't be reused.

The fix will be easy, once we decide on the fine details.

--
        Viktor.
Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Wietse Venema
In reply to this post by JRC
Juliana Rodrigueiro:

> Excerpt of maillog version > 2.11.1:
> Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent
> (250 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694410-1>)
> Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent
> (250 2.1.5 Ok SESSIONID=<localdomain.com-24753-1553694410-1>)
> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed

The a/b/c/d delays logging shows:
a = time from message arrival to last active queue entry
b = time from last active queue entry to connection setup
c = time in connection setup, including DNS, EHLO and STARTTLS
d = time in message transmission

Here the delays are 0.01s or less.

> Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5
> Ok SESSIONID=<localdomain.com-24757-1553694412-1>)

The message had to wait 2s until a delivery agent became available, for
reasons that will become clear later.

> Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250
> 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694414-1>)

The message had to wait 2s until a delivery agent became available,
and then it spent 2s in connection setup (including DNS lookup).

> Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72:
> to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250
> 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694418-1>)

This message had to wait 6s until a delivery agent became available,
and then it spent 2s in connection setup (including DNS lookup).

There is a clear pattern in here: under load, the LMTP client spends
2s in connection setup (including DNS lookup).

Also, the connections are not reused, which may be the result of a
Postfix change that I will look into.

But connection reuse is an optimization; it should never take 2s
to set up an LMTP connection, not even if the LMTP server is on the
other side of the world.

The core problem is that under load, it takes 2s to set up an LMTP
connection. If you can fix that, then Postfix will perform fine
with or without connection reuse.

What kind of LMTP server are you using?

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Viktor Dukhovni
In reply to this post by Viktor Dukhovni
> On Mar 27, 2019, at 3:01 PM, Viktor Dukhovni <[hidden email]> wrote:
>
> There's likely a bug.  We should either simulate a synthetic nexthop
> ($myhostname?) for unix-domain destinations, and then do nexthop
> reuse (and perhaps do no caching by endpoint address for unix-domain
> destinations), or go back to using re-use by endpoint address.
> Currently there's a mismatch, and we're keeping connections alive
> that can't be reused.

And of course the nexthop can't be "$myhostname".  If there are multiple
LMTP unix-domain socket destinations, we must not mix them up, so the
nexthop needs to be just the socket pathname.

--
        Viktor.

Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Wietse Venema
In reply to this post by Wietse Venema
Wietse Venema:

> Juliana Rodrigueiro:
> > Excerpt of maillog version > 2.11.1:
> > Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61:
> > to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> > imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent
> > (250 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694410-1>)
> > Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63:
> > to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> > imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent
> > (250 2.1.5 Ok SESSIONID=<localdomain.com-24753-1553694410-1>)
> > Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed
>
> The a/b/c/d delays logging shows:
> a = time from message arrival to last active queue entry
> b = time from last active queue entry to connection setup
> c = time in connection setup, including DNS, EHLO and STARTTLS
> d = time in message transmission
>
> Here the delays are 0.01s or less.
>
> > Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A:
> > to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> > imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5
> > Ok SESSIONID=<localdomain.com-24757-1553694412-1>)
>
> The message had to wait 2s until a delivery agent became available, for
> reasons that will become clear later.
>
> > Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61:
> > to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> > imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250
> > 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694414-1>)
>
> The message had to wait 2s until a delivery agent became available,
> and then it spent 2s in connection setup (including DNS lookup).
>
> > Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72:
> > to=<[hidden email]>, orig_to=<dst@localhost>, relay=localdomain.com[/var/
> > imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250
> > 2.1.5 Ok SESSIONID=<localdomain.com-24421-1553694418-1>)
>
> This message had to wait 6s until a delivery agent became available,
> and then it spent 2s in connection setup (including DNS lookup).
>
> There is a clear pattern in here: under load, the LMTP client spends
> 2s in connection setup (including DNS lookup).
>

To get rid of the 2s delays:

/etc/postfix/main.cf:
    lmtp_connection_cache_on_demand = no

Please let us know if that helps. Meanwhile we can develop a proper fix.

        Wietse
JRC
Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

JRC
In reply to this post by Viktor Dukhovni
On Wednesday, 27 March 2019 20:01:49 CET Viktor Dukhovni wrote:
> On Wed, Mar 27, 2019 at 03:36:28PM +0100, Juliana Rodrigueiro wrote:
> > However, during a benchmark, we realized 3.3.2 was 5 times slower than the
> > version before.
>
> This is misleading.  Postfix is not 5 times slower, your benchmark
> appears to be measuring the LMTP delivery rate to a single sink
> mailbox, and this artificial work-load may be slower under some
> adverse conditions (below...).
>

Although the benchmark produces a very large artificial work-load, this
scenario is still a very valid use case, even more considering that 20 emails
are enough to clog up deliveries.

The stark difference between the LMTP server (cyrus 2.4) log messages is that
most deliveries reuse the same session id when running postfix <= 2.11.0:

Mar 28 10:22:22 localdomain lmtpunix[4881]: auditlog: append
sessionid=<localdomain.com-4881-1553764942-1> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28047>
guid=<521e5e4a35095385881941a77f3035f6087c8942>
Mar 28 10:22:22 localdomain lmtpunix[4881]: Delivered:
<[hidden email]> to mailbox: user.dst
Mar 28 10:22:22 localdomain lmtpunix[4881]: USAGE dst user: 0.000000 sys:
0.000000
Mar 28 10:22:22 localdomain lmtpunix[4926]: auditlog: append
sessionid=<localdomain.com-4926-1553764942-1> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28048>
guid=<baa35444706bddf1963624e672905db77cce131a>
Mar 28 10:22:22 localdomain lmtpunix[4926]: Delivered:
<[hidden email]> to mailbox: user.dst
...
...
Mar 28 10:22:22 localdomain lmtpunix[4926]: auditlog: append
sessionid=<localdomain.com-4926-1553764942-5> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28060>
guid=<0dc9e922f8b8cfc01ccc5bfeafa8220812adf8fa>
Mar 28 10:22:22 localdomain lmtpunix[4926]: Delivered:
<[hidden email]> to mailbox: user.dst
Mar 28 10:22:22 localdomain lmtpunix[4926]: USAGE dst user: 0.000000 sys:
0.004000
Mar 28 10:22:22 localdomain lmtpunix[6449]: auditlog: append
sessionid=<localdomain.com-6449-1553764942-5> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28061>
guid=<781d02ef98dde39e03461a60aa18beb858f8a5d2>
Mar 28 10:22:22 localdomain lmtpunix[6449]: Delivered:
<[hidden email]> to mailbox: user.dst


In the other hand, for version >= 2.11.1, the sessions are used once per
process.

Mar 28 10:04:06 localdomain lmtpunix[4273]: Delivered:
<[hidden email]> to mailbox: user.dst
Mar 28 10:04:06 localdomain lmtpunix[4273]: USAGE dst user: 0.004000 sys:
0.000000
Mar 28 10:04:06 localdomain lmtpunix[4292]: auditlog: append
sessionid=<localdomain.com-4292-1553763846-1> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28032>
guid=<165f88d841750f2cc9a59a74e777ada4c884dae3>
Mar 28 10:04:06 localdomain lmtpunix[4292]: Delivered:
<[hidden email]> to mailbox: user.dst
Mar 28 10:04:06 localdomain lmtpunix[4292]: USAGE dst user: 0.000000 sys:
0.000000
Mar 28 10:04:06 localdomain lmtpunix[4295]: auditlog: append
sessionid=<localdomain.com-4295-1553763846-1> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28033>
guid=<f8c75569e69424c3b529a6e5c228e416518c48d5>
...
...
Mar 28 10:04:10 localdomain lmtpunix[4295]: auditlog: append
sessionid=<localdomain.com-4295-1553763850-1> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28041>
guid=<7ba5808ec13795785ab5a5737cfef97ff24d307a>
Mar 28 10:04:10 localdomain lmtpunix[4295]: Delivered:
<[hidden email]> to mailbox: user.dst
Mar 28 10:04:10 localdomain lmtpunix[4295]: USAGE dst user: 0.000000 sys:
0.004000
Mar 28 10:04:12 localdomain lmtpunix[4273]: auditlog: append
sessionid=<localdomain.com-4273-1553763852-1> mailbox=<user.dst>
uniqueid=<3aab65ba5c98d020> uid=<28042>
guid=<8ee41c08e69e9cdf9be61b2f2bad1a2d2c590f12>
Mar 28 10:04:12 localdomain lmtpunix[4273]: Delivered:
<[hidden email]> to mailbox: user.dst

> > Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72:
> > to=<[hidden email]>, orig_to=<dst@localhost>,
> > relay=localdomain.com[/var/ imap/socket/lmtp], delay=8.1,
> > delays=0/6/2/0.02, dsn=2.1.5, status=sent (250
> This delivery spent 6s waiting in the active queue and 2s to complete
> a connection.  Why are connections to the LMTP server so expensive?
> Is your syslog service logging synchronously?  Perhaps logging of
> connection establishment (in the LMTP server) is causing congestion
> at LMTP connect?  What is the LMTP server doing in connect and LHLO
> processing?
>

Unlikely, the LMTP server doesn't log any more or less than it did before.

> > -#define HAVE_NEXTHOP_STATE(state)
> > (STR((state)->iterator->request_nexthop) != 0)
> > +#define HAVE_NEXTHOP_STATE(state)
> > (STR((state)->iterator->request_nexthop)[0] != 0)
> That's a necessary bugfix, without it the test is always true.
>

I understand the necessary fix, and that is why I was puzzled by the behaviour
change.

> > Questions:
> >
> > (1) Does it mean that the lmtp connections via socket are never cached or
> > reused?
> There is no caching by nexthop domain, since there is no nexthop
> domain for unix-domain LMTP sockets.  All there is is a nexthop
> socket address.  So session caching, if any, would naturally be
> just by endpoint address.  However, in 2.11 the smtp_connect_local()
> function also changed:
>
>    
> https://github.com/vdukhovni/postfix/commit/072fa2d1f7beaf41756c30a69ef96cb
> 611461cec#diff-24650cb98191c311688117b61626fffbR506
>
> to use smtp_reuse_nexthop() rather than smtp_reuse_addr(), so that
> SASL authenticated connections might be re-used, but this looks
> like a mistake, since we're therefore creating cached sessions that
> will never be re-used, but that tie up LMTP server resources,
> possibly leading to subsequent congestion under load.
>

It would then make sense that the LMTP server is busy with the open
connections until it is finally closed after an inactivity timeout (probably
the 2 seconds we see). Also, in my setup there are maximum 4 LMTP process,
which then caps the deliveries by 2 emails per second.

On Wednesday, 27 March 2019 20:24:29 CET Wietse Venema wrote:
>
> What kind of LMTP server are you using?
>

I'm using cyrus 2.4.

>
> To get rid of the 2s delays:
>
> /etc/postfix/main.cf:
>     lmtp_connection_cache_on_demand = no
>
> Please let us know if that helps. Meanwhile we can develop a proper fix.

And yes, it worked, that helped a lot. Although not as fast as before, but now
I see only reasonable delays.

Thank you very much for looking into it!!!

Julie.






Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Wietse Venema
Juliana Rodrigueiro:

> > To get rid of the 2s delays:
> >
> > /etc/postfix/main.cf:
> >     lmtp_connection_cache_on_demand = no
> >
> > Please let us know if that helps. Meanwhile we can develop a proper fix.
>
> And yes, it worked, that helped a lot. Although not as fast as before, but now
> I see only reasonable delays.
>
> Thank you very much for looking into it!!!

And thank you for your thorough investigation that helped to narrow
down the root cause: under high traffic conditions, LMTP connections
are cached but never reused, therefore those idle cached connections
are exhausting server resources.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Viktor Dukhovni
> On Mar 28, 2019, at 12:03 PM, Wietse Venema <[hidden email]> wrote:
>
> And thank you for your thorough investigation that helped to narrow
> down the root cause: under high traffic conditions, LMTP connections
> are cached but never reused, therefore those idle cached connections
> are exhausting server resources.

Yes, thanks.  I was similarly impressed by the OP's detailed report.

For the record, in case anyone has not read the entire thread, the issue
is only with LMTP over unix-domain sockets.  LMTP over TCP is not affected.

--
        Viktor.

Reply | Threaded
Open this post in threaded view
|

PATCH: Postfix benchmark: bug or performance regression ?

Wietse Venema
In reply to this post by Wietse Venema
Wietse Venema:

> Juliana Rodrigueiro:
> > > To get rid of the 2s delays:
> > >
> > > /etc/postfix/main.cf:
> > >     lmtp_connection_cache_on_demand = no
> > >
> > > Please let us know if that helps. Meanwhile we can develop a proper fix.
> >
> > And yes, it worked, that helped a lot. Although not as fast as before, but
> > now I see only reasonable delays.
> >
> > Thank you very much for looking into it!!!
>
> And thank you for your thorough investigation that helped to narrow
> down the root cause: under high traffic conditions, LMTP connections
> are cached but never reused, therefore those idle cached connections
> are exhausting server resources.
This is a patch for LMTP connection reuse over UNIX-domain sockets.

        Bugfix (introduced: Postfix 3.0): LMTP connections over
        UNIX-domain sockets were cached but not reused, due to a
        cache lookup key mismatch. Therefore, idle cached connections
        could exhaust LMTP server resources, resulting in two-second
        pauses between email deliveries. This problem was investigated
        by Juliana Rodrigueiro. File: smtp/smtp_connect.c.

Attached are two diffs: one for Postfix versions 3.0-3.5, and one
for Postfix versions 3.4 and 3.5.

These diffs are identical except that in Postfix 3.4 some macros
were renamed to better match their purpose.

Tested with Postfix 3.5. Please give it a try.

        wietse

3.1-3.3.diff (838 bytes) Download Attachment
3.4-3.5.diff (878 bytes) Download Attachment
JRC
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: Postfix benchmark: bug or performance regression ?

JRC
On Friday, 29 March 2019 00:49:34 CET Wietse Venema wrote:

>
> This is a patch for LMTP connection reuse over UNIX-domain sockets.
>
>         Bugfix (introduced: Postfix 3.0): LMTP connections over
>         UNIX-domain sockets were cached but not reused, due to a
>         cache lookup key mismatch. Therefore, idle cached connections
>         could exhaust LMTP server resources, resulting in two-second
>         pauses between email deliveries. This problem was investigated
>         by Juliana Rodrigueiro. File: smtp/smtp_connect.c.
>
> These diffs are identical except that in Postfix 3.4 some macros
> were renamed to better match their purpose.
>
> Tested with Postfix 3.5. Please give it a try.
>
> wietse

There is a small problem in the 3.1-3.3.diff though. I believe the variable
"dest" should actually be "path".

I quickly tested with this modifications in postfix 3.3.2 and everything
worked.

Thank you a lot for the patch!

Julie.


Reply | Threaded
Open this post in threaded view
|

Re: PATCH: Postfix benchmark: bug or performance regression ?

Wietse Venema
Juliana Rodrigueiro:

> On Friday, 29 March 2019 00:49:34 CET Wietse Venema wrote:
>
> >
> > This is a patch for LMTP connection reuse over UNIX-domain sockets.
> >
> >         Bugfix (introduced: Postfix 3.0): LMTP connections over
> >         UNIX-domain sockets were cached but not reused, due to a
> >         cache lookup key mismatch. Therefore, idle cached connections
> >         could exhaust LMTP server resources, resulting in two-second
> >         pauses between email deliveries. This problem was investigated
> >         by Juliana Rodrigueiro. File: smtp/smtp_connect.c.
> >
> > These diffs are identical except that in Postfix 3.4 some macros
> > were renamed to better match their purpose.
> >
> > Tested with Postfix 3.5. Please give it a try.
> >
> > wietse
>
> There is a small problem in the 3.1-3.3.diff though. I believe the variable
> "dest" should actually be "path".

Oh, right.

> I quickly tested with this modifications in postfix 3.3.2 and everything
> worked.
>
> Thank you a lot for the patch!

Thanks for testing.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Ralf Hildebrandt-2
In reply to this post by Viktor Dukhovni
* Viktor Dukhovni <[hidden email]>:

> > On Mar 28, 2019, at 12:03 PM, Wietse Venema <[hidden email]> wrote:
> >
> > And thank you for your thorough investigation that helped to narrow
> > down the root cause: under high traffic conditions, LMTP connections
> > are cached but never reused, therefore those idle cached connections
> > are exhausting server resources.
>
> Yes, thanks.  I was similarly impressed by the OP's detailed report.
>
> For the record, in case anyone has not read the entire thread, the issue
> is only with LMTP over unix-domain sockets.  LMTP over TCP is not affected.

Ah, excellent. I was already afraid my setup would occasionally suffer
from this effect.

--
[*] sys4 AG

https://sys4.de, +49 (89) 30 90 46 64
Schleißheimer Straße 26/MG, 80333 München
                                           
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein
Reply | Threaded
Open this post in threaded view
|

Re: Postfix benchmark: bug or performance regression ?

Wietse Venema
Ralf Hildebrandt:

> * Viktor Dukhovni <[hidden email]>:
> > > On Mar 28, 2019, at 12:03 PM, Wietse Venema <[hidden email]> wrote:
> > >
> > > And thank you for your thorough investigation that helped to narrow
> > > down the root cause: under high traffic conditions, LMTP connections
> > > are cached but never reused, therefore those idle cached connections
> > > are exhausting server resources.
> >
> > Yes, thanks.  I was similarly impressed by the OP's detailed report.
> >
> > For the record, in case anyone has not read the entire thread, the issue
> > is only with LMTP over unix-domain sockets.  LMTP over TCP is not affected.
>
> Ah, excellent. I was already afraid my setup would occasionally suffer
> from this effect.

LMTP over TCP uses the same code paths as SMTP.

        Wietse