Possible reasons for "qmgr" loading the system?


Possible reasons for "qmgr" loading the system?

Santiago Romero-2

 Hi.

 Today I had a "load average" issue on a postfix mail server (it runs only
the postfix service). Suddenly the load average started to rise and the
qmgr process appeared at the top of "top", taking 20-30% of the CPU.

top - 18:19:54 up 7 days,  2:03,  2 users,  load average: 4.94, 3.96, 4.02
Tasks: 144 total,   6 running, 138 sleeping,   0 stopped,   0 zombie
Cpu(s): 48.3%us, 50.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Mem:   1035280k total,   999964k used,    35316k free,   149072k buffers
Swap:   750696k total,       88k used,   750608k free,   599308k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23665 postfix   20   0  5880 2628 1792 S 20.3  0.3  68:11.18 qmgr
23662 root      20   0  5392 1732 1400 R  6.0  0.2  20:49.46 master


 Network traffic was low and we had the normal throughput of emails.

 Queue had only 73 emails in it when the problem happened (just like
now, they are all deferred emails).

 Doing "postfix stop" / "postfix start" solved the problem.

 In case it happens again: where should I look? At the OS level (disk or
network I/O, processes...) I didn't see anything unusual before the
"postfix restart"...

 Thanks.

--
Santiago Romero



Re: Possible reasons for "qmgr" loading the system?

Wietse Venema
Santiago Romero:
>  In case it happens again: where should I look? At the OS level (disk or
> network I/O, processes...) I didn't see anything unusual before the
> "postfix restart"...

Try ``strace -o filename -p pid'' or the equivalent for your OS.

        Wietse

Re: Possible reasons for "qmgr" loading the system?

Santiago Romero-2
Wietse Venema wrote:
> Santiago Romero:
>  
>>  In case it happens again: where should I look? At the OS level (disk or
>> network I/O, processes...) I didn't see anything unusual before the
>> "postfix restart"...
>>    
>
> Try ``strace -o filename -p pid'' or the equivalent for your OS.
>  

 Hi.

 Today it happened again on 2 more machines. The last one:


top - 09:44:25 up 19:39,  2 users,  load average: 4.68, 4.87, 4.76
Tasks: 154 total,   6 running, 148 sleeping,   0 stopped,   0 zombie
Cpu(s): 30.7%us, 49.2%sy,  0.0%ni, 11.7%id,  1.3%wa,  1.0%hi,  6.1%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26926 postfix   20   0  5840 2552 1792 R   43  0.3 276:51.22 qmgr


The problem had never appeared on those machines until yesterday, when I
added the following to the postfix configuration:

####   /etc/postfix/master.cf
slow     unix  -       -       -       -       -       smtp
  -o syslog_name=postfix-slow


####   /etc/postfix/main.cf
# Special "slow" transport:
slow_destination_recipient_limit=1
slow_destination_concurrency_limit=1
slow_destination_rate_delay=5


 Running strace on the qmgr process for a while (before restarting
postfix) showed lots of lines like:

time(NULL)                              = 1236156322
epoll_ctl(8, EPOLL_CTL_DEL, 128, {EPOLLIN, {u32=128,
u64=13252642876283682944}}) = 0
fcntl64(128, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl64(128, F_SETFL, O_RDWR)           = 0
ioctl(128, FIONREAD, [10])              = 0
poll([{fd=128, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
read(128, "status\0000\0\0", 4096)      = 10
gettimeofday({1236156322, 508869}, NULL) = 0
close(128)                              = 0
epoll_ctl(8, EPOLL_CTL_DEL, 129, {EPOLLIN, {u32=129,
u64=13252642876283682945}}) = 0
fcntl64(129, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl64(129, F_SETFL, O_RDWR)           = 0
ioctl(129, FIONREAD, [10])              = 0
poll([{fd=129, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
read(129, "status\0000\0\0", 4096)      = 10
gettimeofday({1236156322, 510488}, NULL) = 0
close(129)                              = 0
alarm(333)                              = 333
socket(PF_FILE, SOCK_STREAM, 0)         = 13
fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
gettimeofday({1236156322, 513893}, NULL) = 0
fcntl64(13, F_DUPFD, 128)               = 128
close(13)                               = 0
epoll_ctl(8, EPOLL_CTL_ADD, 128, {EPOLLIN, {u32=128,
u64=13834671851822907520}}) = 0
time(NULL)                              = 1236156322
socket(PF_FILE, SOCK_STREAM, 0)         = 13
fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
gettimeofday({1236156322, 515731}, NULL) = 0
fcntl64(13, F_DUPFD, 128)               = 129
close(13)                               = 0
epoll_ctl(8, EPOLL_CTL_ADD, 129, {EPOLLIN, {u32=129,
u64=13834671851822907521}}) = 0
time(NULL)                              = 1236156322
ioctl(3, FIONREAD, [100])               = 0
time(NULL)                              = 1236156322


 My problem seems to be related to my new "slow" transport. I don't know
what I'm doing wrong, because I followed your advice and the postfix
manuals... but this has been happening since I added my "slow" transport ...

 I'm using postfix-2.5.1-2ubuntu1.2.

--
Santiago Romero



Re: Possible reasons for "qmgr" loading the system?

Victor Duchovni
On Wed, Mar 04, 2009 at 10:15:05AM +0100, Santiago Romero wrote:

> ####   /etc/postfix/master.cf
> slow     unix  -       -       -       -       -       smtp
>  -o syslog_name=postfix-slow
>
>
> ####   /etc/postfix/main.cf
> # Special "slow" transport:
> slow_destination_recipient_limit=1

A really BAD idea, don't do this. Set a recipient limit of at least 2 and
ideally 10 or even the default of 50 unless the receiving system enforces
a tighter limit.

> slow_destination_concurrency_limit=1

Unnecessary:

> slow_destination_rate_delay=5

With this you get 5 seconds between deliveries to each destination queue,
but the queues are *per-user* if the recipient limit is 1.
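
A minimal sketch in C of why the delay ends up per recipient rather than
per domain. It assumes, as described above, that each destination queue
is named after the full recipient address when the recipient limit is 1
and after the domain otherwise. queue_key() is a hypothetical helper
written for this illustration, not a Postfix function.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Postfix): pick the name of the
 * destination queue for one recipient address.
 *
 * recipient_limit == 1 -> one queue per recipient (user@domain)
 * recipient_limit  > 1 -> one queue per domain
 */
const char *queue_key(const char *rcpt, int recipient_limit)
{
    const char *at;

    assert(rcpt != NULL);
    at = strrchr(rcpt, '@');

    if (recipient_limit == 1 || at == NULL)
        return rcpt;            /* whole address names the queue */
    return at + 1;              /* domain part names the queue */
}
```

With a limit of 1, alice@example.com and bob@example.com land in two
different queues, so each waits its own 5 seconds and two messages to
the same domain can still go out back to back. With a larger limit both
map to example.com and share one delay.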

> Stracing qmgr process for a while (before restarting postfix), showed lots
> of lines like:
>
> time(NULL)                              = 1236156322
> epoll_ctl(8, EPOLL_CTL_DEL, 128, {EPOLLIN, {u32=128,
> u64=13252642876283682944}}) = 0
> fcntl64(128, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
> fcntl64(128, F_SETFL, O_RDWR)           = 0
> ioctl(128, FIONREAD, [10])              = 0
> poll([{fd=128, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
> read(128, "status\0000\0\0", 4096)      = 10
> gettimeofday({1236156322, 508869}, NULL) = 0
> close(128)                              = 0
> epoll_ctl(8, EPOLL_CTL_DEL, 129, {EPOLLIN, {u32=129,
> u64=13252642876283682945}}) = 0
> fcntl64(129, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
> fcntl64(129, F_SETFL, O_RDWR)           = 0
> ioctl(129, FIONREAD, [10])              = 0
> poll([{fd=129, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
> read(129, "status\0000\0\0", 4096)      = 10
> gettimeofday({1236156322, 510488}, NULL) = 0
> close(129)                              = 0
> alarm(333)                              = 333
> socket(PF_FILE, SOCK_STREAM, 0)         = 13
> fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
> fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
> gettimeofday({1236156322, 513893}, NULL) = 0
> fcntl64(13, F_DUPFD, 128)               = 128
> close(13)                               = 0
> epoll_ctl(8, EPOLL_CTL_ADD, 128, {EPOLLIN, {u32=128,
> u64=13834671851822907520}}) = 0
> time(NULL)                              = 1236156322
> socket(PF_FILE, SOCK_STREAM, 0)         = 13
> fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
> fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
> gettimeofday({1236156322, 515731}, NULL) = 0
> fcntl64(13, F_DUPFD, 128)               = 129
> close(13)                               = 0
> epoll_ctl(8, EPOLL_CTL_ADD, 129, {EPOLLIN, {u32=129,
> u64=13834671851822907521}}) = 0
> time(NULL)                              = 1236156322
> ioctl(3, FIONREAD, [100])               = 0
> time(NULL)                              = 1236156322

Is it the queue manager that's burning CPU? Nothing too interesting
here.

--
        Viktor.

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the "Reply-To" header.

To unsubscribe from the postfix-users list, visit
http://www.postfix.org/lists.html or click the link below:
<mailto:[hidden email]?body=unsubscribe%20postfix-users>

If my response solves your problem, the best way to thank me is to not
send an "it worked, thanks" follow-up. If you must respond, please put
"It worked, thanks" in the "Subject" so I can delete these quickly.

Re: Possible reasons for "qmgr" loading the system?

Wietse Venema
In reply to this post by Santiago Romero-2
Santiago Romero:

>  Stracing qmgr process for a while (before restarting postfix), showed
> lots of lines like:
>
> time(NULL)                              = 1236156322
> epoll_ctl(8, EPOLL_CTL_DEL, 128, {EPOLLIN, {u32=128,
> u64=13252642876283682944}}) = 0
> fcntl64(128, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
> fcntl64(128, F_SETFL, O_RDWR)           = 0
> ioctl(128, FIONREAD, [10])              = 0
> poll([{fd=128, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
> read(128, "status\0000\0\0", 4096)      = 10
> gettimeofday({1236156322, 508869}, NULL) = 0
> close(128)                              = 0

Some time ago the queue manager made a connection to a delivery
agent, but that part is missing.

The queue manager receives a status of 0, from a delivery agent,
indicating that the delivery agent is ready to accept a delivery
request.  

The code follows this path:

       ... if (attr_scan(stream, ATTR_FLAG_STRICT,
                         ATTR_TYPE_INT, MAIL_ATTR_STATUS, &stat,
                         ATTR_TYPE_END) != 1) {
        /* not applicable... */
    } else {
        return (stat ? DELIVER_STAT_DEFER : 0);
    }

We know that the status is 0, so the result value is 0.
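
For readers matching this to the trace: the 10 bytes in
read(128, "status\0000\0\0", 4096) are the attribute name "status", a
NUL, the value "0", a NUL, and a final NUL that ends the reply. Below is
a minimal sketch in C of decoding such a reply. It is a simplified
stand-in written for this explanation and is not Postfix's attr_scan();
parse_status_reply() and TOY_STAT_DEFER are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_STAT_DEFER 1    /* hypothetical stand-in for DELIVER_STAT_DEFER */

/*
 * Decode one "name NUL value NUL NUL" reply as seen in the trace.
 * Returns 0 when the agent is ready, TOY_STAT_DEFER for a non-zero
 * status, and -1 when the buffer is not a well-formed status reply.
 */
int parse_status_reply(const char *buf, size_t len)
{
    const char *nul1;
    const char *value;
    const char *nul2;
    const char *end;

    assert(buf != NULL);

    /* The attribute name must be "status", terminated inside the buffer. */
    nul1 = memchr(buf, '\0', len);
    if (nul1 == NULL || strcmp(buf, "status") != 0)
        return -1;

    /* The value must be non-empty and terminated inside the buffer. */
    value = nul1 + 1;
    nul2 = memchr(value, '\0', len - (size_t) (value - buf));
    if (nul2 == NULL || nul2 == value)
        return -1;

    /* One more NUL marks the end of the reply. */
    end = nul2 + 1;
    if ((size_t) (end - buf) >= len || *end != '\0')
        return -1;

    return atoi(value) ? TOY_STAT_DEFER : 0;
}
```

Feeding it the 10 bytes from the trace yields 0, which matches the
"ready" result discussed here.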

The result value is used in the following statement:

    if (stream == 0 || qmgr_deliver_initial_reply(stream) != 0) {
        /* not applicable */
    }

Then, if you use the new queue manager, this happens:

    if ((entry = qmgr_job_entry_select(transport)) == 0) {
        (void) vstream_fclose(stream);
        return;
    }

If you use the old queue manager, this happens instead:

    if ((queue = qmgr_queue_select(transport)) == 0
        || (entry = qmgr_entry_select(queue)) == 0) {
        (void) vstream_fclose(stream);
        return;
    }

Either way, the queue manager finds that there is no work, and
disconnects from the delivery agent. This part is responsible for
the gettimeofday() and close() calls at the end of the trace above.

You might want to repeat your precise Postfix version at this point,
and which queue manager version is configured in your master.cf.
Current Postfix versions have (qmgr=new, oqmgr=old) in master.cf.
Older Postfix versions have (nqmgr=new, qmgr=old) instead. The
programs are the same except for the job selection algorithm.

If you are using the new queue manager, it is worthwhile to see if
the problem persists when you switch to the old queue manager.

If the problem is with "new" queue manager, the question is why
does qmgr_transport_select() find work that qmgr_job_entry_select()
can't find? The answer is who knows, there is only one person who
understands this code, and it isn't me. qmgr_job_entry_select()
has a number of ways in which it can decide that there is no
suitable work. I could log a warning at this point, because this
is where the process could go into a tight loop.
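
To make the failure mode concrete, here is a toy model in C of how two
selection steps can disagree. The structure and the toy_ functions are
invented for illustration and are far simpler than the real queue
manager. The assumption is that the first step counts work on queues
that the second step refuses to use, for example because they are
suspended.

```c
#include <assert.h>
#include <stddef.h>

/* Toy destination queue; field names are illustrative only. */
struct toy_dest {
    int todo;       /* entries waiting for delivery */
    int ready;      /* 0 while the queue is suspended */
};

/* Step 1: does the transport appear to have work? Ignores "ready". */
int toy_transport_has_work(const struct toy_dest *q, size_t n)
{
    size_t i;

    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (q[i].todo > 0)
            return 1;
    return 0;
}

/* Step 2: pick a queue to deliver from; skips suspended queues. */
int toy_entry_select(const struct toy_dest *q, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (q[i].ready && q[i].todo > 0)
            return (int) i;
    return -1;                  /* no eligible work */
}

/*
 * Count the connect/close round trips a naive scheduler makes. Each
 * round mirrors the trace: connect to the delivery agent, find nothing
 * to hand over, close the connection. max_rounds exists only so that
 * the sketch terminates.
 */
int toy_wasted_rounds(const struct toy_dest *q, size_t n, int max_rounds)
{
    int rounds = 0;

    while (rounds < max_rounds && toy_transport_has_work(q, n)
           && toy_entry_select(q, n) < 0)
        rounds++;
    return rounds;
}
```

With a single suspended queue that still holds mail, step 1 keeps
answering yes and step 2 keeps answering no, so the loop runs until the
artificial cap is reached.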

> epoll_ctl(8, EPOLL_CTL_DEL, 129, {EPOLLIN, {u32=129,
> u64=13252642876283682945}}) = 0
> fcntl64(129, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
> fcntl64(129, F_SETFL, O_RDWR)           = 0
> ioctl(129, FIONREAD, [10])              = 0
> poll([{fd=129, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
> read(129, "status\0000\0\0", 4096)      = 10
> gettimeofday({1236156322, 510488}, NULL) = 0
> close(129)                              = 0

Same story. We miss the beginning of the story, and witness the
end.

> alarm(333)                              = 333

That's the built-in watchdog timer (1000/3 seconds).

> socket(PF_FILE, SOCK_STREAM, 0)         = 13
> fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
> fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
> gettimeofday({1236156322, 513893}, NULL) = 0
> fcntl64(13, F_DUPFD, 128)               = 128
> close(13)                               = 0
> epoll_ctl(8, EPOLL_CTL_ADD, 128, {EPOLLIN, {u32=128,
> u64=13834671851822907520}}) = 0
> time(NULL)                              = 1236156322
> socket(PF_FILE, SOCK_STREAM, 0)         = 13
> fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
> fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
> gettimeofday({1236156322, 515731}, NULL) = 0
> fcntl64(13, F_DUPFD, 128)               = 129
> close(13)                               = 0
> epoll_ctl(8, EPOLL_CTL_ADD, 129, {EPOLLIN, {u32=129,
> u64=13834671851822907521}}) = 0
> time(NULL)                              = 1236156322
> ioctl(3, FIONREAD, [100])               = 0
> time(NULL)                              = 1236156322

In qmgr_active_drain(), qmgr_transport_select() has found work for
the slow transport, and qmgr_transport_alloc() has made a connection
to a slow delivery agent.

The queue manager is now waiting for the delivery agent to report
that it is ready (with status = 0).

        Wietse

Re: Possible reasons for "qmgr" loading the system?

Wietse Venema
In reply to this post by Victor Duchovni
Victor Duchovni:
> Is it the queue manager that's burning CPU? Nothing too interesting
> here.

Yes, according to this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26926 postfix   20   0  5840 2552 1792 R   43  0.3 276:51.22 qmgr  

There needs to be a safety check for the case that qmgr_job_entry_select()
decides that there is no eligible work, otherwise qmgr_transport_select()
could go into a loop.

        Wietse

Re: Possible reasons for "qmgr" loading the system?

Wietse Venema
In reply to this post by Victor Duchovni
Victor Duchovni:
> > slow_destination_recipient_limit=1
> > slow_destination_concurrency_limit=1

I wonder if the problem recurs when these are changed. But let's
first swap new and old queue managers.

        Wietse

PATCH: Possible reasons for "qmgr" loading the system?

Wietse Venema
In reply to this post by Santiago Romero-2
Santiago Romero:

> Wietse Venema wrote:
> > Santiago Romero:
> >  
> >>  In case it happens again: where should I look? At the OS level (disk or
> >> network I/O, processes...) I didn't see anything unusual before the
> >> "postfix restart"...
> >>    
> >
> > Try ``strace -o filename -p pid'' or the equivalent for your OS.
> >  
>
>  Hi.
>
>  Today happened again in 2 new machines. The last one:
>
>
> top - 09:44:25 up 19:39,  2 users,  load average: 4.68, 4.87, 4.76
> Tasks: 154 total,   6 running, 148 sleeping,   0 stopped,   0 zombie
> Cpu(s): 30.7%us, 49.2%sy,  0.0%ni, 11.7%id,  1.3%wa,  1.0%hi,  6.1%si,  
> 0.0%st
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  
> COMMAND                            
> 26926 postfix   20   0  5840 2552 1792 R   43  0.3 276:51.22 qmgr  
>
>
> The problem had never appeared on those machines until yesterday, when I
> added the following to the postfix configuration:
>
> ####   /etc/postfix/master.cf
> slow     unix  -       -       -       -       -       smtp
>   -o syslog_name=postfix-slow
>
>
> ####   /etc/postfix/main.cf
> # Special "slow" transport:
> slow_destination_recipient_limit=1
> slow_destination_concurrency_limit=1
> slow_destination_rate_delay=5

OK, leave the above settings and see if this helps (Postfix 2.5 or later).

I have not been able to reproduce the problem, but there was
some bogosity with the handling of _destination_rate_delay.

The only reason I know for lots of qmgr CPU usage is when all
mail is being delivered to a "discard" transport. When all mail
is bounced or deferred you'd have lots of disk activity that
causes qmgr to be slowed down.

        Wietse

diff --exclude=man --exclude=html --exclude=README_FILES --exclude=.indent.pro --exclude=Makefile.in -cr src/qmgr/qmgr_entry.c- src/qmgr/qmgr_entry.c
*** src/qmgr/qmgr_entry.c- Fri Dec 14 17:47:21 2007
--- src/qmgr/qmgr_entry.c Wed Mar  4 16:04:21 2009
***************
*** 299,304 ****
--- 299,317 ----
      }
 
      /*
+      * Suspend a rate-limited queue, so that mail trickles out.
+      */
+     if (which == QMGR_QUEUE_BUSY && transport->rate_delay > 0) {
+ if (queue->window > 1)
+    msg_panic("%s: queue %s/%s: window %d > 1 on rate-limited service",
+      myname, transport->name, queue->name, queue->window);
+ if (QMGR_QUEUE_THROTTLED(queue)) /* XXX */
+    qmgr_queue_unthrottle(queue);
+ if (QMGR_QUEUE_READY(queue))
+    qmgr_queue_suspend(queue, transport->rate_delay);
+     }
+
+     /*
       * If the queue was blocking some of the jobs on the job list, check if
       * the concurrency limit has lifted. If there are still some pending
       * deliveries, give it a try and unmark all transport blockers at once.
***************
*** 336,354 ****
       */
      if (which == QMGR_QUEUE_BUSY)
  queue->last_done = event_time();
-
-     /*
-      * Suspend a rate-limited queue, so that mail trickles out.
-      */
-     if (which == QMGR_QUEUE_BUSY && transport->rate_delay > 0) {
- if (queue->window > 1)
-    msg_panic("%s: queue %s/%s: window %d > 1 on rate-limited service",
-      myname, transport->name, queue->name, queue->window);
- if (QMGR_QUEUE_THROTTLED(queue)) /* XXX */
-    qmgr_queue_unthrottle(queue);
- if (QMGR_QUEUE_READY(queue))
-    qmgr_queue_suspend(queue, transport->rate_delay);
-     }
 
      /*
       * When the in-core queue for this site is empty and when this site is
--- 349,354 ----

Re: PATCH: Possible reasons for "qmgr" loading the system?

Santiago Romero-2

 > Wietse Venema wrote:
 > You might want to repeat your precise Postfix version at this point,
 > and which queue manager version is configured in your master.cf.
 > Current Postfix versions have (qmgr=new, oqmgr=old) in master.cf.
 > Older Postfix versions have (nqmgr=new, qmgr=old) instead. The
 > programs are the same except for the job selection algorithm.

root@egeo:~# postconf mail_version
mail_version = 2.5.1

root@egeo:~# grep -i qmgr /etc/postfix/master.cf
qmgr      fifo  n       -       n       300     1       qmgr
#qmgr     fifo  n       -       -       300     1       oqmgr


 > If you are using the new queue manager, it is worthwhile to see if
 > the problem persists when you switch to the old queue manager.

 It seems I'm using the new one...

 > OK, leave the above settings and see if this helps
 > (Postfix 2.5 or later).
 >
 > I have not been able to reproduce the problem, but there was
 > some bogosity with the handling of _destination_rate_delay.
 >
 >
 > diff --exclude=man --exclude=html --exclude=README_FILES
 > --exclude=.indent.pro --exclude=Makefile.in -cr
 > src/qmgr/qmgr_entry.c- src/qmgr/qmgr_entry.c

 Well, I'm using postfix's ubuntu package, so it's not compiled from
source code because I need all my ~=100 Linux machines to be easily
updatable (apt-get update && apt-get upgrade).

 In this case, I'm going to recompile .deb source package including your
patch to see if that solves the problem ...

 Please allow me a couple of days to rebuild and install it (it's a
production system, and I need to agree on a maintenance window with
customers). I'll report on this list whether the problem happens again or
whether the patch seems to fix it. Do you want any additional change /
logging / config to make the "problem" easier to trigger?

 (I mean, setting the rate_ values higher or lower so that the problem
reproduces faster, because 5 days passed between the last 2
times qmgr ate the CPU...).

 Thanks.

--
Santiago Romero



Re: PATCH: Possible reasons for "qmgr" loading the system?

Wietse Venema
Santiago Romero:
>  (I mean, setting the rate_ values higher or lower so that the problem
> reproduces faster, because 5 days passed between the last 2
> times qmgr ate the CPU...).

Just run the same test.

Thanks,

        Wietse

Re: PATCH: Possible reasons for "qmgr" loading the system?

Victor Duchovni
In reply to this post by Santiago Romero-2
On Thu, Mar 05, 2009 at 12:20:06PM +0100, Santiago Romero wrote:

> Well, I'm using postfix's ubuntu package, so it's not compiled from source
> code because I need all my ~=100 Linux machines to be easily updatable
> (apt-get update && apt-get upgrade).
>
> In this case, I'm going to recompile .deb source package including your
> patch to see if that solves the problem ...
>
> Please allow me a couple of days to rebuild and install it (it's a
> production system, and I need to agree on a maintenance window with
> customers). I'll report on this list whether the problem happens again or
> whether the patch seems to fix it. Do you want any additional change /
> logging / config to make the "problem" easier to trigger?

Please wait for an updated patch, we believe we have identified the
cause and reproduced the symptoms (in that order). I have a candidate
patch, but I expect Wietse will send an updated more polished version
in the not too distant future.

The issue found applies only to "rate-limited" transports, if you are
not using such transports, you don't need the patch. The patch ensures
that work done at the completion of a delivery with a "normal" transport
is correctly split between "before suspend" and "after resume".

The original 2.5.x code is correct for "oqmgr", but not for "qmgr"
(aka "nqmgr"), which requires additional internal state adjustments
when destinations are blocked and unblocked.
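
The split between "before suspend" and "after resume" can be sketched
with a toy model in C. Everything below is hypothetical and much
simpler than the real scheduler: the struct, the toy_ function names
and the idea of a single pending flag are assumptions made for
illustration. The only point carried over from the description above is
that a rate-limited queue postpones its scheduling update until the
delay has expired.

```c
#include <assert.h>
#include <stddef.h>

/* Toy rate-limited queue; field names are illustrative only. */
struct toy_rl_queue {
    int suspended;          /* 1 while the rate delay is running */
    int update_pending;     /* scheduling update still owed */
    int updates_done;       /* how often the update actually ran */
};

/* Stand-in for the scheduling decisions made after a delivery. */
static void toy_blocker_update(struct toy_rl_queue *q)
{
    q->update_pending = 0;
    q->updates_done++;
}

/* Called when a delivery completes. */
void toy_entry_done(struct toy_rl_queue *q, int rate_delay)
{
    assert(q != NULL);
    q->update_pending = 1;
    if (rate_delay > 0)
        q->suspended = 1;           /* "before suspend" work ends here */
    if (!q->suspended)
        toy_blocker_update(q);      /* normal transport: decide now */
}

/* Called when the rate delay expires. */
void toy_queue_resume(struct toy_rl_queue *q)
{
    assert(q != NULL);
    q->suspended = 0;
    if (q->update_pending)
        toy_blocker_update(q);      /* "after resume" work */
}
```

For a transport without a rate delay the update runs immediately inside
toy_entry_done(); for a rate-limited one it runs exactly once, in
toy_queue_resume().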

--
        Viktor.


Re: PATCH: Possible reasons for "qmgr" loading the system?

Santiago Romero-2

> Please wait for an updated patch, we believe we have identified the
> cause and reproduced the symptoms (in that order). I have a candidate
> patch, but I expect Wietse will send an updated more polished version
> in the not too distant future.
>  

 Ok, I'll wait for it. I'm going to roll back to "ubuntu packages" (I
already applied the patch and was testing it).

 
> The original 2.5.x code is correct for "oqmgr", but not for "qmgr"
> (aka "nqmgr"), which requires additional internal state adjustments
> when destinations are blocked and unblocked

 I've changed to "oqmgr" in master.cf for the machine that uses that
special "slow" transport. Would I notice any difference in postfix
behaviour because of using "oqmgr" instead of "qmgr" (less performance
or something like that)?

 Thanks.

--
Santiago Romero


Re: PATCH: Possible reasons for "qmgr" loading the system?

Victor Duchovni
On Thu, Mar 05, 2009 at 04:21:01PM +0100, Santiago Romero wrote:

>
>> Please wait for an updated patch, we believe we have identified the
>> cause and reproduced the symptoms (in that order). I have a candidate
>> patch, but I expect Wietse will send an updated more polished version
>> in the not too distant future.
>>  
>
> Ok, I'll wait for it. I'm going to roll back to "ubuntu packages" (I
> already applied the patch and was testing it).
>
>> The original 2.5.x code is correct for "oqmgr", but not for "qmgr"
>> (aka "nqmgr"), which requires additional internal state adjustments
>> when destinations are blocked and unblocked
>
> I've changed to "oqmgr" in master.cf for the machine that uses that special
> "slow" transport. Would I notice any difference in postfix behaviour
> because of using "oqmgr" instead of "qmgr" (less performance or something
> like that)?

With "oqmgr", "list" messages with a lot (multiple thousands to perhaps
hundreds of thousands) of recipients can dominate the queue, and delay
small messages. Also if you don't define "relay_domains" correctly,
on a high-volume border gateway outbound "smtp" traffic can "starve"
inbound "smtp" traffic when both use the same transport, especially
if outbound traffic exhibits high latency.

    - Avoid mixing (very large) "list" mail with regular traffic in
      the same queue with "oqmgr"

    - Avoid delivering inbound/outbound traffic via the same transport.

    - Avoid outbound congestion caused by lack of recipient validation.

--
        Viktor.


Re: PATCH: Possible reasons for "qmgr" loading the system?

Wietse Venema
In reply to this post by Santiago Romero-2
Santiago Romero:
>
> > Please wait for an updated patch, we believe we have identified the
> > cause and reproduced the symptoms (in that order). I have a candidate
> > patch, but I expect Wietse will send an updated more polished version
> > in the not too distant future.
> >  
>
>  Ok, I'll wait for it. I'm going to roll back to "ubuntu packages" (I
> already applied the patch and was testing it).

It will be later today. I don't have much time so I want to have
it really right the first time. Code that is right takes more work
than code that works.

        Wietse

Re: PATCH: Possible reasons for "qmgr" loading the system?

Jerry-124
On Thu, 5 Mar 2009 13:03:11 -0500 (EST)
[hidden email] (Wietse Venema) wrote:

>It will be later today. I don't have much time so I want to have
>it really right the first time. Code that is right takes more work
>than code that works.

Reminds me of a plaque I have in my office.

        There is never enough time to do it right; however,
        there is always enough time to do it over.

                anonymous

--
Gerard
[hidden email]

TO REPORT A PROBLEM see http://www.postfix.org/DEBUG_README.html#mail
TO (UN)SUBSCRIBE see http://www.postfix.org/lists.html

BYTE editors are people who separate the wheat from the chaff, and then
carefully print the chaff.


Re: PATCH: Possible reasons for "qmgr" loading the system?

Wietse Venema
In reply to this post by Wietse Venema
Wietse Venema:

> Santiago Romero:
> >
> > > Please wait for an updated patch, we believe we have identified the
> > > cause and reproduced the symptoms (in that order). I have a candidate
> > > patch, but I expect Wietse will send an updated more polished version
> > > in the not too distant future.
> > >  
> >
> >  Ok, I'll wait for it. I'm going to roll back to "ubuntu packages" (I
> > already applied the patch and was testing it).
>
> It will be later today. I don't have much time so I want to have
> it really right the first time. Code that is right takes more work
> than code that works.

To apply this patch, cd into the Postfix-2.5.* top-level source
directory and execute:

$ patch < thismessage

We were able to reproduce the scheduler looping problem, and it
does not recur with the patched version.

        Wietse

diff -cr /var/tmp/postfix-2.5.6/src/oqmgr/qmgr_transport.c src/oqmgr/qmgr_transport.c
*** /var/tmp/postfix-2.5.6/src/oqmgr/qmgr_transport.c Sun Dec  2 13:13:26 2007
--- src/oqmgr/qmgr_transport.c Thu Mar  5 16:06:43 2009
***************
*** 286,291 ****
--- 286,293 ----
     continue;
  need = xport->pending + 1;
  for (queue = xport->queue_list.next; queue; queue = queue->peers.next) {
+    if (QMGR_QUEUE_READY(queue) == 0)
+ continue;
     if ((need -= MIN(queue->window - queue->busy_refcount,
   queue->todo_refcount)) <= 0) {
  QMGR_LIST_ROTATE(qmgr_transport_list, xport);
diff -cr /var/tmp/postfix-2.5.6/src/qmgr/qmgr.h src/qmgr/qmgr.h
*** /var/tmp/postfix-2.5.6/src/qmgr/qmgr.h Sat Dec  8 11:01:59 2007
--- src/qmgr/qmgr.h Thu Mar  5 16:36:32 2009
***************
*** 436,441 ****
--- 436,442 ----
 
  extern QMGR_ENTRY *qmgr_job_entry_select(QMGR_TRANSPORT *);
  extern QMGR_PEER *qmgr_peer_select(QMGR_JOB *);
+ extern void qmgr_job_blocker_update(QMGR_QUEUE *);
 
  extern QMGR_JOB *qmgr_job_obtain(QMGR_MESSAGE *, QMGR_TRANSPORT *);
  extern void qmgr_job_free(QMGR_JOB *);
diff -cr /var/tmp/postfix-2.5.6/src/qmgr/qmgr_entry.c src/qmgr/qmgr_entry.c
*** /var/tmp/postfix-2.5.6/src/qmgr/qmgr_entry.c Fri Dec 14 17:47:21 2007
--- src/qmgr/qmgr_entry.c Thu Mar  5 16:29:46 2009
***************
*** 299,327 ****
      }
 
      /*
!      * If the queue was blocking some of the jobs on the job list, check if
!      * the concurrency limit has lifted. If there are still some pending
!      * deliveries, give it a try and unmark all transport blockers at once.
!      * The qmgr_job_entry_select() will do the rest. In either case make sure
!      * the queue is not marked as a blocker anymore, with extra handling of
!      * queues which were declared dead.
       *
!      * Note that changing the blocker status also affects the candidate cache.
!      * Most of the cases would be automatically recognized by the current job
!      * change, but we play safe and reset the cache explicitly below.
!      *
!      * Keeping the transport blocker tag odd is an easy way to make sure the tag
!      * never matches jobs that are not explicitly marked as blockers.
       */
!     if (queue->blocker_tag == transport->blocker_tag) {
! if (queue->window > queue->busy_refcount && queue->todo.next != 0) {
!    transport->blocker_tag += 2;
!    transport->job_current = transport->job_list.next;
!    transport->candidate_cache_current = 0;
! }
! if (queue->window > queue->busy_refcount || QMGR_QUEUE_THROTTLED(queue))
!    queue->blocker_tag = 0;
      }
 
      /*
       * When there are no more entries for this peer, discard the peer
--- 299,323 ----
      }
 
      /*
!      * We implement a rate-limited queue by emulating a slow delivery
!      * channel. We insert the artificial delays with qmgr_queue_suspend().
       *
!      * When a queue is suspended, we must postpone any job scheduling decisions
!      * until the queue is resumed. Otherwise, we make those decisions now.
!      * The job scheduling decisions are made by qmgr_job_blocker_update().
       */
!     if (which == QMGR_QUEUE_BUSY && transport->rate_delay > 0) {
! if (queue->window > 1)
!    msg_panic("%s: queue %s/%s: window %d > 1 on rate-limited service",
!      myname, transport->name, queue->name, queue->window);
! if (QMGR_QUEUE_THROTTLED(queue)) /* XXX */
!    qmgr_queue_unthrottle(queue);
! if (QMGR_QUEUE_READY(queue))
!    qmgr_queue_suspend(queue, transport->rate_delay);
      }
+     if (!QMGR_QUEUE_SUSPENDED(queue)
+ && queue->blocker_tag == transport->blocker_tag)
+ qmgr_job_blocker_update(queue);
 
      /*
       * When there are no more entries for this peer, discard the peer
***************
*** 336,354 ****
       */
      if (which == QMGR_QUEUE_BUSY)
  queue->last_done = event_time();
-
-     /*
-      * Suspend a rate-limited queue, so that mail trickles out.
-      */
-     if (which == QMGR_QUEUE_BUSY && transport->rate_delay > 0) {
- if (queue->window > 1)
-    msg_panic("%s: queue %s/%s: window %d > 1 on rate-limited service",
-      myname, transport->name, queue->name, queue->window);
- if (QMGR_QUEUE_THROTTLED(queue)) /* XXX */
-    qmgr_queue_unthrottle(queue);
- if (QMGR_QUEUE_READY(queue))
-    qmgr_queue_suspend(queue, transport->rate_delay);
-     }
 
      /*
       * When the in-core queue for this site is empty and when this site is
--- 332,337 ----
diff -cr /var/tmp/postfix-2.5.6/src/qmgr/qmgr_job.c src/qmgr/qmgr_job.c
*** /var/tmp/postfix-2.5.6/src/qmgr/qmgr_job.c Tue Nov  7 11:34:07 2006
--- src/qmgr/qmgr_job.c Thu Mar  5 16:43:36 2009
***************
*** 18,23 ****
--- 18,26 ----
  /*
  /* QMGR_ENTRY *qmgr_job_entry_select(transport)
  /* QMGR_TRANSPORT *transport;
+ /*
+ /* void qmgr_job_blocker_update(queue)
+ /* QMGR_QUEUE *queue;
  /* DESCRIPTION
  /* These routines add/delete/manipulate per-transport jobs.
  /* Each job corresponds to a specific transport and message.
***************
*** 38,43 ****
--- 41,51 ----
  /* If necessary, an attempt to read more recipients into core is made.
  /* This can result in creation of more job, queue and entry structures.
  /*
+ /* qmgr_job_blocker_update() updates the status of blocked
+ /* jobs after a decrease in the queue's concurrency level,
+ /* after the queue is throttled, or after the queue is resumed
+ /* from suspension.
+ /*
  /* qmgr_job_move_limits() takes care of proper distribution of the
  /* per-transport recipients limit among the per-transport jobs.
  /* Should be called whenever a job's recipient slot becomes available.
***************
*** 937,939 ****
--- 945,980 ----
      transport->job_current = 0;
      return (0);
  }
+
+ /* qmgr_job_blocker_update - update "blocked job" status */
+
+ void     qmgr_job_blocker_update(QMGR_QUEUE *queue)
+ {
+     QMGR_TRANSPORT *transport = queue->transport;
+
+     /*
+      * If the queue was blocking some of the jobs on the job list, check if
+      * the concurrency limit has lifted. If there are still some pending
+      * deliveries, give it a try and unmark all transport blockers at once.
+      * The qmgr_job_entry_select() will do the rest. In either case make sure
+      * the queue is not marked as a blocker anymore, with extra handling of
+      * queues which were declared dead.
+      *
+      * Note that changing the blocker status also affects the candidate cache.
+      * Most of the cases would be automatically recognized by the current job
+      * change, but we play safe and reset the cache explicitly below.
+      *
+      * Keeping the transport blocker tag odd is an easy way to make sure the tag
+      * never matches jobs that are not explicitly marked as blockers.
+      */
+     if (queue->blocker_tag == transport->blocker_tag) {
+ if (queue->window > queue->busy_refcount && queue->todo.next != 0) {
+    transport->blocker_tag += 2;
+    transport->job_current = transport->job_list.next;
+    transport->candidate_cache_current = 0;
+ }
+ if (queue->window > queue->busy_refcount || QMGR_QUEUE_THROTTLED(queue))
+    queue->blocker_tag = 0;
+     }
+ }
+
diff -cr /var/tmp/postfix-2.5.6/src/qmgr/qmgr_queue.c src/qmgr/qmgr_queue.c
*** /var/tmp/postfix-2.5.6/src/qmgr/qmgr_queue.c Sat Dec  8 09:59:34 2007
--- src/qmgr/qmgr_queue.c Thu Mar  5 17:35:24 2009
***************
*** 66,72 ****
  /* "slow open" mode, and eliminates the "thundering herd" problem.
  /*
  /* qmgr_queue_suspend() suspends delivery for this destination
! /* briefly.
  /* DIAGNOSTICS
  /* Panic: consistency check failure.
  /* LICENSE
--- 66,76 ----
  /* "slow open" mode, and eliminates the "thundering herd" problem.
  /*
  /* qmgr_queue_suspend() suspends delivery for this destination
! /* briefly. This function invalidates any scheduling decisions
! /* that are based on the present queue's concurrency window.
! /* To compensate for work skipped by qmgr_entry_done(), the
! /* status of blocker jobs is re-evaluated after the queue is
! /* resumed.
  /* DIAGNOSTICS
  /* Panic: consistency check failure.
  /* LICENSE
***************
*** 152,160 ****
--- 156,175 ----
      /*
       * Every event handler that leaves a queue in the "ready" state should
       * remove the queue when it is empty.
+      *
+      * XXX Do not omit the redundant test below. It is here to simplify code
+      * consistency checks. The check is trivially eliminated by the compiler
+      * optimizer. There is no need to sacrifice code clarity for the sake of
+      * performance.
+      *
+      * XXX Do not expose the blocker job logic here. Rate-limited queues are not
+      * a performance-critical feature. Here, too, there is no need to sacrifice
+      * code clarity for the sake of performance.
       */
      if (QMGR_QUEUE_READY(queue) && queue->todo.next == 0 && queue->busy.next == 0)
  qmgr_queue_done(queue);
+     else
+ qmgr_job_blocker_update(queue);
  }
 
  /* qmgr_queue_suspend - briefly suspend a destination */
diff -cr /var/tmp/postfix-2.5.6/src/qmgr/qmgr_transport.c src/qmgr/qmgr_transport.c
*** /var/tmp/postfix-2.5.6/src/qmgr/qmgr_transport.c Sun Dec  2 12:53:17 2007
--- src/qmgr/qmgr_transport.c Thu Mar  5 15:08:44 2009
***************
*** 291,296 ****
--- 291,298 ----
     continue;
  need = xport->pending + 1;
  for (queue = xport->queue_list.next; queue; queue = queue->peers.next) {
+    if (QMGR_QUEUE_READY(queue) == 0)
+ continue;
    if ((need -= MIN(queue->window - queue->busy_refcount,
   queue->todo_refcount)) <= 0) {
  QMGR_LIST_ROTATE(qmgr_transport_list, xport, peers);
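
The blocker-tag bookkeeping that the patch moves into
qmgr_job_blocker_update() can be illustrated with a minimal,
self-contained sketch. The Transport and Queue structs, their fields,
and every function name below are hypothetical simplifications and are
not the Postfix data structures. The throttled-queue branch is left
out, and suspension is reduced to a flag set by the caller. The sketch
only shows two ideas from the patch comments: the transport tag stays
odd so it never matches an unmarked queue (tag 0), and the blocker
update is skipped while a queue is suspended and redone on resume.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for QMGR_TRANSPORT and QMGR_QUEUE. */
typedef struct {
    unsigned blocker_tag;       /* always odd; +2 unmarks every blocker */
} Transport;

typedef struct {
    Transport *transport;
    unsigned blocker_tag;       /* 0 = not a blocker, else transport tag */
    int     window;             /* concurrency limit */
    int     busy;               /* deliveries in progress */
    int     todo;               /* deliveries still waiting */
    bool    suspended;          /* true while a rate delay is in effect */
} Queue;

void    transport_init(Transport *t)
{
    t->blocker_tag = 1;         /* odd, so it can never equal 0 */
}

void    queue_init(Queue *q, Transport *t, int window)
{
    q->transport = t;
    q->blocker_tag = 0;
    q->window = window;
    q->busy = 0;
    q->todo = 0;
    q->suspended = false;
}

/* Mark the queue as blocking jobs on its transport. */
void    queue_mark_blocker(Queue *q)
{
    q->blocker_tag = q->transport->blocker_tag;
}

bool    queue_is_blocker(const Queue *q)
{
    return (q->blocker_tag == q->transport->blocker_tag);
}

/* Re-evaluate blocker status once the concurrency limit may have lifted. */
void    blocker_update(Queue *q)
{
    Transport *t = q->transport;

    assert(t->blocker_tag & 1); /* invariant: transport tag is odd */
    if (q->blocker_tag != t->blocker_tag)
        return;
    if (q->window > q->busy && q->todo > 0)
        t->blocker_tag += 2;    /* all previously marked queues stop matching */
    if (q->window > q->busy)
        q->blocker_tag = 0;
}

/* A delivery finished; postpone the scheduling decision if suspended. */
void    entry_done(Queue *q)
{
    q->busy--;
    if (!q->suspended)
        blocker_update(q);
}

/* The rate delay expired; make the postponed decision now. */
void    queue_resume(Queue *q)
{
    q->suspended = false;
    blocker_update(q);
}
```

In this sketch, bumping the transport tag by 2 is a constant-time way
to unmark every blocker at once, and calling blocker_update() from
queue_resume() compensates for the work that entry_done() skipped
while the queue was suspended.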

Re: PATCH: Possible reasons for "qmgr" loading the system?

Santiago Romero-2

> To apply this patch, cd into the Postfix-2.5.* top-level source
> directory and execute:
>
> $ patch < thismessage
>
> We were able to reproduce the scheduler looping problem, and it
> does not recur with the patched version

 A question: how does this patch get included in the Ubuntu Server
"postfix" packages?

 I mean, should I submit your message and patch to the package
maintainers of Ubuntu / Debian / Red Hat so that new "postfix" packages
with the bug fixed are released as updates for users?

 Or do you just publish the patch somewhere, and the package maintainers
then update their sources without you or us needing to contact them?

 I can patch the Postfix sources myself, but then I lose the Ubuntu
package security updates and would have to maintain Postfix from source
from now on. The best outcome would be for your patch to be integrated
into Postfix and for the package maintainers to release new "security"
postfix packages, but I don't know how to make that happen.

 Thanks.

--
Santiago Romero



Re: PATCH: Possible reasons for "qmgr" loading the system?

Jerry-124
On Fri, 06 Mar 2009 10:07:26 +0100
Santiago Romero <[hidden email]> wrote:

> A question: how does this patch get included in the Ubuntu Server
> "postfix" packages?
>
> I mean, should I submit your message and patch to the package
> maintainers of Ubuntu / Debian / Red Hat so that new "postfix" packages
> with the bug fixed are released as updates for users?
>
> Or do you just publish the patch somewhere, and the package maintainers
> then update their sources without you or us needing to contact them?
>
> I can patch the Postfix sources myself, but then I lose the Ubuntu
> package security updates and would have to maintain Postfix from source
> from now on. The best outcome would be for your patch to be integrated
> into Postfix and for the package maintainers to release new "security"
> postfix packages, but I don't know how to make that happen.

In a perfect world, the program maintainers would know about the patch
and take steps to correct their package/port or whatever. You might
want to contact the maintainer of Postfix for your Distro and see if
they are planning on updating the package/port. Usually, they do get a
little annoyed if you start bugging them 5 seconds after the patch is
released. Some of them actually have day jobs.

--
Gerard
[hidden email]

TO REPORT A PROBLEM see http://www.postfix.org/DEBUG_README.html#mail
TO (UN)SUBSCRIBE see http://www.postfix.org/lists.html

Cheese -- milk's leap toward immortality.

        Clifton Fadiman, "Any Number Can Play"


Re: PATCH: Possible reasons for "qmgr" loading the system?

Santiago Romero-2
Gerard wrote:
> In a perfect world, the program maintainers would know about the patch
> and take steps to correct their package/port or whatever. You might
> want to contact the maintainer of Postfix for your Distro and see if
> they are planning on updating the package/port. Usually, they do get a
> little annoyed if you start bugging them 5 seconds after the patch is
> released. Some of them actually have day jobs

 Well, I'm not planning to bug them about the patch. I just don't know
whether patches are integrated into the current package versions
automatically, or whether the author or the people who found the bug
must (or should) notify the package maintainers.

 That's what I was asking: whether the process is automatic, or whether
I should notify them or help in some way.

--
Santiago Romero



Re: PATCH: Possible reasons for "qmgr" loading the system?

Wietse Venema
Santiago Romero:

> > To apply this patch, cd into the Postfix-2.5.* top-level source
> > directory and execute:
> >
> > $ patch < thismessage
> >
> > We were able to reproduce the scheduler looping problem, and it
> > does not recur with the patched version
>
>  A question: how does this patch get included in the Ubuntu Server
> "postfix" packages?

I will release this as part of Postfix 2.5.7.

Meanwhile, you can use oqmgr, and it will in all likelihood
perform just as well.
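
A minimal sketch of that switch, assuming a stock Postfix 2.5
master.cf (the file location and the exact column values on a given
system are assumptions and may differ): change only the last field of
the qmgr service entry from qmgr to oqmgr, then restart Postfix.

```
# /etc/postfix/master.cf (assumed path)
# service type  private unpriv  chroot  wakeup  maxproc command
#qmgr     fifo  n       -       n       300     1       qmgr
qmgr      fifo  n       -       n       300     1       oqmgr
```

Because only the command column changes, switching back to qmgr once
a fixed package is installed is a one-line edit.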

>  I mean, should I submit your message and patch to the package
> maintainers of Ubuntu / Debian / Red Hat so that new "postfix" packages
> with the bug fixed are released as updates for users?
>
>  Or do you just publish the patch somewhere, and the package maintainers
> then update their sources without you or us needing to contact them?
>
>  I can patch the Postfix sources myself, but then I lose the Ubuntu
> package security updates and would have to maintain Postfix from source
> from now on. The best outcome would be for your patch to be integrated
> into Postfix and for the package maintainers to release new "security"
> postfix packages, but I don't know how to make that happen.

I have no control over vendors and distributors.

        Wietse