Timeout after DATA

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Timeout after DATA

Petr Janda-3
Hi all,
I have got reports about lost mail(not received, im the receiver not the
sender) recently and trying to find out whats going on seems to be beyond me.

Basically a lot of email is lost with "timeout after DATA"

For example:
timeout after DATA (0 bytes) from mail.securepay.com.au[203.89.212.166]

. Supposedly the problem here is that the sending machine has got a firewall
in front of it thats blocking ICMP MUST FRAGMENT. I somewhat could verify
this by trying to ping those machines and indeed pinging them does not
work(so at least ICMP ECHO is blocked).
On the Postfix website it suggests lowering MTU, so I lowered it from 1500 to
1000, but this did not improve the situation at all. Some suggest disabling
PIPELINING, so i did that but it didnt work either.

I took a tcpdump of one of these and attached it to this email. (daria is the
name of the mail server)

Just a note about my setup: The server is running behind a Cisco ADSL Router
that connects to a Cisco switch and the server connects to the switch.

Any Postfix gurus here that could help me figure this problem out?

Thanks,
Petr

postfix_dump.tgz (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Victor Duchovni
On Sat, Nov 15, 2008 at 09:14:07AM +1100, Petr Janda wrote:

> Hi all,
> I have got reports about lost mail(not received, im the receiver not the
> sender) recently and trying to find out whats going on seems to be beyond me.
>
> Basically a lot of email is lost with "timeout after DATA"
>
> For example:
> timeout after DATA (0 bytes) from mail.securepay.com.au[203.89.212.166]
>
> . Supposedly the problem here is that the sending machine has got a firewall
> in front of it thats blocking ICMP MUST FRAGMENT. I somewhat could verify
> this by trying to ping those machines and indeed pinging them does not
> work(so at least ICMP ECHO is blocked).

Or botching window scaling, or getting confused by selectiv ACKs, or ...
Consider disabling window scaling support (not just setting the default
scale to zero). Consider capturing the packet stream outside your
outermost firewall.

--
        Viktor.

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the "Reply-To" header.

To unsubscribe from the postfix-users list, visit
http://www.postfix.org/lists.html or click the link below:
<mailto:[hidden email]?body=unsubscribe%20postfix-users>

If my response solves your problem, the best way to thank me is to not
send an "it worked, thanks" follow-up. If you must respond, please put
"It worked, thanks" in the "Subject" so I can delete these quickly.
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Wietse Venema
Victor Duchovni:

> On Sat, Nov 15, 2008 at 09:14:07AM +1100, Petr Janda wrote:
>
> > Hi all,
> > I have got reports about lost mail(not received, im the receiver not the
> > sender) recently and trying to find out whats going on seems to be beyond me.
> >
> > Basically a lot of email is lost with "timeout after DATA"
> >
> > For example:
> > timeout after DATA (0 bytes) from mail.securepay.com.au[203.89.212.166]
> >
> > . Supposedly the problem here is that the sending machine has got a firewall
> > in front of it thats blocking ICMP MUST FRAGMENT. I somewhat could verify
> > this by trying to ping those machines and indeed pinging them does not
> > work(so at least ICMP ECHO is blocked).
>
> Or botching window scaling, or getting confused by selectiv ACKs, or ...
> Consider disabling window scaling support (not just setting the default
> scale to zero). Consider capturing the packet stream outside your
> outermost firewall.

The Postfix wishlist has an entry to force window scaling off
(by requesting a small TCP send buffer before creating the
listener endpoint). If this works without creating more trouble
than it solves, then I might make it available as a patch for
legacy releases, just like the stress-adaptive behavior patch.

Until then, sysctl is your friend.

*BSD: sysctl -w net.inet.tcp.sack.enable=0
L*n*x: sysctl -w net.ipv4.tcp_sack=0 (and I suppose something
        equivalent if you use Linux IPv6 support).

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Petr Janda-3
> Until then, sysctl is your friend.
>
> *BSD: sysctl -w net.inet.tcp.sack.enable=0
> L*n*x: sysctl -w net.ipv4.tcp_sack=0 (and I suppose something
> equivalent if you use Linux IPv6 support).
>
> Wietse

Thanks for your suggestions, Ive had both SACK and Window Scaling
turned off for the last hour but unfortunately i dont think its made a
difference either. I think ive exhausted all possible options. I dont
know how would I get a tcpdump upstream from the router. Its not a
failing NIC either, because the same problem occurs on both our
mailservers.

Welcome to more suggestions, before I result to the final working
solution: force the stupid admins to allow ICMP traffic with a shotgun
:)

Petr
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Victor Duchovni
In reply to this post by Wietse Venema
On Fri, Nov 14, 2008 at 06:14:33PM -0500, Wietse Venema wrote:

> Victor Duchovni:
> > On Sat, Nov 15, 2008 at 09:14:07AM +1100, Petr Janda wrote:
> >
> > > Hi all,
> > > I have got reports about lost mail(not received, im the receiver not the
> > > sender) recently and trying to find out whats going on seems to be beyond me.
> > >
> > > Basically a lot of email is lost with "timeout after DATA"
> > >
> > > For example:
> > > timeout after DATA (0 bytes) from mail.securepay.com.au[203.89.212.166]
> > >
> > > . Supposedly the problem here is that the sending machine has got a firewall
> > > in front of it thats blocking ICMP MUST FRAGMENT. I somewhat could verify
> > > this by trying to ping those machines and indeed pinging them does not
> > > work(so at least ICMP ECHO is blocked).
> >
> > Or botching window scaling, or getting confused by selectiv ACKs, or ...
> > Consider disabling window scaling support (not just setting the default
> > scale to zero). Consider capturing the packet stream outside your
> > outermost firewall.
>
> The Postfix wishlist has an entry to force window scaling off
> (by requesting a small TCP send buffer before creating the
> listener endpoint). If this works without creating more trouble
> than it solves, then I might make it available as a patch for
> legacy releases, just like the stress-adaptive behavior patch.
>

In the OPs packet dump, the window scale on the receiving system is 2^0,
but the sending system uses 2^7. Not much we can do about that from the
Postfix side...

--
        Viktor.

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the "Reply-To" header.

To unsubscribe from the postfix-users list, visit
http://www.postfix.org/lists.html or click the link below:
<mailto:[hidden email]?body=unsubscribe%20postfix-users>

If my response solves your problem, the best way to thank me is to not
send an "it worked, thanks" follow-up. If you must respond, please put
"It worked, thanks" in the "Subject" so I can delete these quickly.
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Victor Duchovni
In reply to this post by Petr Janda-3
On Sat, Nov 15, 2008 at 01:01:24PM +1100, Petr Janda wrote:

> Welcome to more suggestions, before I result to the final working
> solution: force the stupid admins to allow ICMP traffic with a shotgun
> :)

I would have done that first... There is a reason why ICMP is part of
the IP protocol suite, and blocking it all for "security" reasons is
rather lame. Some ICMP must be allowed to get through.

--
        Viktor.

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the "Reply-To" header.

To unsubscribe from the postfix-users list, visit
http://www.postfix.org/lists.html or click the link below:
<mailto:[hidden email]?body=unsubscribe%20postfix-users>

If my response solves your problem, the best way to thank me is to not
send an "it worked, thanks" follow-up. If you must respond, please put
"It worked, thanks" in the "Subject" so I can delete these quickly.
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Petr Janda-3
In reply to this post by Victor Duchovni
> In the OPs packet dump, the window scale on the receiving system is 2^0,
> but the sending system uses 2^7. Not much we can do about that from the
> Postfix side...
>

So what is the problem here? Should they be same? What can be done
about this? I dont mean as in you but in general how can we get them
to agree?

Thanks,
Petr
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Petr Janda-3
Hello,
Ive taken another tcpdump, this time with options: window scaling and
SACK disabled.

Please let me know what you think.

Thanks,
Petr

postfix_dump2.tgz (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Wietse Venema
Petr Janda:
> Hello,
> Ive taken another tcpdump, this time with options: window scaling and
> SACK disabled.
>
> Please let me know what you think.

After Postfix replies with "354 End data with <CR><LF>.<CR><LF>"
the client sends something that appears to start in the middle of
an email message.

MsoNormal align=3Dcenter style=3D'text-align:center'><font =<CR><LF>
size=3D3<CR><LF>
face=3D"Times New Roman"><span =<CR><LF>
...

What happens in reality is that a whole chunk of packets is missing.

First we have the TCP three-way handshake:

    04:48:34.741647 203.16.214.214.41558 > 202.76.131.108.25: S
    4022238683:4022238683(0) win 49640 <mss 1460,nop,wscale
    0,nop,nop,sackOK> (DF)

    04:48:34.741733 202.76.131.108.25 > 203.16.214.214.41558: S
    1380291408:1380291408(0) ack 4022238684 win 57344 <mss 1460>
    (DF)

    04:48:34.761488 203.16.214.214.41558 > 202.76.131.108.25: .
    ack 1 win 49640 (DF)

Then we have the SMTP handshake:

    04:48:34.802373 202.76.131.108.25 > 203.16.214.214.41558: P
    1:41(40) ack 1 win 58400 (DF)

        220 daria.webgate.net.au ESMTP Postfix

    04:48:34.822571 203.16.214.214.41558 > 202.76.131.108.25: .
    ack 41 win 49640 (DF)

    04:48:34.824326 203.16.214.214.41558 > 202.76.131.108.25: P
    1:30(29) ack 41 win 49640 (DF)

        EHLO lists.internode.on.net

    04:48:34.824534 202.76.131.108.25 > 203.16.214.214.41558: P
    41:204(163) ack 30 win 58400 (DF)

        250-daria.webgate.net.au
        250-PIPELINING
        250-SIZE 10240000
        250-ETRN
        250-AUTH PLAIN LOGIN
        250-AUTH=PLAIN LOGIN
        250-ENHANCEDSTATUSCODES
        250-8BITMIME
        250 DSN

    04:48:34.845072 203.16.214.214.41558 > 202.76.131.108.25: .
    ack 204 win 49640 (DF)

    04:48:34.847556 203.16.214.214.41558 > 202.76.131.108.25: P
    30:91(61) ack 204 win 49640 (DF)

        MAIL From:<[hidden email]> SIZE=44357

    04:48:34.853278 202.76.131.108.25 > 203.16.214.214.41558: P
    204:218(14) ack 91 win 58400 (DF)

        250 2.1.0 Ok

    04:48:34.875003 203.16.214.214.41558 > 202.76.131.108.25: P
    91:145(54) ack 218 win 49640 (DF)

        RCPT To: <[hidden email]>
        DATA

    04:48:34.915642 202.76.131.108.25 > 203.16.214.214.41558: P
    218:269(51) ack 145 win 58400 (DF)

        250 2.1.5 Ok
        354 End data with <CR><LF>.<CR><LF>

    04:48:34.952747 203.16.214.214.41558 > 202.76.131.108.25: P
    7445:8337(892) ack 269 win 49640 (DF)

        MsoNormal align=3Dcenter style=3D'text-align:center'><font =
        size=3D3
        face=3D"Times New Roman"><span =
        style=3D'font-size:12.0pt'>&nbsp;</span></font></p>
        ....and so on....

Note the jump in byte offsets. The RCPT+DATA packet ends at offset
145, but the "MsoNormal..." is at offset 7445.

What happened with the packets between TCP offsets 145 and 7445?

    04:48:34.952792 202.76.131.108.25 > 203.16.214.214.41558: .
    ack 145 win 58400 (DF)

The SMTP server's TCP stack says that it has not yet received data
beginning at byte offset 145.

Something is badly screwing up TCP, perhaps by throwing away packets
with flags that it does not like.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Petr Janda-3
> Something is badly screwing up TCP, perhaps by throwing away packets
> with flags that it does not like.

A misconfigured firewall? Seems unlikely as this timeout problem
really happens a lot. Im also going to have a look at the Cisco ADSL
router, maybe try replacing it with another one to make sure. Does
anyone know commands that would allow me to do something like a
tcpdump on a Cisco router? Especially Id like to see if this is ICMP
related at all.

I really have to localize this problem as soon as possible.

Thanks,
Petr
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Wietse Venema
Petr Janda:

> > Something is badly screwing up TCP, perhaps by throwing away packets
> > with flags that it does not like.
>
> A misconfigured firewall? Seems unlikely as this timeout problem
> really happens a lot. Im also going to have a look at the Cisco ADSL
> router, maybe try replacing it with another one to make sure. Does
> anyone know commands that would allow me to do something like a
> tcpdump on a Cisco router? Especially Id like to see if this is ICMP
> related at all.
>
> I really have to localize this problem as soon as possible.

Good luck. I am not going to speculate, and will limit myself to
what is observable.

One odd thing is that TCP data arrives intact until offset 145,
then the next data to arrive is is a non-fragmented packet with
890 bytes starting at offset 7445, so we are missing 7300 bytes.

These 890 bytes are not the end of the message, so I would expect
to see 1460 bytes (the maximal segment size that is agreed upon in
the TCP initial handhake) instead of 890.

If there is a traffic shaper at your end, it may replace your
TCP stack's MSS=1460 announcement by something smaller, like 890.

If you dump the CISCO configuration maybe you'll see that 890 number.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Petr Janda-3
> If there is a traffic shaper at your end, it may replace your
> TCP stack's MSS=1460 announcement by something smaller, like 890.

Could this also be caused by the ISP?

Petr
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Sahil Tandon
Petr Janda <[hidden email]> wrote:

> > If there is a traffic shaper at your end, it may replace your
> > TCP stack's MSS=1460 announcement by something smaller, like 890.
>
> Could this also be caused by the ISP?

It could be caused by some router that sits and passes packets
between the sending and receiving machine.

--
Sahil Tandon <[hidden email]>
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Wietse Venema
In reply to this post by Petr Janda-3
Petr Janda:
> > If there is a traffic shaper at your end, it may replace your
> > TCP stack's MSS=1460 announcement by something smaller, like 890.
>
> Could this also be caused by the ISP?

Something is throwing away the first 7300 bytes of the email message.

Depending on how pervasive this behavior is, it happens at your
end, at your ISP, in the middle, at the remote ISP, or at the other
end.  I'll send a 200k email message to postmaster@ and see what's
up.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Wietse Venema
Wietse Venema:

> Petr Janda:
> > > If there is a traffic shaper at your end, it may replace your
> > > TCP stack's MSS=1460 announcement by something smaller, like 890.
> >
> > Could this also be caused by the ISP?
>
> Something is throwing away the first 7300 bytes of the email message.
>
> Depending on how pervasive this behavior is, it happens at your
> end, at your ISP, in the middle, at the remote ISP, or at the other
> end.  I'll send a 200k email message to postmaster@ and see what's
> up.

Well, mail to postmaster@ bounced (user unknown) but the TCP
handshake shows clear signs of tampering:

    19:35:14.389740 168.100.189.2.3865 > 202.76.131.108.25: S
    2828872390:2828872390( 0) win 16384 <mss 1460> (DF)

    19:35:14.678519 202.76.131.108.25 > 168.100.189.2.3865: S
    409513452:409513452(0) ack 2828872391 win 57344 <mss 760>

    19:35:14.678556 168.100.189.2.3865 > 202.76.131.108.25: . ack 1
    win 16720 (DF)

Can you verify that your machine really announces a MSS of 1460?

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Petr Janda-3
> Can you verify that your machine really announces a MSS of 1460?

Actually you caught me while tempering with the MTU, Ive had it set to
800 for testing and some of the lost mail started getting through, but
with a setting this low IMAP authentication stopped working
altogether. Its back to 1500. Please try again.

Thanks,
Petr
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Wietse Venema
Petr Janda:
> > Can you verify that your machine really announces a MSS of 1460?
>
> Actually you caught me while tempering with the MTU, Ive had it set to
> 800 for testing and some of the lost mail started getting through, but
> with a setting this low IMAP authentication stopped working
> altogether. Its back to 1500. Please try again.

Your machine now announces MSS=1460, which is what I would expect
(1460 = 1500-40, where 40 is the cost of the 20-byte IP and TCP
headers).

I'm going to take a break for the rest of the day.

If this packet loss problem is only happening with a very small
number of sites, then poking from my end is unlikely to reveal
anything useful.

Anyway, you can run tests from gmail and confirm for yourself
if the problem is epidemic.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Petr Janda-3
In reply to this post by Petr Janda-3
by the way postmaster@ wont work. we dont set them up as the email
addresses are stored in ldap and its just a hassle to create an extra
postmaster@ address for each domain we host. if you want maybe try
[hidden email]

Petr

On 11/16/08, Petr Janda <[hidden email]> wrote:

>> Can you verify that your machine really announces a MSS of 1460?
>
> Actually you caught me while tempering with the MTU, Ive had it set to
> 800 for testing and some of the lost mail started getting through, but
> with a setting this low IMAP authentication stopped working
> altogether. Its back to 1500. Please try again.
>
> Thanks,
> Petr
>
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

Barney Desmond
I can't add much to the already-fairly-technical discussion so far, but
I do have a couple of thoughts.

* You mentioned being able to replace the Cisco router with another one.
Could you try a non-Cisco one? Some Cisco gear is infamous for silently
munging SMTP transactions in the name of "security".

* If the internet-facing side of the router is plain ethernet (eg.
connected to your ADSL modem), you could attempt to run a packet capture
there.

* If you're having consistent problems with that one client (or others
as well), you could contact your ISP and ask them to do some packet
capture for you. It might be hard to get through to someone sufficiently
technical though.

* Noone's pointed out the your first packet capture also exhibits the
same "missing data" problem. After the client sends RCPT TO and you
respond with an Ok, the next thing it drops on the wire is "Received:
from srv1.shoppingsquare.com.au" in frame 12. I'm not that confident in
my packet capture-reading skills, but I'm not seeing a discontinuity in
the sequence numbers that Wietse saw in the second capture.

I've poked your server a bit and it appears to have zero tolerance for
errors. Your server should have responded immediately with "221 2.7.0
Error: I can break rules, too. Goodbye.". Instead it just ACK'ed the
packet and kept waiting, this seems odd.

I'm thinking you should contact the other end and ask them what
networking hardware they've got that might be messing things up. At the
same time, tell them to allow some useful ICMP. Their firewall might
well be blocking ICMP Must Fragment, I also see they're dropping
fragmented packets entirely.



signature.asc (258 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Timeout after DATA

DJ Lucas-2
Barney Desmond wrote:

> * Noone's pointed out the your first packet capture also exhibits the
> same "missing data" problem. After the client sends RCPT TO and you
> respond with an Ok, the next thing it drops on the wire is "Received:
> from srv1.shoppingsquare.com.au" in frame 12. I'm not that confident in
> my packet capture-reading skills, but I'm not seeing a discontinuity in
> the sequence numbers that Wietse saw in the second capture.
>
> I've poked your server a bit and it appears to have zero tolerance for
> errors. Your server should have responded immediately with "221 2.7.0
> Error: I can break rules, too. Goodbye.". Instead it just ACK'ed the
> packet and kept waiting, this seems odd.
>  
All of the above is _way_ over my head, but I thought I'd throw out the
obvious suspect anyway.  Silly oddities like above have the stench of
Cisco's attempt to 'fix' things.  I've seen similar problems when 'SMTP
Fixup' is enabled (random sites dropping email because of 'timeout
waiting for data').  Unfortunately, these usually are small sites with
Exchange, in which case it is very easy to identify as the connect
string doesn't show the name, it will instead show '******* 220 ******
blah'.  This has been on by default in every piece of Cisco equipment
that I've messed with (admittedly not many), did you try and disable it yet?

-- DJ Lucas


--
This message has been scanned for viruses and
dangerous content, and is believed to be clean.

12