AWS timeout

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

AWS timeout

fhare
Hello list,

Bit of a weird one here. I have hosts at AWS sending mail across a
Checkpoint VPN to my main private relay server (it basically serves to relay
mail to O365 for in house applications). The problem is that the sending
client never receives BYE from server after QUIT. The mail goes through and
is delivered ok. This is bad because our timeout is 300s and if you have
anything more than a small amount of mail to send, your connections waiting
to timeout build up at the client and cause problems with applications. Mail
from non-AWS sources does not have this problem across other legs of our
Checkpoint VPN.

I have done packet captures at source, destination, and the two Checkpoint
FWs which are the VPN endpoints. I can clearly see what appears to be
filtering occurring on the QUIT, but I am really struggling to determine
where said filtering is coming from. I can recreate issue with a simple
mailx command, originally it was reported from a java mail client (so
multiple clients exhibit same issue).

And before you ask, we have already had AWS support remove the SMTP
throttling for this host. We have also looked at Checkpoint logging and can
find no evidence it is being throttled at those devices.

Source host is Centos 7. Destination host is Centos 6. Both have latest
kernel and patches for everything. Postfix version is
postfix-2.6.6-8.el6.x86_64.

Thanks for any suggestions.

Postfix -n below
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
data_directory = /var/lib/postfix
debug_peer_level = 2
html_directory = no
inet_interfaces = all
inet_protocols = all
mail_owner = postfix
mailbox_size_limit = 78643200
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
message_size_limit = 78643200
mydestination =
$myhostname,localhost.$mydomain,localhost,idcsmtp2.$mydomain,idcsmtp5.$mydomain,relay.$mydomain
mynetworks = 10.0.0.0/8, 198.112.99.0/24, ***.***.***.***/21
mynetworks_style = class
myorigin = $myhostname
newaliases_path = /usr/bin/newaliases.postfix
notify_classes = bounce, 2bounce, delay, protocol, resource, software
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.6.6/README_FILES
relay_domains = $mydestination, $mydomain, example.com, example.com
relayhost = [example.protection.outlook.com]
sample_directory = /usr/share/doc/postfix-2.6.6/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
unknown_local_recipient_reject_code = 550

source pcap
https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ETWH_87w1AhCtSdpgzeoNwYBitZBm4HecmF0WBR2RHKOfA?e=Ln03Cq 

source Checkpoint VPN endpoint pcap
https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ET3RjKJ_3xdOuvog3fPe8xQBUUUK8qQ7VdOSsnA6oiK4yw?e=vADRze 

destination Checkpoint VPN endpoint pcap
https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EdGaYYr_pzBNiKkji2kb034BvBv3LJ6ooMRjXz-2ddp9NA?e=iDaBr0 

destination postfix pcap
https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EUKXDSXYitxArxc4EeVIELIBVrgufBDnYGIZ3K9BC6GCAQ?e=FnUDNM



--
Sent from: http://postfix.1071664.n5.nabble.com/Postfix-Users-f2.html
Reply | Threaded
Open this post in threaded view
|

Re: AWS timeout

Wietse Venema
fhare:

> Hello list,
>
> Bit of a weird one here. I have hosts at AWS sending mail across a
> Checkpoint VPN to my main private relay server (it basically serves to relay
> mail to O365 for in house applications). The problem is that the sending
> client never receives BYE from server after QUIT. The mail goes through and
> is delivered ok. This is bad because our timeout is 300s and if you have
> anything more than a small amount of mail to send, your connections waiting
> to timeout build up at the client and cause problems with applications. Mail
> from non-AWS sources does not have this problem across other legs of our
> Checkpoint VPN.

If you look at the non-VPN captures, then you will see the following:

- In one trace, we see a client ACK 138, followed by a client packet
  with ".<CR><LF>" (data 443:446, ACK 128, and a timestamp field
  tht is unlike those of al other packets in the stream).

- In the other trace, we see that the ACK and ".<CR><LF>" packets
  are sent as one packet, with a normal timestamp field.

- After this, the TCP connection is messed up, the server keeps
  transmitting "Queued as xxx", and the client keeps transmitting
  QUIT.

This looks like the VPN mucks with TCP and screws up the protocol.

Get a better VPN. If yiu must use the VPN, maybe sending SMTP over TLS
will change the problem.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: AWS timeout

Wietse Venema
Wietse Venema:

> fhare:
> > Hello list,
> >
> > Bit of a weird one here. I have hosts at AWS sending mail across a
> > Checkpoint VPN to my main private relay server (it basically serves to relay
> > mail to O365 for in house applications). The problem is that the sending
> > client never receives BYE from server after QUIT. The mail goes through and
> > is delivered ok. This is bad because our timeout is 300s and if you have
> > anything more than a small amount of mail to send, your connections waiting
> > to timeout build up at the client and cause problems with applications. Mail
> > from non-AWS sources does not have this problem across other legs of our
> > Checkpoint VPN.

There was one typo: 128 instead of 138. I fixed it below.

> If you look at the non-VPN captures, then you will see the following:
>
> - In one trace, we see a client ACK 138, followed by a client packet
>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>   tht is unlike those of al other packets in the stream).
>
> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>   are sent as one packet, with a normal timestamp field.
>
> - After this, the TCP connection is messed up, the server keeps
>   transmitting "Queued as xxx", and the client keeps transmitting
>   QUIT.
>
> This looks like the VPN mucks with TCP and screws up the protocol.
>
> Get a better VPN. If yiu must use the VPN, maybe sending SMTP over TLS
> will change the problem.
>
> Wietse
>
Reply | Threaded
Open this post in threaded view
|

Re: AWS timeout

Wietse Venema
Wietse Venema:

> If you look at the non-VPN captures, then you will see the following:
>
> - In one trace, we see a client ACK 138, followed by a client packet
>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>   tht is unlike those of al other packets in the stream).
>
> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>   are sent as one packet, with a normal timestamp field.
>
> - After this, the TCP connection is messed up, the server keeps
>   transmitting "Queued as xxx", and the client keeps transmitting
>   QUIT.
>
> This looks like the VPN mucks with TCP and screws up the protocol.

I verified this another time. On the SMTP client side, there is one packet,
and on the SMTP server side there are two.

Here is a diff -U output for client-side and server-side pcap files.

-TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
.
+TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
+TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .

On the client side we see that the client sends one TCP packet with
three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
server offset 138, and Timestamp Value 2968348176 and Timestamp
Echo Reply 687501956.

On the server side we see that the server receives two TCP packets.

- One packet with the same ACK for server offset 138, and the same
Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
the client sent, but no data bytes.

- One packet with the same ACK for server offset 138, and the same
three data bytes as the client sent, but with a Timestamp Value of
1 and a Timestamp Echo Reply of 0.

And from here on TCP appears to be stuck. The server repeatedly
sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
sends "QUIT<CR><LF>".

Anyway, something is breaking TCP and it isn't Postfix. Maybe the
VPN is trying to be too smart, maybe something in your TCP stack
is trying to shape traffic and messing up.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: AWS timeout

John Fawcett
On 14/05/2019 01:27, Wietse Venema wrote:

> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>>   tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>>   are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>>   transmitting "Queued as xxx", and the client keeps transmitting
>>   QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
>
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.
>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
> Wietse

The packet sent by the client with the QUIT command never makes it to
Postfix on the server, which therefore remains waiting for 300 seconds
before sending the 421 response for timeout. Specifically, Postfix never
sends a BYE since it doesn't see the QUIT. The TCP traffic after the
QUIT is retransmissions that start on the server due the QUIT packet not
being received.

If the QUIT packet were being lost along the way over the VPN, what I
would expect is for it to show up in the source pcap but not in the
destination pcap, but it is in both and that seems to suggest it is
being lost afterwards.

fhare, can I ask if you took the destination trace on the server itself
or on an intermediate node (for example on a vmware host)?

John



Reply | Threaded
Open this post in threaded view
|

Re: AWS timeout

John Fawcett
In reply to this post by Wietse Venema
On 14/05/2019 01:27, Wietse Venema wrote:

> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>>   tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>>   are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>>   transmitting "Queued as xxx", and the client keeps transmitting
>>   QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
this looks like a duplicate ACK that was generated on the server side
VPN node.
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.

The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.

I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.

Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.

>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
> Wietse


Reply | Threaded
Open this post in threaded view
|

Re: AWS timeout

fhare
John,

Wow, good one dude! Turning off tcp timestamps on the client instantly fixed it. So I guess I bring this to Checkpoint Support and see what they say, THAT should be fun.

Thanks!

On Tue, May 14, 2019 at 3:02 AM John Fawcett <[hidden email]> wrote:
On 14/05/2019 01:27, Wietse Venema wrote:
> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>>   tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>>   are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>>   transmitting "Queued as xxx", and the client keeps transmitting
>>   QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
this looks like a duplicate ACK that was generated on the server side
VPN node.
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.

The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.

I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.

Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.

>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
>       Wietse


Reply | Threaded
Open this post in threaded view
|

Ris: AWS timeout

John Fawcett
Frank

Credit to Weitse who noticed the strange TSval in the pcap.

It does look like an issue with Checkpoint. Good thing is this is reproducible.

John


-------- Messaggio originale --------
Oggetto: Re: AWS timeout
Da: Frank Hare
A: John Fawcett
CC: [hidden email]


John,

Wow, good one dude! Turning off tcp timestamps on the client instantly fixed it. So I guess I bring this to Checkpoint Support and see what they say, THAT should be fun.

Thanks!

On Tue, May 14, 2019 at 3:02 AM John Fawcett <[hidden email]> wrote:
On 14/05/2019 01:27, Wietse Venema wrote:
> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>>   tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>>   are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>>   transmitting "Queued as xxx", and the client keeps transmitting
>>   QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
this looks like a duplicate ACK that was generated on the server side
VPN node.
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.

The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.

I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.

Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.

>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
>       Wietse


Reply | Threaded
Open this post in threaded view
|

Re: Ris: AWS timeout

fhare
Yes, thank you Weitse as well, sorry for the omission!

Best Regards

On Tue, May 14, 2019 at 3:20 PM [hidden email] <[hidden email]> wrote:
Frank

Credit to Weitse who noticed the strange TSval in the pcap.

It does look like an issue with Checkpoint. Good thing is this is reproducible.

John


-------- Messaggio originale --------
Oggetto: Re: AWS timeout
Da: Frank Hare
A: John Fawcett
CC: [hidden email]


John,

Wow, good one dude! Turning off tcp timestamps on the client instantly fixed it. So I guess I bring this to Checkpoint Support and see what they say, THAT should be fun.

Thanks!

On Tue, May 14, 2019 at 3:02 AM John Fawcett <[hidden email]> wrote:
On 14/05/2019 01:27, Wietse Venema wrote:
> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>>   tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>>   are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>>   transmitting "Queued as xxx", and the client keeps transmitting
>>   QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
this looks like a duplicate ACK that was generated on the server side
VPN node.
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.

The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.

I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.

Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.

>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
>       Wietse


Reply | Threaded
Open this post in threaded view
|

Re: Ris: AWS timeout

Durga Prasad Malyala
In reply to this post by John Fawcett
No surprises here. Weitse ranks along the alltime greats of Computing. 

Cheers/DP


On Wed, May 15, 2019, 00:51 [hidden email] <[hidden email]> wrote:
Frank

Credit to Weitse who noticed the strange TSval in the pcap.

It does look like an issue with Checkpoint. Good thing is this is reproducible.

John


-------- Messaggio originale --------
Oggetto: Re: AWS timeout
Da: Frank Hare
A: John Fawcett
CC: [hidden email]


John,

Wow, good one dude! Turning off tcp timestamps on the client instantly fixed it. So I guess I bring this to Checkpoint Support and see what they say, THAT should be fun.

Thanks!

On Tue, May 14, 2019 at 3:02 AM John Fawcett <[hidden email]> wrote:
On 14/05/2019 01:27, Wietse Venema wrote:
> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>>   tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>>   are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>>   transmitting "Queued as xxx", and the client keeps transmitting
>>   QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
this looks like a duplicate ACK that was generated on the server side
VPN node.
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.

The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.

I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.

Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.

>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
>       Wietse


Reply | Threaded
Open this post in threaded view
|

Re: Ris: AWS timeout

Ron Wheeler

You never see him interviewed on TV.

If people knew how much of the email travels over the internet as a result of his work, he would be a tech star.

Ron

On 2019-05-14 10:39 p.m., Durga Prasad Malyala wrote:
No surprises here. Weitse ranks along the alltime greats of Computing. 

Cheers/DP


On Wed, May 15, 2019, 00:51 [hidden email] <[hidden email]> wrote:
Frank

Credit to Weitse who noticed the strange TSval in the pcap.

It does look like an issue with Checkpoint. Good thing is this is reproducible.

John


-------- Messaggio originale --------
Oggetto: Re: AWS timeout
Da: Frank Hare
A: John Fawcett
CC: [hidden email]


John,

Wow, good one dude! Turning off tcp timestamps on the client instantly fixed it. So I guess I bring this to Checkpoint Support and see what they say, THAT should be fun.

Thanks!

On Tue, May 14, 2019 at 3:02 AM John Fawcett <[hidden email]> wrote:
On 14/05/2019 01:27, Wietse Venema wrote:
> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>>   with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>>   tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>>   are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>>   transmitting "Queued as xxx", and the client keeps transmitting
>>   QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
this looks like a duplicate ACK that was generated on the server side
VPN node.
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.

The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.

I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.

Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.

>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
>       Wietse


-- 
Ron Wheeler
Artifact Software
438-345-3369
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Ris: AWS timeout

@lbutlr
On 14 May 2019, at 21:35, Ron Wheeler <[hidden email]> wrote:
> If people knew how much of the email travels over the internet as a result of his work, he would be a tech star.

I really doubt he is interested in notoriety.

--
'You're your own worst enemy, Rincewind,' said the sword.
Rincewind looked up at the grinning men. 'Bet?' --Colour of Magic