EDNS / DANE trouble with Microsoft mail.protection.outlook.com.

Walter Doekes
Hi there list,

this week we stumbled upon an issue where we could not send mail to
certain domains, for instance [hidden email].

> Nov 16 17:04:08 mail postfix/smtp[13330]: warning: no MX host for umcg.nl has a valid address record
> Nov 16 17:04:08 mail postfix/smtp[13330]: 1D1D21422C2: to=<[hidden email]>, relay=none, delay=2257, delays=2256/0.02/0.52/0, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=umcg-nl.mail.protection.outlook.com type=A: Host not found, try again)

It turned out that this was the cause:

   $ dig MX umcg.nl +short
   10 umcg-nl.mail.protection.outlook.com.

   $ dig NS mail.protection.outlook.com. +short
   ns1-proddns.glbdns.o365filtering.com.
   ns2-proddns.glbdns.o365filtering.com.

   $ dig A umcg-nl.mail.protection.outlook.com.  \
       @ns1-proddns.glbdns.o365filtering.com. +edns +dnssec |
     grep FORMERR
   ;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46904
   ;; WARNING: EDNS query returned status FORMERR -
       retry with '+nodnssec +noedns'


Apparently some Microsoft Office 365 mail servers do not support EDNS
and return FORMERR. This propagated through our DNS recursors as
SERVFAIL and caused the lookup to fail.

A temporary workaround was to preheat the DNS cache by manually querying
said domain without EDNS and then flush the queue entries:

   $ dig A umcg-nl.mail.protection.outlook.com. \
       @ns1-proddns.glbdns.o365filtering.com. +noedns +nodnssec +short
   213.199.154.87
   213.199.154.23

   # postqueue -i THE_ITEM

But that's obviously not the right solution.


Some more digging revealed that EDNS was enabled on the query through
`smtp_addr_list`:

      else if (smtp_tls_insecure_mx_policy > TLS_LEV_MAY)
         res_opt = RES_USE_DNSSEC;

The USE_DNSSEC causes the subsequent queries to use USE_EDNS0 with the
DO flag and that killed our interoperability with the Microsoft Office
365 DNS.
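
As an illustration of what "EDNS0 with the DO flag" means on the wire, here
is a minimal Python sketch that builds the same A query with and without an
OPT record. It is not Postfix or libc code; the helper names (encode_name,
build_query) and the fixed query ID are made up for this example.

```python
import struct

DO_BIT = 0x8000   # "DNSSEC OK" bit in the EDNS0 flags field (RFC 3225)
TYPE_A = 1
TYPE_OPT = 41     # OPT pseudo-record type used by EDNS0 (RFC 6891)
CLASS_IN = 1


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in the root label."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, edns: bool = False, dnssec_ok: bool = False,
                query_id: int = 0x1234, udp_size: int = 4096) -> bytes:
    """Build a DNS A query; with edns=True an OPT record is appended."""
    arcount = 1 if edns else 0
    # Header: ID, flags (only RD set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", TYPE_A, CLASS_IN)
    if not edns:
        return header + question
    # OPT record: root name, TYPE=OPT, CLASS=UDP payload size,
    # TTL = extended rcode (8 bits) | version (8 bits) | flags (16 bits),
    # RDLENGTH=0 (no options)
    ttl = DO_BIT if dnssec_ok else 0
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, udp_size, ttl, 0)
    return header + question + opt


if __name__ == "__main__":
    qname = "umcg-nl.mail.protection.outlook.com"
    print(build_query(qname).hex())
    print(build_query(qname, edns=True, dnssec_ok=True).hex())
```

The two packets differ only in ARCOUNT and in the 11 trailing OPT bytes. A
server that does not understand the OPT record may answer FORMERR, as in the
dig output above.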

The fix was then to lower `smtp_tls_dane_insecure_mx_policy` (the
`smtp_tls_insecure_mx_policy` variable in the code above) from 5 (dane) to
1 (may):

     # default: dane
     smtp_tls_dane_insecure_mx_policy = may


For the record, according to the logs this miscommunication started on our
servers on the 2nd of November (although I cannot rule out that something
changed on our side). We are running postfix 3.1.0-3 (Ubuntu Xenial) here.


My questions -- finally:

- Apart from Microsoft upgrading their servers to 2016 and supporting
EDNS, is this issue something postfix should handle?

- Would postfix have handled FORMERR but not SERVFAIL and are my caching
resolvers to blame?

- Should postfix retry the query without EDNS on unexpected errors?

- Should the default smtp_tls_dane_insecure_mx_policy be set to 'dane'?
Or should something more conservative be appropriate if it's able to
cause this kind of miscommunication?



Thanks for your input.

Cheers,
Walter Doekes
OSSO B.V.

Re: EDNS / DANE trouble with Microsoft mail.protection.outlook.com.

Viktor Dukhovni
On Wed, Nov 16, 2016 at 11:15:35PM +0100, Walter Doekes wrote:

> this week we stumbled upon an issue where we could not send mail to certain
> domains, for instance [hidden email].
>
> Nov 16 17:04:08 mail postfix/smtp[13330]: warning:
>     no MX host for umcg.nl has a valid address record
> Nov 16 17:04:08 mail postfix/smtp[13330]: 1D1D21422C2:
>     to=<[hidden email]>, relay=none, delay=2257,
>     delays=2256/0.02/0.52/0, dsn=4.4.3, status=deferred
>     (Host or domain name not found. Name service error
>     for name=umcg-nl.mail.protection.outlook.com type=A:
>     Host not found, try again)
>
> It turned out that this was the cause:
>
>   $ dig MX umcg.nl +short
>   10 umcg-nl.mail.protection.outlook.com.
>
>   $ dig NS mail.protection.outlook.com. +short
>   ns1-proddns.glbdns.o365filtering.com.
>   ns2-proddns.glbdns.o365filtering.com.
>
>   $ dig A umcg-nl.mail.protection.outlook.com.  \
>       @ns1-proddns.glbdns.o365filtering.com. +edns +dnssec |
>     grep FORMERR
>   ;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46904
>   ;; WARNING: EDNS query returned status FORMERR -
>       retry with '+nodnssec +noedns'

I can't reproduce your observations using unbound as the local
resolver:


    $ dig +dnssec +ad +noall +comment +cmd +qu +ans +auth +nocl +nottl \
        -t a umcg-nl.mail.protection.outlook.com

    ; <<>> DiG 9.10.4-P2 <<>> +dnssec +ad +noall +comment +cmd +qu +ans +auth +nocl +nottl -t a umcg-nl.mail.protection.outlook.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10562
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags: do; udp: 4096
    ;; QUESTION SECTION:
    ;umcg-nl.mail.protection.outlook.com. IN        A

    ;; ANSWER SECTION:
    umcg-nl.mail.protection.outlook.com. A 213.199.154.23
    umcg-nl.mail.protection.outlook.com. A 213.199.154.87

Postfix will not directly query the remote nameserver, and indeed
with DANE you're supposed to be configured to *only* query the
local resolver.  What resolver is that?  And how is it configured?

Once the A records come back insecure (AD=0), Postfix will not
query for TLSA records.
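
A rough Python sketch of that decision, assuming only what is stated above:
the AD bit of the address answer decides whether a TLSA lookup follows. The
function names are hypothetical, and the real Postfix logic is written in C
and takes more conditions into account.

```python
from typing import Optional

AD_BIT = 0x0020  # "Authentic Data" bit in the DNS header flags word


def is_secure(flags: int) -> bool:
    """True when the resolver marked the answer as DNSSEC-validated (AD=1)."""
    return bool(flags & AD_BIT)


def tlsa_qname(host: str, port: int = 25, proto: str = "tcp") -> str:
    """TLSA records live at _<port>._<proto>.<host> (RFC 6698)."""
    return f"_{port}._{proto}.{host.rstrip('.')}."


def next_lookup(host: str, address_flags: int) -> Optional[str]:
    """Return the TLSA name to query, or None if the A/AAAA answer was insecure."""
    if not is_secure(address_flags):
        return None
    return tlsa_qname(host)


if __name__ == "__main__":
    # "flags: qr rd ra" as in the dig output above is 0x8180: AD is not set.
    print(next_lookup("umcg-nl.mail.protection.outlook.com", 0x8180))
    # The same flags plus AD (0x81a0) for a hypothetical signed host.
    print(next_lookup("mx.example.org", 0x81A0))
```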

> Apparently some Microsoft Office 365 mail servers do not support EDNS and
> return FORMERR. This propagated through our DNS recursors as SERVFAIL and
> caused the lookup to fail.

FORMERR is the expected/standard response in this case, and your
resolver is expected to fall back to non-EDNS queries.

> Some more digging revealed that EDNS was enabled on the query through
> `smtp_addr_list`:
>
>      else if (smtp_tls_insecure_mx_policy > TLS_LEV_MAY)
>         res_opt = RES_USE_DNSSEC;

That setting affects communication between Postfix and the local
resolver; it does not control the options on the next-hop query.

> The USE_DNSSEC causes the subsequent queries to use USE_EDNS0 with the DO
> flag and that killed our interoperability with the Microsoft Office 365 DNS.

This analysis is flawed.  Your resolver is not supposed to
unconditionally use EDNS upstream just because the local client is
using EDNS.

> - Apart from Microsoft upgrading their servers to 2016 and supporting EDNS,
> is this issue something postfix should handle?

The problem is your resolver.

> - Would postfix have handled FORMERR but not SERVFAIL and are my caching
> resolvers to blame?

The latter.

> - Should postfix retry the query without EDNS on unexpected errors?

No.

--
        Viktor.

Re: EDNS / DANE trouble with Microsoft mail.protection.outlook.com.

Walter Doekes
Awesome Viktor! Thanks for your speedy response.

On 17-11-16 01:17, Viktor Dukhovni wrote:
> On Wed, Nov 16, 2016 at 11:15:35PM +0100, Walter Doekes wrote:
>> this week we stumbled upon an issue where we could not send mail to certain
>> domains, for instance [hidden email].
...
>> It turned out that this was the cause:
...
>>   $ dig A umcg-nl.mail.protection.outlook.com.  \
>>       @ns1-proddns.glbdns.o365filtering.com. +edns +dnssec |
>>     grep FORMERR
>>   ;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46904
>>   ;; WARNING: EDNS query returned status FORMERR -
>>       retry with '+nodnssec +noedns'

> I can't reproduce your observations using unbound as the local
> resolver:
>
>     $ dig +dnssec +ad +noall +comment +cmd +qu +ans +auth +nocl +nottl \
>         -t a umcg-nl.mail.protection.outlook.com
...
>     umcg-nl.mail.protection.outlook.com. A 213.199.154.23
>     umcg-nl.mail.protection.outlook.com. A 213.199.154.87
>
> Postfix will not directly query the remote nameserver, and indeed
> with DANE you're supposed to be configured to *only* query the
> local resolver.  What resolver is that?  And how is it configured?
>
> Once the A records come back insecure (AD=0), Postfix will not
> query for TLSA records.

Yes, I was aware that postfix doesn't do the recursion itself. The
@remote-dns in the example was merely to clarify.

You are right. I checked with bind9 as recursor today and it does two
queries: first one that gets the FORMERR and then a second one without
EDNS that succeeds. It'll happily pass along the successful response to
the original requestor.
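
The two-step behaviour described above can be sketched in Python against a
simulated upstream. This is a toy model, not bind9 or PowerDNS code:
legacy_upstream, resolve_with_fallback and resolve_without_fallback are
hypothetical names, and the "upstream" is a plain function rather than a
network server.

```python
from typing import Callable, List, Tuple

# Standard DNS response codes
NOERROR, FORMERR, SERVFAIL = 0, 1, 2

# An upstream is modelled as a function (qname, use_edns) -> (rcode, answers).
Upstream = Callable[[str, bool], Tuple[int, List[str]]]


def legacy_upstream(qname: str, use_edns: bool) -> Tuple[int, List[str]]:
    """Simulated authoritative server that rejects EDNS queries with FORMERR."""
    if use_edns:
        return FORMERR, []
    return NOERROR, ["213.199.154.23", "213.199.154.87"]


def resolve_with_fallback(qname: str, upstream: Upstream) -> Tuple[int, List[str]]:
    """Expected behaviour: try EDNS first, retry once without it on FORMERR."""
    rcode, answers = upstream(qname, True)
    if rcode == FORMERR:
        rcode, answers = upstream(qname, False)
    if rcode == FORMERR:
        # Still malformed from the upstream's point of view: give up.
        return SERVFAIL, []
    return rcode, answers


def resolve_without_fallback(qname: str, upstream: Upstream) -> Tuple[int, List[str]]:
    """Failure mode from this thread: FORMERR is turned straight into SERVFAIL."""
    rcode, answers = upstream(qname, True)
    if rcode == FORMERR:
        return SERVFAIL, []
    return rcode, answers


if __name__ == "__main__":
    name = "umcg-nl.mail.protection.outlook.com"
    print(resolve_with_fallback(name, legacy_upstream))
    print(resolve_without_fallback(name, legacy_upstream))
```

With the fallback the client gets the two A records; without it the client
only ever sees SERVFAIL, which matches the deferred mail in the original
report.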

That looks like I have my DNS recursor to blame for the problem. It's a
powerdns recursor, version 4.0.0~alpha2 if I'm not mistaken.

I'll be forwarding the issue with the appropriate evidence there if it
hasn't been fixed already.


Thanks again,
Walter Doekes
OSSO B.V.

Re: EDNS / DANE trouble with Microsoft mail.protection.outlook.com.

Viktor Dukhovni
On Thu, Nov 17, 2016 at 10:18:01PM +0100, Walter Doekes wrote:

> >Postfix will not directly query the remote nameserver, and indeed
> >with DANE you're supposed to be configured to *only* query the
> >local resolver.  What resolver is that?  And how is it configured?
> >
> >Once the A records come back insecure (AD=0), Postfix will not
> >query for TLSA records.
>
> Yes, I was aware that postfix doesn't do the recursion itself. The
> @remote-dns in the example was merely to clarify.
>
> You are right. I checked with bind9 as recursor today and it does two
> queries: first one that gets the FORMERR and then a second one without EDNS
> that succeeds. It'll happily pass along the successful response to the
> original requestor.
>
> That looks like I have my DNS recursor to blame for the problem. It's a
> powerdns recursor, version 4.0.0~alpha2 if I'm not mistaken.
>
> I'll be forwarding the issue with the appropriate evidence there if it
> hasn't been fixed already.

Please post a summary with the resolution.  If for some (unlikely)
reason you don't get an adequate answer from PowerDNS support, drop
me a note, I can reach out directly to the developers.  Recursors
are expected to behave in the manner you observed with bind9.

--
        Viktor.

Re: EDNS / DANE trouble with Microsoft mail.protection.outlook.com.

Walter Doekes
> On Thu, Nov 17, 2016 at 10:18:01PM +0100, Walter Doekes wrote:
>> That looks like I have my DNS recursor to blame for the problem. It's a
>> powerdns recursor, version 4.0.0~alpha2 if I'm not mistaken.
>>
>> I'll be forwarding the issue with the appropriate evidence there if it
>> hasn't been fixed already.
>
> Please post a summary with the resolution.  If for some (unlikely)
> reason you don't get an adequate answer from PowerDNS support, drop
> me a note, I can reach out directly to the developers.  Recursors
> are expected to behave in the manner you observed with bind9.

Okay, today I finally got some time to get this sorted. It appears it was
indeed a bug in pdns-recursor 4.0.0~alpha2-2 on Ubuntu/Xenial.

The bug had been fixed upstream in May 2016:
https://github.com/PowerDNS/pdns/commit/9d534f2a12defc44d2a79291bf34b82e5ee28121

I've filed a bug report for Ubuntu here:
https://bugs.launchpad.net/ubuntu/+source/pdns-recursor/+bug/1646538

It looks like, among the Debian and Ubuntu releases, only Ubuntu/Xenial (LTS)
is affected. All the others run either 3.x, which didn't appear to be
affected by this, or 4.0.1 or higher, which includes 9d534f2a.

Thanks again for your prompt reply!

Walter Doekes
OSSO B.V.