available: multiple deliveries per TLS-encrypted connection

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
30 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Viktor Dukhovni
On Tue, Jun 19, 2018 at 01:22:53PM -0400, Wietse Venema wrote:

> Unfortunately, this would be suboptimal when a site has muliple MX hosts
> (It may end up making connections to each of them).
>
> Viktor's suggestion to skip the dane cache makes more sense.
>
> Viktor, cache wshould terminate after "postfix reload".

In that case, perhaps the below will work?

diff --git a/src/smtp/smtp_tls_policy.c b/src/smtp/smtp_tls_policy.c
index 13735b21..b5f72376 100644
--- a/src/smtp/smtp_tls_policy.c
+++ b/src/smtp/smtp_tls_policy.c
@@ -824,6 +824,20 @@ static void dane_init(SMTP_TLS_POLICY *tls, SMTP_ITERATOR *iter)
  dane_incompat(tls, iter, NONDANE_DEST, "non DNSSEC destination");
  return;
     }
+
+    /*
+     * Without the context of a particular MX host, the nexthop is merely a
+     * opportunistic "candidate" for DANE policy.  Just return "may" for now.
+     *
+     * XXX: This state should only be reached from smtp_reuse_session(),
+     * perhaps add a safety flag in iter that is co-requisite with a NULL
+     * iter->rr, or else panic?
+     */
+    if (!iter->rr) {
+ tls->level = TLS_LEV_MAY;
+ return;
+    }
+
     /* When TLSA lookups fail, we defer the message */
     if ((dane = tls_dane_resolve(iter->port, "tcp", iter->rr,
  var_smtp_tls_force_tlsa)) == 0) {

--
        Viktor.
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Viktor Dukhovni


> On Jun 19, 2018, at 1:29 PM, Viktor Dukhovni <[hidden email]> wrote:
>
> In that case, perhaps the below will work?
>
> diff --git a/src/smtp/smtp_tls_policy.c b/src/smtp/smtp_tls_policy.c
> index 13735b21..b5f72376 100644
> --- a/src/smtp/smtp_tls_policy.c
> +++ b/src/smtp/smtp_tls_policy.c
> @@ -824,6 +824,20 @@ static void dane_init(SMTP_TLS_POLICY *tls, SMTP_ITERATOR *iter)
> dane_incompat(tls, iter, NONDANE_DEST, "non DNSSEC destination");
> return;
>     }
> +
> +    /*
> +     * Without the context of a particular MX host, the nexthop is merely a
> +     * opportunistic "candidate" for DANE policy.  Just return "may" for now.
> +     *
> +     * XXX: This state should only be reached from smtp_reuse_session(),
> +     * perhaps add a safety flag in iter that is co-requisite with a NULL
> +     * iter->rr, or else panic?
> +     */
> +    if (!iter->rr) {
> + tls->level = TLS_LEV_MAY;
> + return;
> +    }
> +
>     /* When TLSA lookups fail, we defer the message */
>     if ((dane = tls_dane_resolve(iter->port, "tcp", iter->rr,
> var_smtp_tls_force_tlsa)) == 0) {

Ralf, please try just this patch against the stock 20180618 snapshot,
and check as many of the below as you can:

  * The crashes are gone
  * DANE is still used when expected
  * TLS connection re-use happens under sustained load

We might want to log some sort of connection identifier
with re-used connections, that would make it possible
reliably find the original session to check that it
was authenticated as expected.  This is only useful
with TLS, so we could perhaps ask the TLS library for
"channel binding" at session creation, and log it
(on both ends) with the client logging it again on
*connection* re-use.  (TLS session resumption is
a separate matter and would produce a separate
channel fingerprint on each connection).

--
        Viktor.

Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Wietse Venema
In reply to this post by Viktor Dukhovni
Viktor Dukhovni:

> On Tue, Jun 19, 2018 at 01:22:53PM -0400, Wietse Venema wrote:
>
> > Unfortunately, this would be suboptimal when a site has muliple MX hosts
> > (It may end up making connections to each of them).
> >
> > Viktor's suggestion to skip the dane cache makes more sense.
> >
> > Viktor, cache wshould terminate after "postfix reload".
>
> In that case, perhaps the below will work?

It would not crash, but I don't think it would help.

First, the scache is indexed with keys that include the TLS security
level for a connection, so that we will never reuse a low-security
connection to deliver mail for a high-security destination. Thus
the lookups need to specify the security level that was in effect
when a connection was stored in the cache.

Second, it wants to look up the scache for the nexthop and primary
MXes to avoid contacting hosts that we have no cached connection
for. I think that these are the lookups that are failing because
state->iter->rr is not set. Would the patch below help?

        Wietse

*** ./smtp_connect.c- 2018-06-04 19:21:21.000000000 -0400
--- ./smtp_connect.c 2018-06-19 14:36:32.000000000 -0400
***************
*** 672,677 ****
--- 672,678 ----
       * for connection-cache lookup by request nexthop only.
       */
  #ifdef USE_TLS
+     iter->rr = *addr_list;
      if (!smtp_tls_policy_cache_query(why, state->tls, iter)) {
  msg_warn("TLS policy lookup error for %s/%s: %s",
  STR(iter->dest), STR(iter->host), STR(why->reason));
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Wietse Venema
In reply to this post by Viktor Dukhovni
Viktor Dukhovni:

>
>
> > On Jun 19, 2018, at 1:29 PM, Viktor Dukhovni <[hidden email]> wrote:
> >
> > In that case, perhaps the below will work?
> >
> > diff --git a/src/smtp/smtp_tls_policy.c b/src/smtp/smtp_tls_policy.c
> > index 13735b21..b5f72376 100644
> > --- a/src/smtp/smtp_tls_policy.c
> > +++ b/src/smtp/smtp_tls_policy.c
> > @@ -824,6 +824,20 @@ static void dane_init(SMTP_TLS_POLICY *tls, SMTP_ITERATOR *iter)
> > dane_incompat(tls, iter, NONDANE_DEST, "non DNSSEC destination");
> > return;
> >     }
> > +
> > +    /*
> > +     * Without the context of a particular MX host, the nexthop is merely a
> > +     * opportunistic "candidate" for DANE policy.  Just return "may" for now.
> > +     *
> > +     * XXX: This state should only be reached from smtp_reuse_session(),
> > +     * perhaps add a safety flag in iter that is co-requisite with a NULL
> > +     * iter->rr, or else panic?
> > +     */
> > +    if (!iter->rr) {
> > + tls->level = TLS_LEV_MAY;
> > + return;
> > +    }
> > +
> >     /* When TLSA lookups fail, we defer the message */
> >     if ((dane = tls_dane_resolve(iter->port, "tcp", iter->rr,
> > var_smtp_tls_force_tlsa)) == 0) {
>
> Ralf, please try just this patch against the stock 20180618 snapshot,
> and check as many of the below as you can:

This will stop crashes but it will also not find any connections
that are cached under the nexthop name, and that were stored with
a different TLS security level.

The connection cache storage key contains the TLS security level
that is in effect at the time that the connection is stored, so
that we will never deliver mail for a high-security destination
over a reused low-security connection. BTW it currently not reuse
connections when the security level requires certificate verification,
because the certificate information is currently not stored in the
connection cache.

Not finding connections cached under nexthop, when primary MXes are
unavailable, means Postfix will waste time commecting to unavailable
MX hosts before it finds something that works.

Finally, can anyone tell me how to reproduce these crashes? This
is extremely frustrating. I can do DNSSSEC lookups on freeBSD and
LINUX.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Viktor Dukhovni
In reply to this post by Wietse Venema
On Jun 19, 2018, at 2:38 PM, Wietse Venema <[hidden email]> wrote:
>
> It would not crash, but I don't think it would help.
>
> First, the scache is indexed with keys that include the TLS security
> level for a connection, so that we will never reuse a low-security
> connection to deliver mail for a high-security destination.

The problem is that with DANE we don't know the security level
until we're working with a particular MX host.  Or alternatively,
we should save the original "DANE candidate" level for recording
in the session cache for nexthop entries, and then use the actual
level and TLS properties for the per-address entries.

IIRC the cache key also includes the delivery agent name,
so we never mix delivery agents with different CA stores,
TLS policy tables, ...

In which case when looking for some connection to the nexthop,
I think it could be simpler to assume that if the connection
match the policy when it was recently created (we have a
connection re-use TTL) then it should still be good enough
without salting in the TLS details.

Salting in TLS details is of course needed for the per-peer
address cache lookups, where the same host might serve
multiple domains.


> Thus
> the lookups need to specify the security level that was in effect
> when a connection was stored in the cache.

This is not possible with DANE without some further changes,
but I am not convinced it is necessary.

> Second, it wants to look up the scache for the nexthop and primary
> MXes to avoid contacting hosts that we have no cached connection
> for. I think that these are the lookups that are failing because
> state->iter->rr is not set. Would the patch below help?

It would not always result in correct behaviour, the cached
nexthop connection might not be associated with the first
MX host.  So this too would not crash, but may not produce
correct results.  If, for example, the first MX host has
DANE TLSA records that don't match, we presumably (really
must) close the connection without caching it, and may
make a connection to a backup MX without DANE TLSA RRs
(insecure, but that's what we have).  But then the
nexthop connection's security level does not match
the security level of the initial MX.

Note also that DANE policy lookup may fail for the
initial MX (bad DNSSEC...), but we should still
be able to re-use cached connections via the second
MX, ...

So I don't think the below is correct.  One might
argue that my patch should simply leave the TLS
level alone (rather than set it to "may"), and
that the cache entry for the nexthop should reflect
that preliminary nexthop security level, before it
is potentially modified for MX hosts that lack TLSA
records...

> *** ./smtp_connect.c- 2018-06-04 19:21:21.000000000 -0400
> --- ./smtp_connect.c 2018-06-19 14:36:32.000000000 -0400
> ***************
> *** 672,677 ****
> --- 672,678 ----
>       * for connection-cache lookup by request nexthop only.
>       */
>  #ifdef USE_TLS
> +     iter->rr = *addr_list;
>      if (!smtp_tls_policy_cache_query(why, state->tls, iter)) {
>   msg_warn("TLS policy lookup error for %s/%s: %s",
>   STR(iter->dest), STR(iter->host), STR(why->reason));

--
        Viktor.

Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Wietse Venema
Viktor Dukhovni:

> On Jun 19, 2018, at 2:38 PM, Wietse Venema <[hidden email]> wrote:
> >
> > It would not crash, but I don't think it would help.
> >
> > First, the scache is indexed with keys that include the TLS security
> > level for a connection, so that we will never reuse a low-security
> > connection to deliver mail for a high-security destination.
>
> The problem is that with DANE we don't know the security level
> until we're working with a particular MX host.  Or alternatively,
> we should save the original "DANE candidate" level for recording
> in the session cache for nexthop entries, and then use the actual
> level and TLS properties for the per-address entries.

That assumes that the TLS policy for a given (delivery agent, nexthop, ...)
is stable, which is a reasonable assumption.

> IIRC the cache key also includes the delivery agent name,
> so we never mix delivery agents with different CA stores,
> TLS policy tables, ...

Right. We trust in belts, suspenders, and paranoia.

> In which case when looking for some connection to the nexthop,
> I think it could be simpler to assume that if the connection
> match the policy when it was recently created (we have a
> connection re-use TTL) then it should still be good enough
> without salting in the TLS details.

Which is the same thing as fixing it to "may" as in your
previous patch.

> Salting in TLS details is of course needed for the per-peer
> address cache lookups, where the same host might serve
> multiple domains.
 
> So I don't think the below is correct.  One might
> argue that my patch should simply leave the TLS
> level alone (rather than set it to "may"), and
> that the cache entry for the nexthop should reflect
> that preliminary nexthop security level, before it
> is potentially modified for MX hosts that lack TLSA
> records...

But then we'd have to keep that preliminary nexthop security level
around until scache_save_mumble are called. It may be clearer to
use a dummy level to indicate that this is not really used as a TLS
level.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Wietse Venema
Wietse Venema:

> Viktor Dukhovni:
> > On Jun 19, 2018, at 2:38 PM, Wietse Venema <[hidden email]> wrote:
> > >
> > > It would not crash, but I don't think it would help.
> > >
> > > First, the scache is indexed with keys that include the TLS security
> > > level for a connection, so that we will never reuse a low-security
> > > connection to deliver mail for a high-security destination.
> >
> > The problem is that with DANE we don't know the security level
> > until we're working with a particular MX host.  Or alternatively,
> > we should save the original "DANE candidate" level for recording
> > in the session cache for nexthop entries, and then use the actual
> > level and TLS properties for the per-address entries.
>
> That assumes that the TLS policy for a given (delivery agent, nexthop, ...)
> is stable, which is a reasonable assumption.
>
> > IIRC the cache key also includes the delivery agent name,
> > so we never mix delivery agents with different CA stores,
> > TLS policy tables, ...
>
> Right. We trust in belts, suspenders, and paranoia.
>
> > In which case when looking for some connection to the nexthop,
> > I think it could be simpler to assume that if the connection
> > match the policy when it was recently created (we have a
> > connection re-use TTL) then it should still be good enough
> > without salting in the TLS details.
>
> Which is the same thing as fixing it to "may" as in your
> previous patch.
>
> > Salting in TLS details is of course needed for the per-peer
> > address cache lookups, where the same host might serve
> > multiple domains.
>  
> > [...]  One might
> > argue that my patch should simply leave the TLS
> > level alone (rather than set it to "may"), and
> > that the cache entry for the nexthop should reflect
> > that preliminary nexthop security level, before it
> > is potentially modified for MX hosts that lack TLSA
> > records...
>
> But then we'd have to keep that preliminary nexthop security level
> around until scache_save_mumble are called. It may be clearer to
> use a dummy level to indicate that this is not really used as a TLS
> level.

And that is what I will try to implement: a dummy security
level for nexthop-level connection cache lookup keys.

        Wietse
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Viktor Dukhovni
In reply to this post by Wietse Venema


> On Jun 19, 2018, at 3:07 PM, Wietse Venema <[hidden email]> wrote:
>
> Viktor Dukhovni:
>> On Jun 19, 2018, at 2:38 PM, Wietse Venema <[hidden email]> wrote:
>>> Or alternatively,
>>> we should save the original "DANE candidate" level for recording
>>> in the session cache for nexthop entries, and then use the actual
>>> level and TLS properties for the per-address entries.
>
> That assumes that the TLS policy for a given (delivery agent, nexthop, ...)
> is stable, which is a reasonable assumption.

That's my take also, on the timescale of the connection reuse
maximum TTL.

>> In which case when looking for some connection to the nexthop,
>> I think it could be simpler to assume that if the connection
>> match the policy when it was recently created (we have a
>> connection re-use TTL) then it should still be good enough
>> without salting in the TLS details.
>
> Which is the same thing as fixing it to "may" as in your
> previous patch.

Yes, but perhaps better to just use the pre-MX policy
without modification.  Thinking ahead to MTA-STS, when
a first message triggers background MTA-STS policy
discovery, we might want messages that arrive after
that's available to immediately upgrade to STS.

This suggests that the base (nexthop-based) level should
be used and checked.  If that changes, we make a new
connection.

[ In actuality we'll need some new way to enable either or
  both of MTA-STS and DANE, and when both are enabled prefer
  DANE for those MX hosts that have TLSA records, but otherwise
  do MTA-STS if that's provisioned at the remote domain. But
  that's for 2019. ]


>> So I don't think the below is correct.  One might
>> argue that my patch should simply leave the TLS
>> level alone (rather than set it to "may"), and
>> that the cache entry for the nexthop should reflect
>> that preliminary nexthop security level, before it
>> is potentially modified for MX hosts that lack TLSA
>> records...
>
> But then we'd have to keep that preliminary nexthop security level
> around until scache_save_mumble are called. It may be clearer to
> use a dummy level to indicate that this is not really used as a TLS
> level.

Instead of a dummy, I think for now just keep the
level that comes out of TLS policy lookup, before
DANE messes with it on a per-MX basis.

We could later consider refactoring, so that the DANE
level always remains unchanged, but is treated as "may"
dynamically, when no TLSA data accompanies the client
policy.

--
        Viktor.

Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Wietse Venema
Viktor Dukhovni:

>
>
> > On Jun 19, 2018, at 3:07 PM, Wietse Venema <[hidden email]> wrote:
> >
> > Viktor Dukhovni:
> >> On Jun 19, 2018, at 2:38 PM, Wietse Venema <[hidden email]> wrote:
> >>> Or alternatively,
> >>> we should save the original "DANE candidate" level for recording
> >>> in the session cache for nexthop entries, and then use the actual
> >>> level and TLS properties for the per-address entries.
> >
> > That assumes that the TLS policy for a given (delivery agent, nexthop, ...)
> > is stable, which is a reasonable assumption.
>
> That's my take also, on the timescale of the connection reuse
> maximum TTL.
>
> >> In which case when looking for some connection to the nexthop,
> >> I think it could be simpler to assume that if the connection
> >> match the policy when it was recently created (we have a
> >> connection re-use TTL) then it should still be good enough
> >> without salting in the TLS details.
> >
> > Which is the same thing as fixing it to "may" as in your
> > previous patch.
>
> Yes, but perhaps better to just use the pre-MX policy
> without modification.  Thinking ahead to MTA-STS, when
> a first message triggers background MTA-STS policy
> discovery, we might want messages that arrive after
> that's available to immediately upgrade to STS.

smtp_reuse_session() is called if SMTP_MISC_FLAG_CONN_LOAD is set,
whereas smtp_save_session() is called if SMTP_MISC_FLAG_CONN_STORE
is set. These guards are different for a reason.

Therefore we cannot assume that smtp_reuse_session() will be able to
determine the pre-MX policy to be used by for smtp_save_session(),
unless I mis-understand what you mean with pre-MX policy (I assume
the TLS policy based on the nexhop info only).

I'll keep it simple for now. and use an manifest constant for
nexthop-level connection cache lookups and stores. We can refine
the logic later.

        Wietse

> This suggests that the base (nexthop-based) level should
> be used and checked.  If that changes, we make a new
> connection.
>
> [ In actuality we'll need some new way to enable either or
>   both of MTA-STS and DANE, and when both are enabled prefer
>   DANE for those MX hosts that have TLSA records, but otherwise
>   do MTA-STS if that's provisioned at the remote domain. But
>   that's for 2019. ]
>
>
> >> So I don't think the below is correct.  One might
> >> argue that my patch should simply leave the TLS
> >> level alone (rather than set it to "may"), and
> >> that the cache entry for the nexthop should reflect
> >> that preliminary nexthop security level, before it
> >> is potentially modified for MX hosts that lack TLSA
> >> records...
> >
> > But then we'd have to keep that preliminary nexthop security level
> > around until scache_save_mumble are called. It may be clearer to
> > use a dummy level to indicate that this is not really used as a TLS
> > level.
>
> Instead of a dummy, I think for now just keep the
> level that comes out of TLS policy lookup, before
> DANE messes with it on a per-MX basis.
>
> We could later consider refactoring, so that the DANE
> level always remains unchanged, but is treated as "may"
> dynamically, when no TLSA data accompanies the client
> policy.
>
> --
> Viktor.
>
>
Reply | Threaded
Open this post in threaded view
|

Re: PATCH: multiple deliveries per TLS-encrypted connection

Ralf Hildebrandt-2
In reply to this post by Viktor Dukhovni
* Viktor Dukhovni <[hidden email]>:

> Ralf, please try just this patch against the stock 20180618 snapshot,
> and check as many of the below as you can:
>
>   * The crashes are gone
>   * DANE is still used when expected
>   * TLS connection re-use happens under sustained load
>
> We might want to log some sort of connection identifier
> with re-used connections, that would make it possible
> reliably find the original session to check that it
> was authenticated as expected.  This is only useful
> with TLS, so we could perhaps ask the TLS library for
> "channel binding" at session creation, and log it
> (on both ends) with the client logging it again on
> *connection* re-use.  (TLS session resumption is
> a separate matter and would produce a separate
> channel fingerprint on each connection).

I was a bit out of the loop. Currently I'm using 20180624 and the
problems seem to be gone.

--
[*] sys4 AG

https://sys4.de, +49 (89) 30 90 46 64
Schleißheimer Straße 26/MG, 80333 München
                                           
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein
12