Add UTF8 support in PostgreSQL lookup table interface

13 messages

Add UTF8 support in PostgreSQL lookup table interface

Geir Pedersen
Hi Guys,

The dictionary interface to PostgreSQL found in src/global/dict_pgsql.c does not support UTF8. It explicitly tells the database that Postfix will send LATIN1.

With SMTPUTF8 support now in place, Postfix may try to look up addresses with UTF8 in the local part in PostgreSQL virtual mailbox maps. 

Such lookups now fail, as the UTF8 sent by Postfix is taken as LATIN1 by PostgreSQL. Error message from Postfix: "Recipient address rejected: User unknown in virtual mailbox table"

As far as I can see, Postfix now wraps all invocations of the PostgreSQL dictionary and other dictionaries in a filter that will reject invalid UTF8.

I suggest the following trivial patch which seems to fix the problem for me. 

Geir Pedersen.


--- foo/postfix-3.3.1/src/global/dict_pgsql.c   2017-12-27 22:53:13.000000000 +0100
+++ postfix-3.3.1/src/global/dict_pgsql.c       2018-08-23 22:09:17.710170053 +0200
@@ -644,15 +644,14 @@
                 host->hostname);

     /*
-     * XXX Postfix does not send multi-byte characters. The following piece
-     * of code is an explicit statement of this fact, and the database server
-     * should not accept multi-byte information after this point.
+     * Tell PostgreSQL we will send UTF8.
      */
-    if (PQsetClientEncoding(host->db, "LATIN1") != 0) {
-       msg_warn("dict_pgsql: cannot set the encoding to LATIN1, skipping %s",
-                host->hostname);
-       plpgsql_down_host(host);
-       return;
+
+    if (PQsetClientEncoding(host->db, "UTF8") != 0) {
+      msg_warn("dict_pgsql: cannot set the encoding to UTF8, skipping %s",
+               host->hostname);
+      plpgsql_down_host(host);
+      return;
     }
     /* Success. */
     host->stat = STATACTIVE;


Re: Add UTF8 support in PostgreSQL lookup table interface

Viktor Dukhovni


> On Aug 23, 2018, at 6:39 PM, Geir Pedersen <[hidden email]> wrote:
>
> The dictionary interface to PostgreSQL found in src/global/dict_pgsql.c does not support UTF8. It explicitly tells the database that Postfix will send LATIN1.

Absent any indication of character set from the client, there was
no way to know what encoding any particular non-ASCII octet
string may be using, so the code was optimized to avoid spurious
database string conversion errors, by using an encoding that
would accept any octet-string, garbage-in -> garbage-out.

> With SMTPUTF8 support now in place, Postfix may try to look up addresses with UTF8 in the local part in PostgreSQL virtual mailbox maps.

While Postfix now supports UTF8, it is not always enabled, and
even when enabled not always used by the client.  So using
UTF-8 on the database connection may not always be appropriate.

Ideally we'd only use UTF-8 when the client indicates that is
using SMTPUTF8:

  MAIL FROM:<[hidden email]> BODY=8BITMIME SMTPUTF8

> Such lookups now fail as the UTF8 sent by Postfix is taken as LATIN1
> by PostgreSQL. Error message from Postfix:
> "Recipient address rejected: User unknown in virtual mailbox table"

This means that we'd need a way to dynamically update the client
encoding of the database connection to UTF8 when appropriate
and revert it to LATIN1 when the client encoding is unspecified.

And this needs to work across the proxymap protocol.  So while
the change you're proposing is well motivated, I am not sure
that the solution is as simple as you propose.  We'd need to
add a query-time UTF8 flag to the low-level dictionary lookup
methods, implement the higher-level lookups on top of a default
octet-string (LATIN1 if you like) encoding, and add new functions
that perform similar lookups on UTF8 data.

The PostgreSQL driver would then export a function to switch
the client connection to UTF8 (assuming the encoding can be
changed on the fly between queries).

--
        Viktor.


Re: Add UTF8 support in PostgreSQL lookup table interface

Andrew Sullivan
Hi,

On Thu, Aug 23, 2018 at 07:02:27PM -0400, Viktor Dukhovni wrote:
 
> Absent any indication of character set from the client, there was
> no way to know what encoding any particular non-ASCII octet
> string may be using, so the code was optimized to avoid spurious
> database string conversion errors, by using an encoding that
> would accept any octet-string, garbage-in -> garbage-out.

Unless I misunderstand you (and I might well be doing so) I don't
think LATIN1 works that way in Postgres.  The backend's encoding and
the front end's need to be translation-compatible, but that would be
true of a SQL_ASCII back end with literally any front end encoding (no
conversion is specified to a back end in SQL_ASCII).  If you're going
to get multibyte strings, I don't see how LATIN1 is any less likely to
throw errors than UTF8 (and it is slightly more likely, if you have a
UTF-8 multi-octet code point that maps into ISO 8859-1 space in the
wrong way, or if the target environment fails to use ISO 8859-1.  It's
true that the characters in that case will probably map, but they'll
fail anyway since the match could as easily be wrong as right).

> This means that we'd need a way to dynamically update the client
> encoding of the database connection to UTF8 when appropriate
> and revert it to LATIN1 when the client encoding is unspecified.

You can do this in libpq and also in commands passed on a regular connection:

    SET CLIENT_ENCODING TO 'value';

(see https://www.postgresql.org/docs/10/static/multibyte.html).  I
forget whether this can be done inside a transaction, but it seems a
fabulously bad idea to change encodings mid-transaction anyway.
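As an aside, the session-level switch is easy to try interactively in psql (SHOW and RESET are standard PostgreSQL commands; this is illustration only, not part of any proposed patch):

```sql
-- Inspect and change the client encoding for the current session:
SHOW client_encoding;           -- e.g. LATIN1
SET CLIENT_ENCODING TO 'UTF8';
SHOW client_encoding;           -- now UTF8
RESET client_encoding;          -- back to the session default
```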

> And this needs to work across the proxymap protocol.

The problem could well be here; I'm insufficiently familiar with its
internals to comment.

Best regards,

A

--
Andrew Sullivan
[hidden email]

Re: Add UTF8 support in PostgreSQL lookup table interface

Viktor Dukhovni
> On Aug 24, 2018, at 2:33 PM, Andrew Sullivan <[hidden email]> wrote:
>
> On Thu, Aug 23, 2018 at 07:02:27PM -0400, Viktor Dukhovni wrote:
>
>> Absent any indication of character set from the client, there was
>> no way to know what encoding any particular non-ASCII octet
>> string may be using, so the code was optimized to avoid spurious
>> database string conversion errors, by using an encoding that
>> would accept any octet-string, garbage-in -> garbage-out.
>
> Unless I misunderstand you (and I might well be doing so) I don't
> think LATIN1 works that way in Postgres.

The point isn't really about Postgres per se, but rather that all
octet strings are valid encodings of some LATIN1 string, since
every octet is a valid LATIN1 code-point.  The same cannot be said
of UTF8, since random octet strings are generally invalid.

> If you're going
> to get multibyte strings, I don't see how LATIN1 is any less likely to
> throw errors than UTF8

We don't get "multi-byte" strings; absent SMTPUTF8 we get octet strings
that are either ASCII, or something else unspecified that violates the
SMTP protocol.  The something else unspecified will be some valid
(intended or otherwise) LATIN1 string.  Its use in database queries
with a LATIN1 client encoding will not throw perplexing errors.

> It's
> true that the characters in that case will probably map, but they'll
> fail anyway since the match could as easily be wrong as right).

The sending system violated the SMTP protocol, garbage-in -> garbage out.
The folks who tend to be lazy about encodings have generally been the
ones with some flavour of a single-byte ISO-8859-X encoding, so we'll
continue to treat unspecified 8-bit input as LATIN1.

>> This means that we'd need a way to dynamically update the client
>> encoding of the database connection to UTF8 when appropriate
>> and revert it to LATIN1 when the client encoding is unspecified.
>
> You can do this in libpq and also in commands passed on a regular connection:
>
>    SET CLIENT_ENCODING TO 'value';

Yes, but that'd have to be done by the dictionary lookup layer,
possibly in proxymap, based on a suitable signal from the lookup
client, but the low-level API (dict_get()) does not presently
support any per-lookup flags.  So we'd need dict_get_ex() that
takes a new utf8 flag and supporting changes throughout the
code.  This is a major change.

Perhaps there's a clever way to avoid this, but I am not seeing it yet.

--
        Viktor.


Re: Add UTF8 support in PostgreSQL lookup table interface

Wietse Venema
Viktor Dukhovni:
> Yes, but that'd have to be done by the dictionary lookup layer,
> possibly in proxymap, based on a suitable signal from the lookup
> client, but the low-level API (dict_get()) does not presently
> support any per-lookup flags.  So we'd need dict_get_ex() that
> takes a new utf8 flag and supporting changes throughout the
> code.  This is a major change.
>
> Perhaps there's a clever way to avoid this, but I am not seeing it yet.

I don't think that it is a good idea to signal the query string
encoding through the Postfix dict(3) API. Primarily because there
is no legitimate use for non-UTF8 in envelopes and headers, so why
bother adding complexity now for a dying cause?

Relevant for the $Subject: in UTF8 mode, the Postfix dict API will
never look up a non-UTF-8 string, because it cannot exist, and it
will never return a non-UTF-8 lookup result, for the same reasons.
If non-UTF8 content were allowed in UTF-8 mode, then every string
would be a potential bomb ready to explode.

Even more relevant for the $Subject: several Postfix lookup tables
already block non-UTF8 queries when smtputf8 mode is turned
off (dict_ldap.c, dict_sqlite.c):

    /*
     * Don't frustrate future attempts to make Postfix UTF-8 transparent.
     */
    if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
        && !valid_utf8_string(name, strlen(name))) {
        if (msg_verbose)
            msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
                     myname, dict_ldap->parser->name, name);
        return (0);
    }

This code has been there for four years already. Never heard a peep.

I think that we should by default extend this to all lookup tables
and do away with that LATIN1 hack. That would also address the
problem that underlies the original request. No need for signaling
the request encoding!

        Wietse

Re: Add UTF8 support in PostgreSQL lookup table interface

Andrew Sullivan
In reply to this post by Viktor Dukhovni
On Fri, Aug 24, 2018 at 02:55:19PM -0400, Viktor Dukhovni wrote:

> SMTP protocol.  The something else unspecified will be some valid
> (intended or otherwise) LATIN1 string.  Its use in database queries
> with a LATIN1 client encoding will not throw perplexing errors.

Ok, I get it.  I will note that this is true only for some values of
"perplexing".  For instance, to a user, the strings
[hidden email] and aj[ZWNJ][hidden email] look
identical, so the fact that the second one doesn't match will be
perplexing.  This is no criticism: perplexity in i18n is pretty
normal.  But I get it that you won't get some sort of back end
encoding error.

> support any per-lookup flags.  So we'd need dict_get_ex() that
> takes a new utf8 flag and supporting changes throughout the
> code.  This is a major change.

Sounds like, yes.

A

--
Andrew Sullivan
[hidden email]

Re: Add UTF8 support in PostgreSQL lookup table interface

Viktor Dukhovni
In reply to this post by Wietse Venema
On Fri, Aug 24, 2018 at 04:54:32PM -0400, Wietse Venema wrote:

> I think that we should by default extend this to all lookup tables
> and do away with that LATIN1 hack. That would also address the
> problem that underlies the original request. No need for signaling
> the request encoding!

If the Postgres dictionary client encoding changes from LATIN1 to
UTF8, at what layer do you want to convert "unspecified" non-ASCII
lookup keys to UTF8 (by encoding each non-ASCII byte as 2 UTF-8 bytes)?

If we don't do that, we'll get dictionary lookup errors for invalid
input strings, rather than no-match...

--
        Viktor.

Re: Add UTF8 support in PostgreSQL lookup table interface

Viktor Dukhovni
On Fri, Aug 24, 2018 at 05:14:46PM -0400, Viktor Dukhovni wrote:

> On Fri, Aug 24, 2018 at 04:54:32PM -0400, Wietse Venema wrote:
>
> > I think that we should by default extend this to all lookup tables
> > and do away with that LATIN1 hack. That would also address the
> > problem that underlies the original request. No need for signaling
> > the request encoding!
>
> If the Postgres dictionary client encoding changes from LATIN1 to
> UTF8, at what layer do you want to convert "unspecified" non-ASCII
> lookup keys to UTF8 (by encoding each non-ASCII byte as 2 UTF-8 bytes)?
>
> If we don't do that, we'll get dictionary lookup errors for invalid
> input strings, rather than no-match...

It seems you just want to skip the lookup instead...  I guess that'll
work too.

--
        Viktor.

Re: Add UTF8 support in PostgreSQL lookup table interface

Wietse Venema
In reply to this post by Viktor Dukhovni
Viktor Dukhovni:

> On Fri, Aug 24, 2018 at 04:54:32PM -0400, Wietse Venema wrote:
> > Even more relevant for the $Subject: several Postfix lookup tables
> > already block non-UTF8 queries when smtputf8 mode is turned
> > off (dict_ldap.c, dict_sqlite.c):
> >  
> >     /*
> >      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
> >      */
> >     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
> >         && !valid_utf8_string(name, strlen(name))) {
> >         if (msg_verbose)
> >             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
> >                      myname, dict_ldap->parser->name, name);
> >         return (0);
> >     }
> >  ...
> > I think that we should by default extend this to all lookup tables
> > and do away with that LATIN1 hack. That would also address the
> > problem that underlies the original request. No need for signaling
> > the request encoding!
>
> If the Postgres dictionary client encoding changes from LATIN1 to
> UTF8, at what layer do you want to convert "unspecified" non-ASCII
> lookup keys to UTF8 (by encoding each non-ASCII byte as 2 UTF-8 bytes)?

There is NO such conversion.

For the past 3-4 years, Postfix has enforced that a query can
only be well-formed UTF-8 for LDAP and SQLite tables EVEN IF SMTPUTF8
mode is turned off. And no-one has complained. I am arguing to
extend this filter to the other tables, possibly with backwards
compatibility.

There never was legitimate use of non-UTF8 in envelopes and headers.
It is a dying cause, therefore let's not extend its miserable life
by adding complexity to Postfix.

        Wietse

Re: Add UTF8 support in PostgreSQL lookup table interface

John Fawcett
In reply to this post by Wietse Venema
On 24/08/18 22:54, Wietse Venema wrote:

> Viktor Dukhovni:
>> Yes, but that'd have to be done by the dictionary lookup layer,
>> possibly in proxymap, based on a suitable signal from the lookup
>> client, but the low-level API (dict_get()) does not presently
>> support any per-lookup flags.  So we'd need dict_get_ex() that
>> takes a new utf8 flag and supporting changes throughout the
>> code.  This is a major change.
>>
>> Perhaps there's a clever way to avoid this, but I am not seeing it yet.
> I don't think that it is a good idea to signal the query string
> encoding through the Postfix dict(3) API. Primarily because there
> is no legitimate use for non-UTF8 in envelopes and headers, so why
> bother adding complexity now for a dying cause?
>
> Relevant for the $Subject: in UTF8 mode, the Postfix dict API will
> never look up a non-UTF-8 string, because it cannot exist, and it
> will never return a non-UTF-8 lookup result, for the same reasons.
> If non-UTF8 content were allowed in UTF-8 mode, then every string
> would be a potential bomb ready to explode.
>
> Even more relevant for the $Subject: several Postfix lookup tables
> already block non-UTF8 queries when smtputf8 mode is turned
> off (dict_ldap.c, dict_sqlite.c):
>
>     /*
>      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
>      */
>     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
>         && !valid_utf8_string(name, strlen(name))) {
>         if (msg_verbose)
>             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
>                      myname, dict_ldap->parser->name, name);
>         return (0);
>     }
>
> This code has been there for four years already. Never heard a peep.
>
> I think that we should by default extend this to all lookup tables
> and do away with that LATIN1 hack. That would also address the
> problem that underlies the original request. No need for signaling
> the request encoding!
>
> Wietse

Wietse

is the above check done in the individual dict modules redundant?

If DICT_FLAG_UTF8_ACTIVE is set, it means that the lookup method has
already been replaced by dict_utf8_lookup
(see util/dict_utf8.c):


    /*
     * Interpose on the lookup/update/delete methods. It is a conscious
     * decision not to tinker with the iterator or destructor.
     */
    backup->lookup = dict->lookup;
    backup->update = dict->update;
    backup->delete = dict->delete;

    dict->lookup = dict_utf8_lookup;
    dict->update = dict_utf8_update;
    dict->delete = dict_utf8_delete;

and where dict_utf8_lookup does this check:


    /*
     * Validate and optionally fold the key, and if invalid skip the
     * request.
     */
    if ((fold_res = dict_utf8_check_fold(dict, key, &utf8_err)) == 0) {
        msg_warn("%s:%s: non-UTF-8 key \"%s\": %s",
                 dict->type, dict->name, key, utf8_err);
        dict->error = DICT_ERR_NONE;
        return (0);
    }

John


Re: Add UTF8 support in PostgreSQL lookup table interface

Wietse Venema
Wietse:

>     /*
>      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
>      */
>     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
>         && !valid_utf8_string(name, strlen(name))) {
>         if (msg_verbose)
>             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
>                      myname, dict_ldap->parser->name, name);
>         return (0);
>     }
>
> This code has been [in dict_ldap.c and dict_sqlite.c] for four
> years already. Never heard a peep.

John Fawcett:
> is the above check done in the single dict modules redundant?

No.

- The above filter executes only if UTF8 mode is OFF.

- The dict_utf8 filter that you refer to executes only if UTF8 mode
  is ON.

        Wietse

Re: Add UTF8 support in PostgreSQL lookup table interface

John Fawcett
On 25/08/18 23:59, Wietse Venema wrote:

> Wietse:
>>     /*
>>      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
>>      */
>>     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
>>         && !valid_utf8_string(name, strlen(name))) {
>>         if (msg_verbose)
>>             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
>>                      myname, dict_ldap->parser->name, name);
>>         return (0);
>>     }
>>
>> This code has been [in dict_ldap.c and dict_sqlite.c] for four
>> years already. Never heard a peep.
> John Fawcett:
>> is the above check done in the single dict modules redundant?
> No.
>
> - The above filter executes only if UTF8 mode is OFF.
>
> - The dict_utf8 filter that you refer to executes only if UTF8 mode
>   is ON.
>
> Wietse

Ok, got it: I need to set smtputf8_enable = no to go through the code
path above.

The following trivially equivalent patch for the MySQL client seems to be ok
in my testing (i.e. it gives the same behaviour as now, except for invalid
UTF-8 characters in the lookup):

--- src/global/dict_mysql.c.orig    2018-08-26 10:14:29.085703480 +0200
+++ src/global/dict_mysql.c.new    2018-08-26 14:58:30.695898300 +0200
@@ -326,6 +326,18 @@
     dict->error = 0;
 
     /*
+     * Don't frustrate future attempts to make Postfix UTF-8 transparent.
+     */
+    if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
+        && !valid_utf8_string(name, strlen(name))) {
+        if (msg_verbose)
+            msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
+                     myname, dict_mysql->parser->name, name);
+        return (0);
+    }
+
+
+    /*
      * Optionally fold the key.
      */
     if (dict->flags & DICT_FLAG_FOLD_FIX) {


Maybe it would be better to put this code into dict_lookup() so it gets
used for all lookup tables, though that is more invasive and requires
testing across more table types.

On the original issue about Postgres, as you have stated, it would make
sense to take out the hard-coded LATIN1 encoding. The encoding
could then be specified in configuration files
(https://www.postgresql.org/docs/9.3/static/libpq-pgservice.html),
similar to the way the client character set encoding can be configured
for dict_mysql. Alternatively, the character set encoding could be read
from a new variable.

John


Re: Add UTF8 support in PostgreSQL lookup table interface

Wietse Venema
John Fawcett:

> On 25/08/18 23:59, Wietse Venema wrote:
> > Wietse:
> >>     /*
> >>      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
> >>      */
> >>     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
> >>         && !valid_utf8_string(name, strlen(name))) {
> >>         if (msg_verbose)
> >>             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
> >>                      myname, dict_ldap->parser->name, name);
> >>         return (0);
> >>     }
> >>
> >> This code has been [in dict_ldap.c and dict_sqlite.c] for four
> >> years already. Never heard a peep.
> > John Fawcett:
> >> is the above check done in the single dict modules redundant?
> > No.
> >
> > - The above filter executes only if UTF8 mode is OFF.
> >
> > - The dict_utf8 filter that you refer to executes only if UTF8 mode
> >   is ON.
> >
> > Wietse
>
> ok, got it, I need to put smtputf8_enable = no to go through the code
> path above.
>
> The following trivially equivalent patch for mysql client seems to be ok
> in my testing (ie gives same behaviour as now except for non valid utf8
> characters in the lookup):
>
> --- src/global/dict_mysql.c.orig    2018-08-26 10:14:29.085703480 +0200
> +++ src/global/dict_mysql.c.new    2018-08-26 14:58:30.695898300 +0200
> @@ -326,6 +326,18 @@
>      dict->error = 0;
>  
>      /*
> +     * Don't frustrate future attempts to make Postfix UTF-8 transparent.
> +     */
> +    if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
> +        && !valid_utf8_string(name, strlen(name))) {
> +        if (msg_verbose)
> +            msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
> +                     myname, dict_mysql->parser->name, name);
> +        return (0);
> +    }
> +
> +
> +    /*
>       * Optionally fold the key.
>       */
>      if (dict->flags & DICT_FLAG_FOLD_FIX) {
>
>
> Maybe it would be better to put this code into dict_lookup() so it gets
> used for all lookup tables, though that is more invasive and requires
> testing across more table types.

In non-UTF8 mode, that would change Postfix behavior with memcache,
Berkeley DB, LMDB, and other map types that currently don't care
about encodings.

Also, dict_lookup() would be the wrong place. It would miss all
the lookups made by calling dict->get() directly.

> On the original issue about Postgres, as you have stated, it would make
> sense to take out the hard-coded LATIN1 encoding. The encoding
> could then be specified in configuration files
> (https://www.postgresql.org/docs/9.3/static/libpq-pgservice.html),
> similar to the way the client character set encoding can be configured
> for dict_mysql. Alternatively, the character set encoding could be read
> from a new variable.

In UTF8 mode, Postfix can only ask well-formed UTF8 queries. That
is the longer-term future; the web is already 90+% UTF8
(including ASCII).

In non-UTF8 mode, there are no valid non-ASCII queries, so all we
can do is to limit the damage while not breaking existing sites
unnecessarily.

        Wietse