Message encoding by guessing

wesley

Hi

How can I guess the character encoding of a message body when the message has no MIME charset set?  The message may be encoded in UTF-8, GB2312, GBK, or something else, but it has no charset header.

Thanks 

Re: Message encoding by guessing

Bastian Blank-3
On Sun, Feb 09, 2020 at 01:45:21PM +0300, [hidden email] wrote:
> How can I guess the character encoding of a message body when the message has no MIME charset set?  The message may be encoded in UTF-8, GB2312, GBK, or something else, but it has no charset header.

Well, text/*, with the historic exception of text/html, defaults to
US-ASCII if no charset is specified, so no 8-bit characters are allowed.

Bastian

--
The heart is not a logical organ.
                -- Dr. Janet Wallace, "The Deadly Years", stardate 3479.4

Re: Message encoding by guessing

Viktor Dukhovni
In reply to this post by wesley
On Sun, Feb 09, 2020 at 01:45:21PM +0300, [hidden email] wrote:

> How can I guess the character encoding of a message body when the
> message has no MIME charset set?  The message may be encoded in UTF-8,
> GB2312, GBK, or something else, but it has no charset header.

You could run the text through "iconv -f <take-a-guess>", and
see what comes out.
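
The same trial-and-error idea can be sketched in Python rather than iconv.
This is only an illustration: the function name guess_decode and the
candidate list are assumptions, not anything from the thread, and a decode
that succeeds only shows the bytes are valid in that encoding, not that the
guess is right.

```python
# Hypothetical helper: try each candidate encoding in order and return the
# first one that decodes the bytes without error (strict decoding).
# The candidate list is an assumption; order matters, strictest first.
CANDIDATES = ("ascii", "utf-8", "gbk")


def guess_decode(raw, candidates=CANDIDATES):
    """Return (encoding_name, text) for the first strict decode, else None."""
    for name in candidates:
        try:
            return name, raw.decode(name)  # errors="strict" is the default
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    return None


if __name__ == "__main__":
    print(guess_decode("你好".encode("gbk")))
```

ASCII and UTF-8 go first because they reject most foreign byte sequences;
permissive multi-byte encodings such as GBK accept a lot of input, so they
are better tried late.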

For valid (correctly minimally encoded) utf-8:

    https://en.wikipedia.org/wiki/UTF-8#Description

every non-ascii character sequence starts with an initial byte that is
in the range:

        0b11000010  - 0xc2 hex or 194 decimal, through:
        0b11110100 -- 0xf4 hex or 244 decimal

and continues with more bytes that are all in the range

    0b10xxxxxx  - 0x80--0xbf hex or 128--191 decimal

the number of such bytes in each group (including the initial byte) is
equal to the number of consecutive non-zero bits starting with the MSB
in the first byte.
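
The byte-pattern rule above can be sketched as a small Python check.  The
function name looks_like_utf8 is made up for this example.  It tests only
the structure described here (lead-byte range, continuation-byte range,
group length), so it is slightly looser than a strict decoder: it does not
reject overlong three- and four-byte forms or encoded surrogates.  For a
strict answer, data.decode("utf-8") is the simpler route.

```python
def looks_like_utf8(data):
    """Structural check following the lead/continuation byte rule.

    Hypothetical helper; looser than a strict UTF-8 decoder.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # plain ASCII byte
            i += 1
            continue
        if not 0xC2 <= b <= 0xF4:    # not a valid initial byte
            return False
        # Group length = number of consecutive 1 bits starting at the MSB.
        length, mask = 0, 0x80
        while b & mask:
            length += 1
            mask >>= 1
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:  # group truncated at end of input
            return False
        if any(not 0x80 <= t <= 0xBF for t in tail):
            return False             # continuation byte out of range
        i += length
    return True


if __name__ == "__main__":
    print(looks_like_utf8("你好".encode("utf-8")))
    print(looks_like_utf8("你好".encode("gbk")))
```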

For some other random code page, good luck!  But Windows-1252 is pretty
common for things mostly in the Latin alphabet.

--
    Viktor.