Discrepancy in text/plain content encoding returned by Gmail API

Question

I am experimenting with reading multipart/mixed emails through the Gmail API.
The goal is to correctly decode each text/plain part of a multipart/mixed email (there can be many, in different encodings) to a C# string (i.e. UTF-16):

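using System;
using System.Linq;
using System.Text;

public static string DecodeTextPart(Google.Apis.Gmail.v1.Data.MessagePart part)
{
    // Header names are case-insensitive, so match "Content-Type" in any casing.
    var content_type_header = part.Headers.FirstOrDefault(h => string.Equals(h.Name, "content-type", StringComparison.OrdinalIgnoreCase));

    if (content_type_header == null)
        throw new ArgumentException("No content-type header found in the email part");

    // Parse the header value to get the media type and the charset parameter.
    var content_type = new System.Net.Mime.ContentType(content_type_header.Value);

    if (!string.Equals(content_type.MediaType, "text/plain", StringComparison.OrdinalIgnoreCase))
        throw new ArgumentException("The part is not text/plain");

    // Decode the raw bytes using the charset the part declares for itself.
    // Note: CharSet is null when the header has no charset= parameter, and GetEncoding(null) throws.
    return Encoding.GetEncoding(content_type.CharSet).GetString(GetAttachmentBytes(part.Body));
}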

GetAttachmentBytes returns the raw attachment bytes, with no charset conversion, after decoding them from the base64url encoding that Gmail uses.
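For readers who want to reproduce this, here is a minimal sketch of the base64url step that a helper like GetAttachmentBytes would need. The class and method names (Base64UrlSketch, DecodeBase64Url) are hypothetical and not part of the original code; the sketch assumes the input is the base64url string from the part body, and it does not cover bodies that have to be fetched separately by attachment ID.

```csharp
using System;

public static class Base64UrlSketch
{
    // Hypothetical helper: decodes a base64url string (RFC 4648, section 5)
    // into raw bytes. No charset conversion happens here.
    public static byte[] DecodeBase64Url(string data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // Map the URL-safe alphabet back to the standard base64 alphabet.
        string s = data.Replace('-', '+').Replace('_', '/');

        // Restore the '=' padding that base64url input often omits.
        switch (s.Length % 4)
        {
            case 2: s += "=="; break;
            case 3: s += "="; break;
            case 1: throw new FormatException("Invalid base64url length");
        }

        return Convert.FromBase64String(s);
    }
}
```

The bytes returned here are exactly what Encoding.GetString later receives, so any charset mismatch shows up only at that later step.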

What I find is that in many cases this produces invalid strings, because the raw bytes that I get for the attachment content appear to always be UTF-8, even though the Content-Type of that same part declares otherwise.

E.g. given the email:

Date: ...
From: ...
Reply-To: ...
Message-ID: ...
To: ...
Subject: Test 1 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----------0E50FC0802A2FCCAA"

------------0E50FC0802A2FCCAA
Content-Type: text/plain; charset=windows-1251
Content-Transfer-Encoding: 8bit


Content test: Cyrillic, Windows-1251 (à, ÿ, æ)
------------0E50FC0802A2FCCAA
Content-Type: TEXT/PLAIN;
 name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
 filename="Irrelevant.txt"

VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0E50FC0802A2FCCAA--

, I successfully find the first part, the code above determines that it is charset=windows-1251 with the help of System.Net.Mime.ContentType, and then .GetString() returns garbage because the actual raw bytes returned by GetAttachmentBytes are UTF-8, not Windows-1251.

Exactly the same happens with

Subject: Test 2 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----------0B716C1D8123D8710"

------------0B716C1D8123D8710
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 8bit


Content test: Cyrillic, koi-8 (Б, С, Ц)
------------0B716C1D8123D8710
Content-Type: TEXT/PLAIN;
 name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
 filename="Irrelevant.txt"

VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0B716C1D8123D8710--

Note that the three test letters in the parentheses after the encoding name are the same in both emails, and in Unicode look like (а, я, ж), but (correctly) look wrong in the email body representation quoted above because of the different encodings.
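To illustrate why the same three letters show up differently in the two quoted bodies, here is a small sketch that encodes a string in one charset and decodes the same bytes in another. CharsetDemo and Reinterpret are hypothetical names. The sketch assumes a runtime where CodePagesEncodingProvider is available (it ships with .NET Core 3.0 and later; on .NET Framework it comes from the System.Text.Encoding.CodePages package). Which code pages produced the quoted renderings is my inference from the characters, not something the emails state.

```csharp
using System;
using System.Text;

public static class CharsetDemo
{
    static CharsetDemo()
    {
        // Legacy code pages such as windows-1251 and koi8-r are not built in
        // on .NET Core / .NET 5+, so register the code page provider once.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encodes `text` using the charset `encodeAs`, then decodes the very same
    // bytes as if they were in the charset `decodeAs`.
    public static string Reinterpret(string text, string encodeAs, string decodeAs)
    {
        byte[] raw = Encoding.GetEncoding(encodeAs).GetBytes(text);
        return Encoding.GetEncoding(decodeAs).GetString(raw);
    }
}
```

With this, Reinterpret("а, я, ж", "windows-1251", "iso-8859-1") gives "à, ÿ, æ" and Reinterpret("а, я, ж", "koi8-r", "windows-1251") gives "Б, С, Ц", which matches the two quoted bodies. Calling Reinterpret("а, я, ж", "utf-8", "windows-1251") reproduces the kind of garbage the function in the question returns when the bytes are really UTF-8.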

If I "fix" the function to always use Encoding.UTF8 instead of GetEncoding(content_type.CharSet), then it appears to work in the tests that I've done so far.

At the same time, the Gmail interface displays the letters correctly in both cases, so it must have parsed the incoming emails using the correct declared encodings.

Is it the case that the Gmail API re-encodes all text chunks into UTF-8 (wrapped in base64url), but still reports the original charset for them?
Am I therefore supposed to always use UTF-8 with the Gmail API and disregard the charset= of the Content-Type?
Or is there a problem with my code?

Answer 1

According to these two resources:

The Value is indeed a base-64 encoded representation of the part converted to UTF-8.

This is, however, not documented by Google, as far as I can find.
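Under that assumption, a minimal sketch of the decoding step could look like the following. GmailTextSketch and DecodeBodyData are hypothetical names, the input is assumed to be the base64url string from the part body, and the UTF-8 behavior is taken from the observation above rather than from official documentation.

```csharp
using System;
using System.Text;

public static class GmailTextSketch
{
    // Strict UTF-8 decoder: throws on invalid byte sequences instead of
    // silently substituting U+FFFD replacement characters.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Assumption: the body bytes are UTF-8 regardless of the charset
    // declared in the part's Content-Type header.
    public static string DecodeBodyData(string base64UrlData)
    {
        if (base64UrlData == null) throw new ArgumentNullException(nameof(base64UrlData));

        // base64url -> standard base64, restoring any omitted padding.
        string s = base64UrlData.Replace('-', '+').Replace('_', '/');
        s = s.PadRight(s.Length + (4 - s.Length % 4) % 4, '=');

        byte[] raw = Convert.FromBase64String(s);
        return StrictUtf8.GetString(raw);
    }
}
```

The strict decoder is a deliberate choice here: if a body ever turns out not to be UTF-8, GetString throws a DecoderFallbackException rather than returning quietly corrupted text, which makes it easier to notice when the assumption does not hold.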

solution1
2 ACCEPTED 2020-01-09 14:35:17
