
Unknown Encoding in IMAP Message

I am fetching the text/HTML BODY parts of email messages using the IMAP protocol.

To do this, I use the BODYSTRUCTURE call to obtain the BODY index and the charset of a part, then use the BODY[INDEX] call to obtain the raw text, and try to decode it with Python's decode function.

My problem is that even after I decode some text parts with the given charset (the charset returned by the BODYSTRUCTURE call for that part), they still appear to be encoded with some unknown encoding.

Only Portuguese, Spanish, and other Latin-script text shows this problem, so I assume this is some kind of Portuguese/Spanish encoding.

How do I detect this and decode it properly? I assumed that decoding the text with the given charset would leave no encoded characters, but since that is not what happens, is there a universal way to decode these characters?

I could try a list of common charsets in a try: except: loop and attempt to decode the text with each one, but I would honestly prefer a better solution.

Pseudocode is something like this:

# Fetch the BODYSTRUCTURE to find the part's BODY index and charset
data, result = imap_instance.uid('fetch', email_uid, '(BODYSTRUCTURE)')
part_body_index, part_charset = parse_BODY_index_and_charset_from_response(data)

# Fetch the raw content of that part
text_part, result = imap_instance.uid('fetch', email_uid, '(BODY[' + str(part_body_index) + '])')

if part_charset:
    try:
        text_part = text_part.decode(part_charset, 'ignore')
    except LookupError:
        # Unknown charset name: leave the raw text untouched
        pass

# After this, "text_part" should contain text with no encoded characters,
# but that is not the case

Examples of encoded text:

A 05/04/2013, =E0s 11:09, XYZ escreveu:>

This text was encoded with iso-8859-1; I decoded it and it still looks like this. The sequence =E0 in the string is the character "à".

In=EDcio da mensagem reenviada:

This text was encoded with windows-1252; I decoded it and it still looks like this. The sequence =ED in the string is the character "í".

You need to look at the Content-Transfer-Encoding information (which is actually returned in the BODYSTRUCTURE responses). You'll need to support both base64 and quoted-printable decoding. These encodings transform binary data (such as the UTF-8 or ISO-8859-1 encoding of a given text) into a 7-bit form that is safe for e-mail transfer. Only after you've undone the content transfer encoding should you go ahead and decode the text from a character encoding (like UTF-8, windows-1250, ISO-8859-x, and so on) to the Unicode representation that you work with.

Both of your examples are encoded using quoted-printable.
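
The two steps can be sketched roughly as follows. This is a minimal sketch, assuming the transfer encoding and charset have already been parsed out of the BODYSTRUCTURE response and the raw part is available as bytes; `decode_part` and its parameter names are hypothetical, and only the standard-library `quopri` and `base64` modules are used.

```python
import base64
import quopri


def decode_part(raw: bytes, transfer_encoding: str, charset: str) -> str:
    """Undo the content transfer encoding first, then decode the charset.

    Hypothetical helper: the arguments are assumed to come from the
    BODYSTRUCTURE response and the BODY[INDEX] fetch.
    """
    encoding = (transfer_encoding or "7bit").lower()
    if encoding == "quoted-printable":
        payload = quopri.decodestring(raw)
    elif encoding == "base64":
        payload = base64.b64decode(raw)
    else:
        # 7bit, 8bit and binary carry the bytes unchanged.
        payload = raw

    try:
        return payload.decode(charset or "ascii", errors="replace")
    except LookupError:
        # Unknown charset label: latin-1 maps every byte, so it cannot fail.
        return payload.decode("latin-1")


# The two examples from the question:
print(decode_part(b"A 05/04/2013, =E0s 11:09, XYZ escreveu:",
                  "QUOTED-PRINTABLE", "iso-8859-1"))
print(decode_part(b"In=EDcio da mensagem reenviada:",
                  "quoted-printable", "windows-1252"))
```

The transfer encoding name is compared case-insensitively because servers may return it in upper case. Falling back to latin-1 for an unknown charset label is a design choice of this sketch, not something the protocol prescribes.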
