
Python: decoding French characters in an HTML email attachment

I'm trying to decode an HTML attachment from an email that I fetch from an IMAP server. If the HTML file contains only plain ASCII characters it works without problems, but when it contains French characters such as é I get this: "vous a \xc3\xa9t\xc3\xa9 envoy\xc3\xa9e par". All the \n and \r sequences also show up literally in the output.

I use BeautifulSoup to search the HTML. I also use a loop to check all the mail (not shown in this code).

import email
import imaplib

imap_server = imaplib.IMAP4_SSL("server", 993)
imap_server.login(username, password)
imap_server.select("test")
# Find unseen messages and fetch the most recent one in full.
result, data = imap_server.uid('search', None, "UnSeen")
latest_email_uid = data[0].split()[-1]
result, data = imap_server.uid('fetch', latest_email_uid, '(RFC822)')
raw_email = data[0][1]  # raw message as bytes
raw_email = str(raw_email, 'UTF8')
msg = email.message_from_string(raw_email)

I walk through the message; when I find an HTML part I decode it from base64 and pass it to BeautifulSoup. After that I print it, encoded as UTF-8. If I replace .encode('utf-8') with .encode('latin-1') I still get escaped special characters.

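from bs4 import BeautifulSoup

if msg.is_multipart():
    for part in msg.walk():
        if part.get_content_type() == 'text/html':
            # decode=True undoes the base64 transfer encoding and returns bytes
            attachment = part.get_payload(decode=True)
            soup = BeautifulSoup(attachment)
            # this prints a bytes object, not text
            print(soup.prettify().encode('utf-8'))
        else:
            print("No HTML")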

I tried encoding and decoding with many charsets without getting a clean result. I have also tried base64.b64decode(text).decode('utf-16'), but I still get the same \xc3\xa9.

You see the escaped characters because you are encoding to UTF-8 or Latin-1, so print() is handed a bytes object instead of text:

>>> print('\xe9')
é
>>> print('\xe9'.encode('utf8'))
b'\xc3\xa9'
>>> print('\xe9'.encode('latin1'))
b'\xe9'
>>> print('Hello world!\n'.encode('utf8'))
b'Hello world!\n'

When printing a bytes object, Python shows the repr() representation of the value, which replaces any byte that does not represent a printable ASCII codepoint with a \x.. escape sequence; some are replaced with the shorter two-character escapes, such as \r and \n. This makes the representation both reusable as a Python bytes literal and easier to log to files and terminals that are not set up for international character sets.
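As a small illustration of that difference, here is a sketch using the phrase from the question. The variable names are only for illustration.

```python
import ast

# The UTF-8 bytes from the question, written as a bytes literal.
raw = b"vous a \xc3\xa9t\xc3\xa9 envoy\xc3\xa9e par\r\n"

# repr() of a bytes object escapes every byte outside printable ASCII.
shown = repr(raw)
print(shown)

# Decoding turns the bytes back into text, which print() shows as real characters.
text = raw.decode("utf-8")
print(text)

# The repr is a valid Python literal, so it round-trips to the same bytes.
assert ast.literal_eval(shown) == raw
```

The first print() shows the escaped form seen in the question; the second shows the accented text, provided the terminal can display it.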

print() handles encoding for you. Just print the .prettify() output directly.
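A minimal sketch of the same idea using only the standard library, so it runs without an IMAP server or BeautifulSoup. The helper name html_parts, the stand-in message, and the UTF-8 fallback charset are assumptions made for this example, not part of the original code.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def html_parts(raw_bytes):
    """Yield each text/html part of a raw message as decoded text (str)."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes the transfer encoding (e.g. base64) -> bytes
            payload = part.get_payload(decode=True)
            # Use the charset declared by the part; the fallback is an assumption.
            charset = part.get_content_charset() or "utf-8"
            yield payload.decode(charset, errors="replace")


# Stand-in for the bytes returned by the IMAP fetch in the question.
demo = MIMEMultipart()
demo.attach(MIMEText("<p>vous a été envoyée par</p>", "html", "utf-8"))

for html in html_parts(demo.as_bytes()):
    print(html)  # a str goes straight to print(); no .encode() call
```

In the question's code, the equivalent change would be to call print(soup.prettify()) without the trailing .encode('utf-8'), since prettify() already returns text when no encoding argument is given.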

If printing Unicode to your terminal or console does not work, and instead raises a UnicodeEncodeError, your terminal or console is not configured to handle Unicode text properly. Consult the PrintFails Python Wiki page to troubleshoot.
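One way to check what the console expects, and to degrade gracefully when it cannot represent a character, is sketched below. The helper name printable is hypothetical, and falling back to ASCII when no encoding is reported is an assumption.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent escaped."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace swaps unencodable characters for \x.. style escapes
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Shows which encoding print() will use for this console.
print("stdout encoding:", getattr(sys.stdout, "encoding", None))
print(printable("vous a été envoyée par"))
```

If the reported encoding is not UTF-8, setting the PYTHONIOENCODING environment variable, or calling sys.stdout.reconfigure(encoding="utf-8") on Python 3.7 and later, may be enough, as long as the terminal itself can display the characters.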
