
Python decode_header splits the original string

Using Python 3, I'm trying to parse e-mails from an mbox file.

import mailbox
from email.header import decode_header

for message in mailbox.mbox('file'):
    sender = message['From']
    c = decode_header(sender)

The raw e-mail has this particular From: header:

From: "=?UTF-8?Q?Mark_from_Site?=" <info@site.com>

In that case, c is:

[(b'"', None), (b'Mark from Site', 'utf-8'), (b'" <info@site.com>', None)]

In this case, the header is unexpectedly split into multiple elements around the quotation marks (").

Handling this is cumbersome, because the list may contain an arbitrary number of elements (not always 3 as above), depending on the number of quotation marks, and there may be other causes of splitting as well.

When nothing is encoded (that is, when the header is pure ASCII), there is no split and c holds a single element containing "Mark from Site" <info@site.com>.
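To make the difference concrete, here is a minimal sketch that runs decode_header on both variants of the header. The variable names `encoded` and `plain` are illustrative only; the header values are the ones quoted in the question.

```python
from email.header import decode_header

# Header with an RFC 2047 encoded word inside the quoted display name.
encoded = '"=?UTF-8?Q?Mark_from_Site?=" <info@site.com>'
# Same header with a pure-ASCII display name.
plain = '"Mark from Site" <info@site.com>'

for raw in (encoded, plain):
    parts = decode_header(raw)
    # Each part is a (value, charset) pair; value is bytes when an
    # encoded word was found, and the untouched str otherwise.
    print(len(parts), parts)
```

Note that the pure-ASCII case returns the value as a str rather than bytes, so code that processes the parts has to cope with both types.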

Is there a way to avoid this splitting for non-ASCII encodings as well?

Otherwise, how should this kind of header be parsed correctly?

What about doing the simplest thing, i.e. converting all parts to Unicode and then gluing them together:

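parts = decode_header(sender)  # "from" is a reserved word, so use from_; parts may be bytes or str
from_ = ''.join(t[0].decode(t[1] or 'UTF-8') if isinstance(t[0], bytes) else t[0] for t in parts)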
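The same joining idea can be sketched as a small helper. The function name `header_to_str` and the `fallback` parameter are hypothetical, and falling back to UTF-8 with replacement characters is an assumption made for this sketch, not something decode_header prescribes.

```python
from email.header import decode_header


def header_to_str(raw_header, fallback="utf-8"):
    """Decode every part returned by decode_header and join them."""
    pieces = []
    for value, charset in decode_header(raw_header):
        if isinstance(value, bytes):
            try:
                value = value.decode(charset or fallback, errors="replace")
            except LookupError:
                # The declared charset is not a codec Python knows about.
                value = value.decode(fallback, errors="replace")
        # str values (header without encoded words) are kept as they are.
        pieces.append(value)
    return "".join(pieces)


print(header_to_str('"=?UTF-8?Q?Mark_from_Site?=" <info@site.com>'))
```

Because the parts are joined without a separator, the quotation marks and the address end up exactly where they were in the raw header.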

You can have the email.header module handle encoding for you by creating an instance of email.header.Header with your string and the charset it should be encoded in.

import mailbox
from email.header import Header, decode_header

for message in mailbox.mbox('file'):
    sender = Header(message['From'], "utf-8")
    c = decode_header(sender)

import email.header

str(email.header.make_header(email.header.decode_header(encoded_string)))

Not too obvious, but this should decode the header, rebuild it correctly, and convert it to a string. I also found this somewhere here on StackOverflow.

Not sure if it's the most elegant way, but it seems to work for me.

See https://docs.python.org/3/library/email.header.html for the documentation of these functions.
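One possible way to sidestep the quote splitting altogether is to separate the display name from the address first, then decode only the name. This is a sketch that combines email.utils.parseaddr with the make_header/decode_header pair. It is not taken from the answers above, and the function name `split_and_decode` is hypothetical.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr


def split_and_decode(raw_from):
    """Return (display_name, address) with the display name decoded."""
    # parseaddr strips the surrounding quotes from the display name.
    name, address = parseaddr(raw_from)
    # Only the name can contain encoded words, so decode just that part.
    decoded_name = str(make_header(decode_header(name)))
    return decoded_name, address


raw = '"=?UTF-8?Q?Mark_from_Site?=" <info@site.com>'
print(split_and_decode(raw))
# ('Mark from Site', 'info@site.com')
```

Since the quotation marks are gone before decode_header runs, the name comes back as a single part and no gluing is needed.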
