Convert encoded strings to normal printable characters
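I am trying to extract details from an MBOX file and have created the sample program below.
It works, but some headers are printed as encoded strings, for example: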
=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=
=?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=
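I gather that "=?UTF-8?B?" means the text is Base64-encoded, so I assume there must be a two-step process: decode the Base64, then decode the result as UTF-8.
Can anyone point me to a way of converting these strings to normal printable characters?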
#! /usr/bin/env python3
#import locale
#2020-02-27
"""
Extract Subject from MBOX file
"""
import os, time
import mailbox
from email.header import Header

for message in mailbox.mbox('~/temp/Inbox'):
    subject = message['subject']
    sender = message['from']
    ddate = message['Delivery-date']
    print(subject, sender)
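For reference, the "=?UTF-8?B?" marker is part of the RFC 2047 "encoded-word" layout, =?charset?encoding?payload?=, where the encoding letter is B (Base64) or Q (a quoted-printable variant). Here is a minimal sketch that splits such a word into its fields without decoding it. The regex and the helper name split_encoded_word are illustrative assumptions, not part of the program above:

```python
import re

# RFC 2047 encoded-word layout: =?charset?encoding?payload?=
# (illustrative pattern; it does not validate the charset name)
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

def split_encoded_word(word):
    """Return (charset, encoding, payload), or None if `word` is not an encoded word."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        return None
    charset, encoding, payload = match.groups()
    # Charset and encoding letter are case-insensitive, so normalise them.
    return charset.lower(), encoding.upper(), payload

print(split_encoded_word("=?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?="))
# ('utf-8', 'B', '4oCZcyBhdHRpdHVkZSBjaGFuZ2U=')
```

The charset field says how to turn the decoded bytes into text, which is why two steps are needed.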
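I have made some progress. If I strip off
=?UTF-8?B?
and
?=
and then call base64.b64decode(),
I get readable text.
=?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=
becomes b'\xe2\x80\x99s attitude change'
and
=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=
becomes 'ARM Macs are coming, three years after Apple'.
Joining these together gives the subject
ARM Macs are coming, three years after Apple’s attitude change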
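The manual steps described above could be sketched roughly as follows. The helper name decode_b64_word is made up for illustration, and the sketch only handles the "=?UTF-8?B?" case:

```python
import base64

PREFIX = "=?UTF-8?B?"
SUFFIX = "?="

def decode_b64_word(word):
    """Step 1: strip the markers and Base64-decode the payload to bytes."""
    if not (word.upper().startswith(PREFIX) and word.endswith(SUFFIX)):
        raise ValueError("not a UTF-8 Base64 encoded word: %r" % word)
    return base64.b64decode(word[len(PREFIX):-len(SUFFIX)])

words = [
    "=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=",
    "=?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=",
]

# Step 2: join the raw bytes first, then decode the whole thing as UTF-8.
raw = b"".join(decode_b64_word(w) for w in words)
subject = raw.decode("utf-8")
print(subject)  # ARM Macs are coming, three years after Apple’s attitude change
```

Joining the bytes before decoding is a deliberate choice in this sketch: it still gives the right text if a sender happens to split a multi-byte character across two encoded words.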
Does this work?
#! /usr/bin/env python3
"""
Extract Subject from MBOX file
"""
import mailbox

def to_text(value):
    # In Python 3 header values are normally already str, and str has no
    # .decode() method, so only bytes values are decoded here.
    if isinstance(value, bytes):
        return value.decode('utf-8', 'ignore')
    return value

for message in mailbox.mbox('~/temp/Inbox'):
    subject = message['subject']
    sender = message['from']
    ddate = message['Delivery-date']
    print(to_text(subject), to_text(sender))
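Decoding the header value as UTF-8 does not by itself undo the "=?UTF-8?B?...?=" wrapping. The standard library's email.header module has decode_header() and make_header() for that, and they can be combined roughly like this. The wrapper name decode_mime_header and the sample value are assumptions for illustration:

```python
from email.header import decode_header, make_header

def decode_mime_header(value):
    """Decode RFC 2047 encoded words in a raw header value; None becomes ''."""
    if value is None:
        return ""
    # decode_header() yields (bytes_or_str, charset) pairs;
    # make_header() reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(value)))

# Sample folded header value built from the two strings in the question.
raw_subject = (
    "=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=\n"
    " =?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?="
)
print(decode_mime_header(raw_subject))
# ARM Macs are coming, three years after Apple’s attitude change
```

In the loop above this would be called as decode_mime_header(message['subject']). It should also cover the Q encoding and charsets other than UTF-8.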
I wrote a function to convert UTF-8 Base64 or Quoted-Printable strings, although I am surprised that I could not find an existing method.
#! /usr/bin/env python3
#import locale
#2020-02-27
"""
Extract Subject from MBOX file
"""
import os, time
import mailbox
import base64, quopri

def bdecode(s):
    """
    Convert UTF-8 Base64 or Quoted-Printable encoded words to str
    """
    outstr = ""
    if s is None:
        return outstr
    prev_encoded = False
    for ss in str(s).splitlines():  # split folded (multiline) headers
        for sssp in ss.split():  # split into whitespace-separated words
            prefix = sssp[:10].upper()
            encoded = sssp.endswith('?=') and prefix in ('=?UTF-8?B?', '=?UTF-8?Q?')
            if encoded and prefix == '=?UTF-8?B?':
                bbb = base64.b64decode(sssp[10:-2])
            elif encoded:
                # header=True turns "_" into a space, as Q-encoding requires
                bbb = quopri.decodestring(sssp[10:-2], header=True)
            # whitespace between two adjacent encoded words is dropped;
            # everywhere else words are rejoined with a single space
            if outstr and not (encoded and prev_encoded):
                outstr += " "
            outstr += bbb.decode("utf-8") if encoded else sssp
            prev_encoded = encoded
    return outstr

for message in mailbox.mbox('~/temp/Inbox'):
    subject = message['subject']
    print(bdecode(subject))
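On the point about an existing method: one option that may be worth trying is to have mailbox parse each message with email.policy.default, so that message['subject'] comes back already decoded. A minimal sketch, assuming Python 3.6 or later. The helper name open_mbox and the throwaway sample mailbox are made up for the demo:

```python
import email
import email.policy
import mailbox
import os
import tempfile

def open_mbox(path):
    """Open an mbox whose messages are parsed with the modern email policy."""
    def factory(fp):
        # mailbox passes a binary file-like object covering one message.
        return email.message_from_binary_file(fp, policy=email.policy.default)
    return mailbox.mbox(path, factory=factory, create=False)

# Throwaway sample mailbox (hypothetical content, not the real Inbox).
sample = (
    b"From sender@example.com Thu Feb 27 12:00:00 2020\n"
    b"From: sender@example.com\n"
    b"Subject: =?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=\n"
    b" =?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=\n"
    b"\n"
    b"body\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Inbox")
    with open(path, "wb") as f:
        f.write(sample)
    box = open_mbox(path)
    # With policy.default, header values are str subclasses holding decoded text.
    subjects = [str(message["subject"]) for message in box]
    box.close()

print(subjects)
```

With this approach the decoding happens in the email package's own header parser, so no manual splitting on "=?UTF-8?B?" is needed.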