
Convert encoded strings to normal printable characters

I am attempting to extract details from an MBOX file, and have created the following sample program.

This works, but some of the headers print encoded strings such as

 =?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=
 =?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=

I gather "=?UTF-8?B?" indicates Base64 encoding, so I guess there must be a two-step process: decode from Base64, then from UTF-8.

Can anyone point me to a method to convert these strings to normal printable characters?

#! /usr/bin/env python3
#import locale
#2020-02-27

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
from email.header import Header

for message in mailbox.mbox('~/temp/Inbox'):
    subject = message['subject']
    sender = message['from']
    ddate = message['Delivery-date']
    print(subject, sender)

I have made some progress: if I strip off the leading =?UTF-8?B? and the trailing ?= and then call base64.b64decode(), I get readable text.

The second string above becomes b'\xe2\x80\x99s attitude change', and

=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=

becomes b'ARM Macs are coming, three years after Apple'.

Concatenating these and decoding the result as UTF-8 gives the subject:

ARM Macs are coming, three years after Apple’s attitude change
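
For illustration, the two steps described above (Base64 first, then UTF-8) might be sketched like this. The helper name decode_b64_word is made up for this example, and the sketch only handles the =?UTF-8?B?...?= form shown in the question, nothing else.

```python
import base64

PREFIX = "=?UTF-8?B?"
SUFFIX = "?="

def decode_b64_word(word):
    """Step 1: strip the markers and Base64-decode one encoded word to bytes.

    Hypothetical helper; assumes the word is of the form =?UTF-8?B?...?=
    """
    if not (word.upper().startswith(PREFIX) and word.endswith(SUFFIX)):
        raise ValueError("not a UTF-8 Base64 encoded word: %r" % (word,))
    return base64.b64decode(word[len(PREFIX):-len(SUFFIX)])

words = [
    "=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=",
    "=?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=",
]

# Join the raw bytes first, then do step 2 (UTF-8 decoding) once on the whole
# value, so a multi-byte character is never decoded in pieces.
raw = b"".join(decode_b64_word(w) for w in words)
subject = raw.decode("utf-8")
print(subject)
```

Joining the bytes before decoding is a deliberate choice here: the second word starts with the three bytes of a single UTF-8 character, so decoding per word works for this subject but is not guaranteed to in general.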

Does this work?

#! /usr/bin/env python3
"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
from email.header import Header

for message in mailbox.mbox('~/temp/Inbox'):
    subject = message['subject']
    sender = message['from']
    ddate = message['Delivery-date']
    print(subject.decode('utf-8', 'ignore'), sender.decode('utf-8', 'ignore'))
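
Note that in Python 3 the header values returned by message['subject'] are typically str rather than bytes, so .decode() is not available on them. One possible variant of the loop, sketched here on the assumption that each message is parsed with email.policy.default, is to pass a factory to mailbox.mbox so that header values come back already decoded. The names _parse and iter_subjects are hypothetical.

```python
import email
import email.policy
import mailbox

def _parse(fp):
    # mailbox hands the factory a binary file-like object for one message.
    # Parsing with policy.default makes header lookups return decoded text.
    return email.message_from_binary_file(fp, policy=email.policy.default)

def iter_subjects(path):
    """Yield the decoded Subject of every message in an mbox file.

    Hypothetical helper; a missing Subject header is yielded as "".
    """
    box = mailbox.mbox(path, factory=_parse, create=False)
    try:
        for message in box:
            subject = message['subject']
            yield "" if subject is None else str(subject)
    finally:
        box.close()
```

With this in place, something like `for s in iter_subjects('~/temp/Inbox'): print(s)` should print readable subjects, assuming that path exists.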


I wrote a function to convert UTF-8 Base64 or Quoted Printable strings, although I am surprised that I couldn't find an existing method.

#! /usr/bin/env python3
#import locale
#2020-02-27

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
import base64, quopri

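def bdecode(s):
    """
    Convert UTF-8 Base64 or Quoted Printable strings to str
    """
    if s is None:
        return ""
    parts = []
    prev_encoded = False
    for word in str(s).split():   # split folded lines and multiple strings
        prefix = word[:10].upper()
        encoded = word.endswith('?=') and prefix in ('=?UTF-8?B?', '=?UTF-8?Q?')
        if encoded and prefix == '=?UTF-8?B?':
            text = base64.b64decode(word[10:-2]).decode("utf-8")
        elif encoded:
            # header=True decodes "_" as a space, as used in header Q encoding
            text = quopri.decodestring(word[10:-2], header=True).decode("utf-8")
        else:
            text = word
        # keep spaces between words, except between two adjacent encoded words
        if parts and not (prev_encoded and encoded):
            parts.append(' ')
        parts.append(text)
        prev_encoded = encoded
    return ''.join(parts)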

for message in mailbox.mbox('~/temp/Inbox'):
    subject = message['subject']
    print(bdecode(subject))
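
For what it's worth, the standard library does seem to cover this case: email.header.decode_header() parses encoded words of this kind (Base64 or Quoted Printable, in any charset), and email.header.make_header() joins the decoded pieces back into one value. A minimal sketch, where decode_mime_header is a made-up wrapper name and not part of the library:

```python
from email.header import decode_header, make_header

def decode_mime_header(value):
    """Return a header value with any encoded words decoded to text.

    Hypothetical wrapper around the stdlib calls; None becomes "".
    """
    if value is None:
        return ""
    return str(make_header(decode_header(value)))

# The folded Subject header from the question, as two encoded words.
raw = (
    "=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=\n"
    " =?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?="
)
print(decode_mime_header(raw))
```

In the loop above this would be used as print(decode_mime_header(message['subject'])). It also passes plain, unencoded headers through unchanged, so no prefix check is needed.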
