
convert encoded strings to normal printable characters

I am attempting to extract details from an MBOX file and have created the following sample program.

This works, but some of the headers print as encoded strings such as

 =?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=
 =?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=

I gather "=?UTF-8?B?" indicates Base64 encoding, so I guess there must be a two-step process: decode from Base64 first, then decode the resulting bytes as UTF-8.

Can anyone point me to a method to convert these strings to normal printable characters?

#! /usr/bin/env python3
#import locale
#2020-02-27

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
from email.header import Header

# mailbox.mbox() does not expand "~", so expand it explicitly
for message in mailbox.mbox(os.path.expanduser('~/temp/Inbox')):
    subject = message['subject']
    sender = message['from']
    ddate = message['Delivery-date']
    print(subject, sender)

I have made some progress: if I strip off the leading

=?UTF-8?B?

and the trailing

?=

and then call base64.b64decode(), I get readable text.

=?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=

becomes b'\xe2\x80\x99s attitude change'

=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=

becomes b'ARM Macs are coming, three years after Apple'

Decoding both results as UTF-8 and concatenating them gives the Subject

ARM Macs are coming, three years after Apple’s attitude change
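
The manual steps described above can be sketched as follows. This is only an illustration: it assumes every piece is a UTF-8 Base64 encoded word, and the helper name decode_utf8_b_word is made up for the example.

```python
import base64

# Hypothetical helper: handles only the "=?UTF-8?B?...?=" form shown above
def decode_utf8_b_word(word):
    prefix, suffix = "=?UTF-8?B?", "?="
    payload = word[len(prefix):-len(suffix)]   # strip the wrapper
    raw = base64.b64decode(payload)            # step 1: Base64 -> bytes
    return raw.decode("utf-8")                 # step 2: UTF-8 bytes -> str

parts = [
    "=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=",
    "=?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=",
]

# Adjacent encoded words are joined without the whitespace that separated them
subject = "".join(decode_utf8_b_word(p) for p in parts)
print(subject)
```

Note that the apostrophe comes from the second word: the bytes \xe2\x80\x99 are the UTF-8 encoding of the right single quotation mark (U+2019).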

Does this work?

#! /usr/bin/env python3
"""
Extract Subject from MBOX file
"""

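import os, time
import mailbox
from email.header import decode_header, make_header

def decode(value):
    # In Python 3 header values are str, which has no .decode();
    # decode the RFC 2047 encoded words with email.header instead.
    return '' if value is None else str(make_header(decode_header(value)))

for message in mailbox.mbox(os.path.expanduser('~/temp/Inbox')):
    subject = message['subject']
    sender = message['from']
    ddate = message['Delivery-date']
    print(decode(subject), decode(sender))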


I wrote a function to convert UTF-8 Base64 or Quoted-Printable strings, although I am surprised that I couldn't find an existing method.

#! /usr/bin/env python3
#import locale
#2020-02-27

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
import base64, quopri

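def bdecode(s):
    """
    Convert UTF-8 Base64 or Quoted-Printable encoded words to str
    """
    outstr = ""
    if s is None:
        return outstr
    prev_encoded = False   # was the previous word an encoded word?
    for sssp in s.split():   # split on any whitespace, including folded lines
        head = sssp.upper()
        encoded = sssp.endswith('?=') and head.startswith(('=?UTF-8?B?', '=?UTF-8?Q?'))
        if encoded and head.startswith('=?UTF-8?B?'):
            word = base64.b64decode(sssp[10:-2]).decode("utf-8")
        elif encoded:
            # header=True also turns "_" into a space, as Q encoding requires
            word = quopri.decodestring(sssp[10:-2], header=True).decode("utf-8")
        else:
            word = sssp
        # whitespace between two adjacent encoded words is not part of the text;
        # everywhere else the separating space is kept
        if outstr and not (encoded and prev_encoded):
            outstr += " "
        outstr += word
        prev_encoded = encoded
    return outstr

for message in mailbox.mbox(os.path.expanduser('~/temp/Inbox')):
    subject = message['subject']
    print(bdecode(subject))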
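
For comparison, the standard library does appear to cover this case: email.header.decode_header() splits a header value into (bytes, charset) pairs, and make_header() reassembles them into a Header whose str() is the decoded text. A minimal sketch, assuming the header value has already been read as a str; the helper name decode_mime_header and the ISO-8859-1 example header are invented for illustration.

```python
from email.header import decode_header, make_header

def decode_mime_header(value):
    """Hypothetical helper: decode RFC 2047 encoded words in a header value."""
    if value is None:
        return ""
    # decode_header() yields (bytes_or_str, charset) pairs;
    # make_header() rebuilds a Header whose str() is the decoded text
    return str(make_header(decode_header(value)))

# The folded Subject from the question (two Base64 encoded words)
folded = ("=?UTF-8?B?QVJNIE1hY3MgYXJlIGNvbWluZywgdGhyZWUgeWVhcnMgYWZ0ZXIgQXBwbGU=?=\n"
          " =?UTF-8?B?4oCZcyBhdHRpdHVkZSBjaGFuZ2U=?=")
print(decode_mime_header(folded))

# An invented header mixing plain text with a Q-encoded, non-UTF-8 word
mixed = "Re: =?ISO-8859-1?Q?Caf=E9_menu?="
print(decode_mime_header(mixed))
```

Because the charset is read from each encoded word, this approach is not limited to UTF-8, which a hand-written check for the "=?UTF-8?" prefix would be.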
