Decode 'quoted-printable' in python

I want to decode 'quoted-printable' encoded strings in Python, but I seem to be stuck at a point.

I fetch certain mails from my gmail account based on the following code:

import imaplib
import email
import quopri


mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('mail@gmail.com', '*******')
mail.list()

mail.select('"[Gmail]/All Mail"') 



typ, data = mail.search(None, 'SUBJECT', '"{}"'.format('123456'))

data[0].split()

print(data[0].split())

for e_mail in data[0].split():
    typ, data = mail.fetch(e_mail.decode(), '(RFC822)')
    raw_mail = data[0][1]
    email_message = email.message_from_bytes(raw_mail)
    if email_message.is_multipart():
        for part in email_message.walk():
            if part.get_content_type() == 'text/plain':
                body = part.get_payload()
                to = email_message['To']

                # this is the step that does not work for the headers
                utf = quopri.decodestring(to)

                text = utf.decode('utf-8')
                print(text)
.
.
.

For example, if the 'to' header contains characters such as é, á or ó, printing 'to' gives this:

=?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=

I can successfully decode the quoted-printable encoded 'body' using the quopri library like this:

quopri.decodestring(sometext).decode('utf-8') 

But the same logic doesn't work for other parts of the e-mail, such as the To, From and Subject headers.

Does anyone have a hint?

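The header string you have is not plain quoted-printable (i.e. not something quopri can decode on its own). It is a MIME encoded-word as defined in RFC 2047, of the form =?charset?encoding?payload?=. Here the charset is UTF-8 and the B marks the payload as base64 (a Q would mark it as a quoted-printable variant). You can decode it with the standard library: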

from email.header import decode_header

result = decode_header('=?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=')
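# ^ the result is a list of tuples of the form [(decoded_bytes, charset), ...]

for data, encoding in result:
    # a header without encoded-words comes back as str with encoding None
    text = data.decode(encoding or 'ascii') if isinstance(data, bytes) else data
    print(text)
    # outputs: Péter Petőcz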
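
To make the structure of such a header more concrete, here is a minimal sketch that takes a single encoded-word apart by hand. The function name decode_encoded_word is made up for illustration, and the sketch assumes exactly one well-formed encoded-word with no surrounding text; decode_header remains the more robust choice for real headers.

```python
import base64
import quopri


def decode_encoded_word(word):
    """Decode one encoded-word of the form '=?charset?encoding?payload?='."""
    # Drop the leading '=?' and trailing '?=', then split the three fields.
    charset, encoding, payload = word[2:-2].split('?', 2)
    if encoding.upper() == 'B':
        # 'B' means the payload is base64.
        raw = base64.b64decode(payload)
    elif encoding.upper() == 'Q':
        # 'Q' is the quoted-printable variant for headers;
        # header=True makes quopri treat '_' as a space.
        raw = quopri.decodestring(payload.encode('ascii'), header=True)
    else:
        raise ValueError('unknown encoding: ' + encoding)
    return raw.decode(charset)


print(decode_encoded_word('=?UTF-8?B?UMOpdGVyIFBldMWRY3o=?='))
```

This also shows why quopri.decodestring alone fails on the header: the payload here is base64, and the =?...?= wrapper is not part of either encoding.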

You are trying to decode Latin characters using UTF-8. The output you are getting is base64. It reads:

No printable characters found, try another source charset, or upload your data as a file for binary decoding.

Give this a try: "Python: Converting from ISO-8859-1/latin1 to UTF-8".

This solves it:

from email.header import decode_header


def mail_header_decoder(header):
    """Return the header as readable text, or None if the header is missing."""
    if header is None:
        return None
    header_new = []
    for part, charset in decode_header(header):
        if isinstance(part, bytes):
            # use the charset declared in the header; unencoded
            # segments come back as bytes with charset None
            part = part.decode(charset or 'ascii', errors='replace')
        header_new.append(part)
    return ''.join(header_new)  # convert list to string
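
As an alternative to a hand-written helper, the email package can decode headers and the body itself when the message is parsed with policy=email.policy.default. The sketch below is an assumption about how this could fit the question's loop: raw_mail is a made-up message standing in for data[0][1] from mail.fetch(), and the address peter@example.com is a placeholder.

```python
import email
from email import policy

# Hypothetical raw message standing in for data[0][1] from mail.fetch().
raw_mail = (
    b'To: =?UTF-8?B?UMOpdGVyIFBldMWRY3o=?= <peter@example.com>\r\n'
    b'Subject: =?UTF-8?Q?caf=C3=A9?=\r\n'
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'Szia P=C3=A9ter!\r\n'
)

# With the default policy, header values are decoded when accessed.
msg = email.message_from_bytes(raw_mail, policy=policy.default)

to = str(msg['To'])
subject = str(msg['Subject'])

# get_body() picks the text/plain part (or the message itself if it is
# not multipart); get_content() undoes the transfer encoding and charset.
body_part = msg.get_body(preferencelist=('plain',))
body = body_part.get_content() if body_part is not None else ''

print(to)
print(subject)
print(body)
```

With this approach neither quopri nor decode_header has to be called directly, since the transfer encoding of the body and the encoded-words in the headers are both handled by the parser.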

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please credit the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
