
Decoding MIME email from Gmail API - \r\n and 3D - Python

I am currently using the Gmail API to read in some HTML emails in Python. I've decoded their body using:

base64.urlsafe_b64decode

After printing out the resulting HTML email, "\r\n" and "3D" are scattered throughout the HTML. I can't remove the "\r\n" because the \, r, \ and n seem to register as separate characters, and I'm not sure where the "3D" comes from.

Is there something wrong with how I'm decoding it?

Here is the code:

results = service.users().messages().list(userId='me', q = 'is: unread').execute()

for index in range(len(results['messages'])):
    message = service.users().messages().get(userId='me', id=results['messages'][index]['id'], format='raw').execute()

    msg_str = base64.urlsafe_b64decode(message['raw'].encode('UTF-8'))

    mime_msg = email.message_from_string(str(msg_str))

    print(mime_msg)

    service.users().messages().modify(userId='me', id=results['messages'][index]['id'], body = {'removeLabelIds': ['UNREAD']}).execute() # mark message as read

I found a solution: I stopped using Python's email library and cast msg_str (which is of type bytes) to a string. From there, I simply deleted '\\r\\n' from the string and replaced '=3D' with '='.

This is not a great solution; rather, use something like:

for email_part in message.walk(): 
    part_data = email_part.get_payload(decode=True) 

where message is a Python email.message.Message object. Then perhaps use something like BeautifulSoup to analyse the HTML. Hope that helps!
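To make the walk() suggestion concrete, here is a minimal sketch. The sample message, the example.com link and the html_parts list are made up for illustration; only walk(), get_payload(decode=True), get_content_type() and get_content_charset() come from the standard email package. With decode=True the payload comes back as bytes with the Content-Transfer-Encoding (quoted-printable here) already undone, so "=3D" and the "=\r\n" soft line breaks are gone.

```python
import email

# Hypothetical quoted-printable HTML message, standing in for the bytes
# obtained from base64.urlsafe_b64decode(message['raw']).
raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<a href=3D"https://example.com">a link that is long enough to be wra=\r\n'
    b"pped</a>\r\n"
)

# Parse bytes directly; no str() cast is needed.
message = email.message_from_bytes(raw)

html_parts = []
for email_part in message.walk():
    # Container parts (multipart/*) have no payload of their own.
    if email_part.is_multipart():
        continue
    if email_part.get_content_type() != "text/html":
        continue
    # decode=True reverses quoted-printable/base64 and returns bytes.
    part_data = email_part.get_payload(decode=True)
    charset = email_part.get_content_charset() or "utf-8"
    html_parts.append(part_data.decode(charset, errors="replace"))

html = "".join(html_parts)
print(html)
```

The resulting html string can then be handed to an HTML parser of your choice.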

maksel's solution worked for me, provided the byte string was decoded with .decode('utf-8'). The original code encoded the byte string instead of decoding it.

Hence, under Python 3.7 we can replace as follows:

msg = msg.replace('\r\n', '').replace('=3D', '=')

Be wary: this solution did not work for all HTML tags in my case.
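One likely reason the replace approach misses cases: "=3D" is the quoted-printable escape for "=", and quoted-printable also uses other "=XX" escapes plus "=" at the end of a line as a soft line break. A rough sketch of the difference, using a made-up body (the body variable and its contents are assumptions for illustration), with the standard library's quopri module doing the decoding. Note that quopri.decodestring should only be applied to the body of a part whose Content-Transfer-Encoding is quoted-printable, not to the whole raw message.

```python
import quopri

# Hypothetical quoted-printable body: "=3D" is "=", "=C3=A9" is the UTF-8
# encoding of "é", and the "=" before "\r\n" is a soft line break.
body = (
    b'<p style=3D"color:red">caf=C3=A9 with a long line that gets wrap=\r\n'
    b"ped</p>\r\n"
)

# The replace-based approach only handles "=3D" and leaves the rest behind.
naive = body.decode("utf-8").replace("\r\n", "").replace("=3D", "=")

# Decoding the quoted-printable layer first handles every escape.
decoded = quopri.decodestring(body).decode("utf-8")

print(naive)
print(decoded)
```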

I might be a bit late. Some of the solutions mentioned worked, but to help others who visit I thought I'd post this answer, as it looks a bit cleaner.

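When building the mail object, pass policy=email.policy.default. With this policy the body can be read with the quoted-printable escapes (the =3D and the stray line breaks mentioned above) already decoded. Note that email.message_from_string expects a str; if msg_str is still bytes, as in the question, use email.message_from_bytes instead.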

mailobject = email.message_from_string(msg_str,  policy=email.policy.default)

If on Python 3.6+ you can use get_body and get_content methods.

if mailobject.is_multipart():
    body = mailobject.get_body(('html',))
else:
    body = mailobject.get_body(('plain',))

if body:
    body = body.get_content()

print(body)

The code above is deliberately minimal, just enough to illustrate the answer. It assumes the body is either plain text or HTML. Remember to cater for other situations when handling emails.
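Putting the pieces together for the Gmail case, a sketch under a few assumptions: html_or_text_body is a hypothetical helper name, and sample/raw_field simulate what the Gmail API returns in message['raw'] (a base64url string) so the snippet runs without credentials. It parses the bytes directly with email.message_from_bytes, so no str() cast or .encode() is involved.

```python
import base64
import email
import email.policy


def html_or_text_body(raw_b64url):
    """Return the HTML body if present, else the plain-text body, else None."""
    msg_bytes = base64.urlsafe_b64decode(raw_b64url)
    mailobject = email.message_from_bytes(msg_bytes, policy=email.policy.default)
    # Prefer HTML, fall back to plain text.
    body = mailobject.get_body(preferencelist=("html", "plain"))
    return body.get_content() if body is not None else None


# Simulated message standing in for a real Gmail API response.
sample = (
    b"From: a@example.com\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<a href=3D"https://example.com">link</a>\r\n'
)
raw_field = base64.urlsafe_b64encode(sample).decode("ascii")

print(html_or_text_body(raw_field))
```

In the question's loop, the call would be html_or_text_body(message['raw']).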

An Additional Unrelated Tip:

As this is an encoding problem, the same answer also works in similar situations, such as parsing AWS SES emails pushed to S3, for forwarding, in an AWS Lambda function (Python). I mention it here because I ran into the same issue while working with those.

In that case, use it like this:

s3_file = object_s3['Body'].read()
mailobject = email.message_from_string(s3_file.decode('utf-8'),  policy=email.policy.default)
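A possible variation, in case the stored email is not valid UTF-8: email.message_from_bytes accepts the raw bytes directly, so the .decode('utf-8') step can be skipped. In this sketch s3_file is a hand-written stand-in for object_s3['Body'].read(), and the Latin-1 sample message is an assumption for illustration.

```python
import email
import email.policy

# Stand-in for object_s3['Body'].read(): an 8bit Latin-1 message whose
# body byte 0xE9 ("é" in ISO-8859-1) is not valid UTF-8.
s3_file = (
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)

# Decoding the whole file as UTF-8 fails for this message.
try:
    s3_file.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Parsing the bytes lets the email package apply the declared charset.
mailobject = email.message_from_bytes(s3_file, policy=email.policy.default)
text = mailobject.get_content()
print(utf8_ok, text)
```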
