How to efficiently parse emails without touching attachments using Python

I'm playing with Python imaplib (Python 2.6) to fetch emails from GMail. Every time I fetch an email with the method http://docs.python.org/library/imaplib.html#imaplib.IMAP4.fetch I get the whole email. I need only the text part and the names of the attachments, without downloading the attachments themselves. How can this be done? I see that emails returned by GMail follow the same format that browsers send to HTTP servers.

Take a look at this recipe: http://code.activestate.com/recipes/498189/

I adapted it slightly to print the From, Subject, Date, names of attachments, and message body (just plaintext for now; it's trivial to add HTML messages).

I used the Gmail POP3 server in this case, but it should work for IMAP as well.

import poplib, email

# Written for Python 2.6, as in the question.
mailserver = poplib.POP3_SSL('pop.gmail.com')
mailserver.user('recent:YOURUSERNAME') #use 'recent mode'
mailserver.pass_('YOURPASSWORD') #consider not storing in plaintext!

numMessages = len(mailserver.list()[1])
for i in reversed(range(numMessages)):
    # retr() returns (response, lines, octets); POP3 message numbers start at 1
    msg = mailserver.retr(i+1)
    raw = "\n".join(msg[1])
    mail = email.message_from_string(raw)

    message = ""
    body = ""
    message += "From: " + (mail["From"] or "") + "\n"
    message += "Subject: " + (mail["Subject"] or "") + "\n"
    message += "Date: " + (mail["Date"] or "") + "\n"

    for part in mail.walk():
        if part.is_multipart():
            continue
        dtypes = part.get_params(None, 'Content-Disposition')
        if not dtypes:
            if part.get_content_type() == 'text/plain':
                # text/plain part without a disposition: treat it as the body
                body += "\n" + part.get_payload(decode=True) + "\n"
                continue
            # no Content-Disposition header: fall back to the name= parameter
            # of Content-Type
            for key,val in part.get_params() or []:
                if key.lower() == 'name':
                    message += "Attachment:" + val + "\n"
                    break
        else:
            attachment,filename = None,None
            for key,val in dtypes:
                key = key.lower()
                if key == 'filename':
                    filename = val
                if key == 'attachment':
                    attachment = 1
            if attachment:
                message += "Attachment:" + (filename or "(unnamed)") + "\n"
            elif part.get_content_type() == 'text/plain':
                # Content-Disposition: inline text still belongs to the body
                body += "\n" + part.get_payload(decode=True) + "\n"
    # append the body once, after all parts have been examined
    if body:
        message += body + "\n"
    print(message)
    print("")

This should be enough to get you heading in the right direction.
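For readers on Python 3, the same idea (walk the MIME parts, keep the plain-text body, and collect attachment names) might be sketched with the newer `email.policy.default` API. This is only an illustration: `summarize` is a hypothetical helper name, and the demo message is built locally so the sketch runs without a mail server. Like the script above, it still needs the full raw message as input.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize(raw_bytes):
    """Return (headers, body_text, attachment_names) for one raw message."""
    mail = message_from_bytes(raw_bytes, policy=policy.default)
    headers = {name: str(mail.get(name, "")) for name in ("From", "Subject", "Date")}
    # get_body() picks the best text/plain candidate, or None if there is none.
    body_part = mail.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    # iter_attachments() skips the body part and yields the remaining parts.
    names = [part.get_filename() or "(unnamed)" for part in mail.iter_attachments()]
    return headers, body, names


# Throwaway message with made-up addresses and a fake PDF payload.
demo = EmailMessage()
demo["From"] = "alice@example.com"
demo["Subject"] = "Report"
demo["Date"] = "Mon, 01 Jan 2024 10:00:00 +0000"
demo.set_content("See attached.")
demo.add_attachment(b"%PDF-1.4 fake", maintype="application",
                    subtype="pdf", filename="report.pdf")

headers, body, names = summarize(demo.as_bytes())
print(headers["Subject"], names)
```

With a live POP3 connection, the bytes passed to `summarize` would be the lines returned by `retr()` joined with `b"\n"`.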

You can get only the plain text of the email by doing something like:

connection.fetch(id, '(BODY[1])')

For the Gmail messages I've seen, section 1 has the plaintext, including multipart junk. This may not be robust for every message.
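A fetch like the one above returns a nested list rather than a plain string, so a small unwrapping step is usually needed. The sketch below assumes Python 3 and uses two hypothetical helper names, `first_literal` and `fetch_text_part`. It also swaps `BODY[1]` for `BODY.PEEK[1]`, the IMAP4rev1 variant that does not mark the message as read. The sample response is hand-written to show the typical shape and is not real server output.

```python
def first_literal(fetch_data):
    """Return the payload bytes from an imaplib fetch() result, or b"" if absent."""
    for item in fetch_data:
        # imaplib hands back literals as (envelope, payload) tuples;
        # the remaining items are plain bytes such as b")".
        if isinstance(item, tuple) and len(item) == 2:
            return item[1]
    return b""


def fetch_text_part(connection, msg_num, section="1"):
    """Fetch a single MIME section without setting the Seen flag."""
    typ, data = connection.fetch(msg_num, "(BODY.PEEK[%s])" % section)
    if typ != "OK":
        raise RuntimeError("FETCH failed: %r" % (data,))
    return first_literal(data)


# Hand-written example of the response shape, for illustration only.
sample = [(b"1 (BODY[1] {13}", b"Hello, world!"), b")"]
print(first_literal(sample))
```

Here `connection` is expected to be a logged-in `imaplib.IMAP4_SSL` object with a mailbox already selected. Whether section 1 really is the plain-text part depends on the structure of the individual message.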

I don't know how to get the names of the attachments without fetching all of it. I haven't tried using partials.
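One avenue that may be worth trying for the attachment names is the IMAP `BODYSTRUCTURE` fetch item, which returns a description of the MIME tree (types, parameters, sizes) without the part contents. The sketch below assumes Python 3. The helper names `attachment_names` and `fetch_attachment_names` are hypothetical, and the regular expression is a rough shortcut rather than a real BODYSTRUCTURE parser: it will also pick up named inline parts and does not decode RFC 2231 or encoded-word file names. The sample response is made up for illustration.

```python
import re

# A "NAME" or "FILENAME" key followed by its quoted value.
_NAME_RE = re.compile(rb'"(?:FILE)?NAME"\s+"((?:[^"\\]|\\.)*)"', re.IGNORECASE)


def attachment_names(bodystructure):
    """Pull candidate file names out of a raw BODYSTRUCTURE response (bytes)."""
    seen = []
    for raw in _NAME_RE.findall(bodystructure):
        name = raw.decode("utf-8", "replace")
        if name not in seen:  # NAME and FILENAME often repeat the same value
            seen.append(name)
    return seen


def fetch_attachment_names(connection, msg_num):
    """Request only the MIME structure of a message and extract file names."""
    typ, data = connection.fetch(msg_num, "(BODYSTRUCTURE)")
    if typ != "OK":
        raise RuntimeError("FETCH failed: %r" % (data,))
    # Flatten the response: items are bytes, or tuples of bytes for literals.
    blob = b" ".join(
        b" ".join(item) if isinstance(item, tuple) else item
        for item in data if item
    )
    return attachment_names(blob)


# Made-up response for a message with a text part and one PDF attachment.
sample = (b'1 (BODYSTRUCTURE (("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL '
          b'"7BIT" 14 1 NIL NIL NIL)("APPLICATION" "PDF" ("NAME" "report.pdf") '
          b'NIL NIL "BASE64" 4554 NIL ("ATTACHMENT" ("FILENAME" "report.pdf")) '
          b'NIL) "MIXED" ("BOUNDARY" "xyz") NIL NIL))')
print(attachment_names(sample))
```

Once the structure is known, the section number of the text part can be passed to a `BODY[...]` fetch, so the attachment sections never have to be requested.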

I'm afraid you're out of luck. According to this post, there are only two parts to the email: the header and the body. The body is where the attachments are, if there are any, and you have to download the whole body before extracting only the message text. The info about the FETCH command found here also supports this opinion. While it says you can extract partials of the body, these are specified in terms of octets, which doesn't really help.
