imap_tools Taking Long Time to Scrape Links from Emails

I am using imap_tools to get links from emails. The emails are very small, with very little text, graphics, etc. There are also not many of them: around 20-40 spread through the day.

When a new email arrives, it takes between 10 and 25 seconds to scrape the link. This seems very long. I would have expected it to take less than 2 seconds, and speed is important.

N.B. It is a shared mailbox, and I cannot simply fetch unseen emails because other users will often have opened them before the scraper gets to them.

Can anyone see what the issue is?

import datetime
import re
import time

import pandas as pd
from imap_tools import MailBox, AND
from config import email, password

imap_server = "imap.mail.yahoo.com"
# UIDs already handled today; the list is reset when the date changes.
data = {
    'today': datetime.date.today(),
    'uids': []
    }
while True:
    try:
        # A new login is made on every pass of the loop.
        client = MailBox(imap_server).login(email, password, 'INBOX')
        try:
            today = datetime.date.today()
            if data['today'] != today:
                data['today'] = today
                data['uids'] = []
            # Downloads every message dated today on each pass.
            msgs = client.fetch(AND(date_gte=today))
            for msg in msgs:
                links = []
                if msg.date.date() == today and msg.uid not in data['uids']:
                    mail = msg.html
                    if 'order' in mail and 'cancel' not in mail:
                        for i in re.findall(r'(https?://[^\s]+)', mail):
                            if 'pick' in i:
                                # Strip quotes and cut at the first angle bracket.
                                link = i.replace('"', "")
                                link = link.replace('<', '>').split('>')[0]
                                print(link)
                                links.append(link)
                                break
                    data['uids'].append(msg.uid)
                    scr_links = pd.DataFrame({'Links': links})
                    scr_links.to_csv('Links.csv', mode='a', header=False, index=False)
                    time.sleep(0.5)
        except Exception as e:
            print(e)
        client.logout()
        time.sleep(5)
    except Exception as e:
        print(e)
        print('sleeping for 1 sec')
        time.sleep(1)

I think this is a throttling timeout on the email server.

Take a look at IMAP IDLE.

imap_tools cannot do it, but you may want to implement it :D

https://github.com/ikvk/imap_tools/issues/93
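
Independently of IDLE, it may be worth reducing how much work each pass does. The loop in the question logs in again and downloads every message dated today on each pass, which could add to the delay if the server is throttling. Below is a minimal sketch, not a tested fix, that keeps one connection open, asks the server only for UIDs, and fetches bodies just for UIDs not yet processed. The names `extract_pick_link`, `poll_once`, `main` and `poll_interval` are hypothetical; the sketch assumes `mailbox.uids()` and the `mark_seen` argument of `fetch()` are available in the installed imap_tools version, and it omits reconnect handling for dropped connections. The regex stops at quotes and angle brackets instead of cleaning the match afterwards, which is close to, but not identical to, the original parsing.

```python
import re
import time

LINK_RE = re.compile(r'https?://[^\s"<>]+')


def extract_pick_link(html):
    """Return the first URL containing 'pick' from an order email, else None."""
    if 'order' not in html or 'cancel' in html:
        return None
    for url in LINK_RE.findall(html):
        if 'pick' in url:
            return url
    return None


def poll_once(mailbox, seen_uids, criteria):
    """Fetch only messages whose UID has not been processed yet."""
    # mailbox.uids() runs a search and returns UID strings, not bodies.
    fresh = [uid for uid in mailbox.uids(criteria) if uid not in seen_uids]
    if not fresh:
        return []
    # 'UID 1,2,3' is a plain IMAP search key. mark_seen=False leaves the
    # seen flag untouched for the other users of the shared mailbox.
    messages = list(mailbox.fetch('UID ' + ','.join(fresh), mark_seen=False))
    seen_uids.update(fresh)
    return messages


def main(poll_interval=5):
    # Imported here so the helpers above can be exercised without a server.
    import datetime
    from imap_tools import MailBox, AND
    from config import email, password  # the asker's own config module

    seen_uids = set()
    # One login for the whole session instead of one per pass.
    with MailBox('imap.mail.yahoo.com').login(email, password, 'INBOX') as mailbox:
        while True:
            criteria = AND(date_gte=datetime.date.today())
            for msg in poll_once(mailbox, seen_uids, criteria):
                link = extract_pick_link(msg.html)
                if link:
                    print(link)
            time.sleep(poll_interval)
```

Calling `main()` starts the loop. Because processed UIDs are tracked locally rather than through the seen flag, it does not matter whether another user has already opened a message.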
