Python 2.7 BeautifulSoup email scraping stops before the end of the full database

Question

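Hope all is well! I'm a newbie using Python 2.7. I'm trying to extract emails from a publicly available directory website that doesn't seem to have an API. This is the site: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=&search=category&submit=Search
The code stops collecting emails at the point where the page shows "show more" at the bottom. Here is my code: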

import requests
import re
from bs4 import BeautifulSoup
file_handler = open('mail.txt','w')

soup  = BeautifulSoup(requests.get('http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search').content)
tags = soup('a') 
list_new =[]
for tag in tags:
    if (re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>',('%s'%tag))): list_new = list_new +(re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', ('%s'%tag)))

for x in list_new:
    file_handler.write('%s\n'%x)
file_handler.close()

How can I make sure the code reaches the end of the directory and doesn't stop where the page shows "load more"? Thanks. Warm regards

Answer 1

You just need to POST some data, in particular an incrementing group_no, to simulate clicking the "load more" button:

from bs4 import BeautifulSoup
import requests

# you can set whatever here to influence the results
data = {"group_no": "1",
        "search": "category",
        "segment": "",
        "activity": "",
        "retail": "",
        "category": "",
        "Bpark": "",
        "alpha": ""} 

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"

with requests.Session() as s:
    soup = BeautifulSoup(
        s.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content,
        "html.parser")
    # Quote the attribute value: newer BeautifulSoup releases reject an unquoted "mailto:".
    print([a["href"] for a in soup.select('a[href^="mailto:"]')])
    for i in range(1, 5):
        data["group_no"] = str(i)
        soup = BeautifulSoup(s.post(post, data=data).content, "html.parser")
        print([a["href"] for a in soup.select('a[href^="mailto:"]')])

To go all the way to the end, you can loop until the POST returns no HTML, which indicates that there are no more pages to load:

def yield_all_mails():
    data = {"group_no": "1",
            "search": "category",
            "segment": "",
            "activity": "",
            "retail": "",
            "category": "",
            "Bpark": "",
            "alpha": ""}

    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
    with requests.Session() as s:
        resp = s.get(start)
        # Reuse the response already fetched instead of requesting the page twice.
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
        i = 1
        while resp.content.strip():
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
            i += 1

So if we run the function as below, with "alpha": "Z" set so that it iterates over only the Z entries:

from itertools import chain
for mail in chain.from_iterable(yield_all_mails()):
    print(mail)

We get:

mailto:info@10pearls.com
mailto:fady@24group.ae
mailto:pepe@2heads.tv
mailto:2interact@2interact.us
mailto:gc@worldig.com
mailto:marilyn.pais@3i-infotech.com
mailto:3mgulf@mmm.com
mailto:venkat@4gid.com
mailto:info@4power.biz
mailto:info@4sstudyabroad.com
mailto:fouad@622agency.com
mailto:sahar@7quality.com
mailto:mike.atack@8ack.com
mailto:zyara@emirates.net.ae
mailto:aokasha@zynx.com

Process finished with exit code 0
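
The hrefs printed above still carry the mailto: prefix, while the question wanted bare addresses written to mail.txt. Here is a minimal sketch of that last step. The helper names clean_mailto and save_addresses are hypothetical and not part of the original answer. It also assumes you want duplicates dropped and any "?subject=..." suffix removed:

```python
def clean_mailto(hrefs):
    """Turn 'mailto:' hrefs into bare, de-duplicated addresses, keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip anything that is not a mailto link.
        if not href.lower().startswith("mailto:"):
            continue
        # Drop the scheme and any '?subject=...' query part.
        address = href[len("mailto:"):].split("?", 1)[0].strip()
        key = address.lower()
        if address and key not in seen:
            seen.add(key)
            result.append(address)
    return result


def save_addresses(addresses, path="mail.txt"):
    """Write one address per line, as the question's code did."""
    with open(path, "w") as handle:
        for address in addresses:
            handle.write("%s\n" % address)
```

With the answer's generator this could be used as save_addresses(clean_mailto(chain.from_iterable(yield_all_mails()))).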

You should sleep between requests so that you don't hammer the server and get yourself blocked.
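
One way to add that pause is to separate the paging loop from the HTTP call. This is only a sketch: fetch_until_empty, its fetch_page argument, the one-second default delay and the max_pages cap are all assumptions for illustration, not something the site or the answer specifies.

```python
import time


def fetch_until_empty(fetch_page, delay=1.0, max_pages=1000):
    """Yield bodies from fetch_page(1), fetch_page(2), ... until one is empty.

    fetch_page is a caller-supplied function (hypothetical here) that takes a
    page number and returns the response body for that page.
    """
    for page_no in range(1, max_pages + 1):
        body = fetch_page(page_no)
        # An empty (or whitespace-only) body means there is nothing left to load.
        if not body.strip():
            break
        yield body
        # Pause before the next request so the server is not flooded.
        time.sleep(delay)
```

Inside the answer's requests.Session block, fetch_page could be something like lambda n: s.post(post, data=dict(data, group_no=str(n))).content, and each yielded body would then be passed to BeautifulSoup as before.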


Answer 1: accepted, 2016-09-23 19:51:59
