
Python 2.7 BeautifulSoup email scraping stops before end of full database

Hope you are all well! I'm new and using Python 2.7! I'm trying to extract emails from a publicly available directory website that does not seem to have an API. This is the site: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The code stops gathering emails at the bottom of the page, where it says "load more". Here is my code:

import requests
import re
from bs4 import BeautifulSoup
file_handler = open('mail.txt','w')

soup  = BeautifulSoup(requests.get('http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search').content)
tags = soup('a') 
list_new =[]
for tag in tags:
    matches = re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', '%s' % tag)
    if matches:
        list_new = list_new + matches

for x in list_new:
    file_handler.write('%s\n'%x)
file_handler.close()

How can I make sure that the code goes to the end of the directory and does not stop where it shows "load more"? Thanks. Warmest regards

You just need to post some data, in particular incrementing group_no to simulate clicking the load more button:

from bs4 import BeautifulSoup
import requests

# you can set whatever here to influence the results
data = {"group_no": "1",
        "search": "category",
        "segment": "",
        "activity": "",
        "retail": "",
        "category": "",
        "Bpark": "",
        "alpha": ""} 

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"

with requests.Session() as s:
    soup = BeautifulSoup(
        s.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content,
        "html.parser")
    print([a["href"] for a in soup.select('a[href^="mailto:"]')])
    for i in range(1, 5):
        data["group_no"] = str(i)
        soup = BeautifulSoup(s.post(post, data=data).content, "html.parser")
        print([a["href"] for a in soup.select('a[href^="mailto:"]')])

To go until the end, you can loop until the post returns no HTML, which signifies that we cannot load any more pages:

def yield_all_mails():
    data = {"group_no": "1",
            "search": "category",
            "segment": "",
            "activity": "",
            "retail": "",
            "category": "",
            "Bpark": "",
            "alpha": ""}

    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
    with requests.Session() as s:
        resp = s.get(start)
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
        i = 1
        while resp.content.strip():
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
            i += 1

So if we ran the function like below, setting "alpha": "Z" to iterate over just the Z's:

from itertools import chain
for mail in chain.from_iterable(yield_all_mails()):
    print(mail)

We would get:

mailto:info@10pearls.com
mailto:fady@24group.ae
mailto:pepe@2heads.tv
mailto:2interact@2interact.us
mailto:gc@worldig.com
mailto:marilyn.pais@3i-infotech.com
mailto:3mgulf@mmm.com
mailto:venkat@4gid.com
mailto:info@4power.biz
mailto:info@4sstudyabroad.com
mailto:fouad@622agency.com
mailto:sahar@7quality.com
mailto:mike.atack@8ack.com
mailto:zyara@emirates.net.ae
mailto:aokasha@zynx.com

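Note the hrefs above still carry the mailto: scheme. If you want bare addresses, as in the question's mail.txt output, you can strip the prefix before writing them out. A small sketch (the sample hrefs are taken from the output above):

```python
def bare_address(href):
    """Return the email address from a mailto: href, leaving
    anything else unchanged."""
    prefix = "mailto:"
    return href[len(prefix):] if href.startswith(prefix) else href

hrefs = ["mailto:info@10pearls.com", "mailto:fady@24group.ae"]
addresses = [bare_address(h) for h in hrefs]
print(addresses)  # ['info@10pearls.com', 'fady@24group.ae']
```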

You should put a sleep between requests so you don't hammer the server and get yourself blocked.
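One simple way to add that delay is a small wrapper that spaces out consecutive calls (a sketch; the wrapper name and the one-second delay are arbitrary choices):

```python
import time

def throttled(func, delay=1.0):
    """Wrap func so consecutive calls are at least `delay`
    seconds apart -- a simple client-side rate limiter."""
    last = [0.0]  # time of the previous call
    def wrapper(*args, **kwargs):
        wait = last[0] + delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        last[0] = time.monotonic()
        return func(*args, **kwargs)
    return wrapper
```

With the session from the answer, post = throttled(s.post, delay=1.0) would then pause automatically between page requests without changing the rest of the loop.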
