
Python 2.7 BeautifulSoup email scraping stops before end of full database

Hope you are all well! I'm new and using Python 2.7. I'm trying to extract emails from a publicly available directory website that does not seem to have an API. This is the site: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search . The code stops gathering emails at the point on the page where it says "load more"! Here is my code:

import requests
import re
from bs4 import BeautifulSoup
file_handler = open('mail.txt','w')

soup  = BeautifulSoup(requests.get('http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search').content)
tags = soup('a')
list_new = []
for tag in tags:
    matches = re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', '%s' % tag)
    if matches:
        list_new = list_new + matches

for x in list_new:
    file_handler.write('%s\n'%x)
file_handler.close()

How can I make sure that the code goes all the way to the end of the directory and does not stop where it shows "load more"? Thanks. Warmest regards

You just need to post some data, in particular incrementing group_no, to simulate clicking the "load more" button:

from bs4 import BeautifulSoup
import requests

# you can set whatever here to influence the results
data = {"group_no": "1",
        "search": "category",
        "segment": "",
        "activity": "",
        "retail": "",
        "category": "",
        "Bpark": "",
        "alpha": ""} 

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"

with requests.Session() as s:
    soup = BeautifulSoup(
        s.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content,
        "html.parser")
    print([a["href"] for a in soup.select('a[href^="mailto:"]')])
    for i in range(1, 5):
        data["group_no"] = str(i)
        soup = BeautifulSoup(s.post(post, data=data).content, "html.parser")
        print([a["href"] for a in soup.select('a[href^="mailto:"]')])

To go all the way to the end, you can loop until the post returns no HTML, which signifies that we cannot load any more pages:

def yield_all_mails():
    data = {"group_no": "1",
            "search": "category",
            "segment": "",
            "activity": "",
            "retail": "",
            "category": "",
            "Bpark": "",
            "alpha": ""}

    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
    with requests.Session() as s:
        resp = s.get(start)
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
        i = 1
        while resp.content.strip():
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
            i += 1

So if we ran the function as below, setting "alpha": "Z" in the data dict to iterate over just the Z's:

from itertools import chain
for mail in chain.from_iterable(yield_all_mails()):
    print(mail)

We would get:

mailto:info@10pearls.com
mailto:fady@24group.ae
mailto:pepe@2heads.tv
mailto:2interact@2interact.us
mailto:gc@worldig.com
mailto:marilyn.pais@3i-infotech.com
mailto:3mgulf@mmm.com
mailto:venkat@4gid.com
mailto:info@4power.biz
mailto:info@4sstudyabroad.com
mailto:fouad@622agency.com
mailto:sahar@7quality.com
mailto:mike.atack@8ack.com
mailto:zyara@emirates.net.ae
mailto:aokasha@zynx.com

Process finished with exit code 0
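
If you want the bare addresses collected into mail.txt as in your original script, a minimal sketch (reusing yield_all_mails above and stripping the mailto: prefix from each href) could look like this:

from itertools import chain

# collect the addresses into mail.txt, dropping the "mailto:" prefix so the
# file holds bare email addresses as in the original script
with open('mail.txt', 'w') as file_handler:
    for mail in chain.from_iterable(yield_all_mails()):
        file_handler.write('%s\n' % mail.replace('mailto:', '', 1))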

You should put a sleep in between requests so you don't hammer the server and get yourself blocked.
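
Because yield_all_mails is a generator, each page is only requested when you actually ask for it, so one option (the one-second delay here is just an arbitrary choice) is to pause in the consuming loop between pages:

import time

for page in yield_all_mails():
    for mail in page:
        print(mail)
    time.sleep(1)  # pause before the generator issues the next "load more" post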
