简体   繁体   中英

Python 2.7 BeautifulSoup , email scraping

Hope you are all well. I'm new in Python and using python 2.7.

I'm trying to extract only the mailto from this public website business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
the mails i'm looking for are the emails mentioned in every widget from az in the full directory. This directory does not have an API unfortunately. I'm using BeautifulSoup, but with no success so far.
here is mycode:

import urllib
from bs4 import BeautifulSoup
website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://'+ website).read()
soup = BeautifulSoup(html)

tags = soup('a') 

for tag in tags:
    print tag.get('href', None)

what i get is just the website of the actual website , like http://www.tecomdirectory.com with other href rather then the mailto or websites in the widgets. i also tried replacing soup('a') with soup ('target'), but no luck! Can anybody help me please?

You cannot just find every anchor, you need to specifically look for "mailto:" in the href, you can use a css selector a[href^=mailto:] which finds anchor tags that have a href starting with mailto: :

import requests

soup  = BeautifulSoup(requests.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content)

print([a["href"] for a in soup.select("a[href^=mailto:]")])

Or extract the text:

print([a.text for a in soup.select("a[href^=mailto:]")])

Using find_all("a") you would need to use a regex to achieve the same:

import re

find_all("a", href=re.compile(r"^mailto:"))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM