Python 2.7 BeautifulSoup, email scraping

Hope you are all well. I'm new to Python and am using Python 2.7.

I'm trying to extract only the mailto links from this public business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The emails I'm looking for are the ones listed in every widget, from A to Z, in the full directory. Unfortunately, this directory does not have an API. I'm using BeautifulSoup, but with no success so far.
Here is my code:

import urllib
from bs4 import BeautifulSoup
website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://'+ website).read()
soup = BeautifulSoup(html)

tags = soup('a') 

for tag in tags:
    print tag.get('href', None)

What I get is just the site's own links, such as http://www.tecomdirectory.com, along with other hrefs, rather than the mailto links or websites in the widgets. I also tried replacing soup('a') with soup('target'), but no luck! Can anybody help me, please?

You cannot just find every anchor; you need to look specifically for "mailto:" in the href. You can use the CSS selector a[href^="mailto:"], which finds anchor tags whose href starts with mailto: (keep the value quoted, since newer BeautifulSoup versions can reject the unquoted form):

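import requests
from bs4 import BeautifulSoup

url = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# href values of every anchor whose href starts with "mailto:"
print([a["href"] for a in soup.select('a[href^="mailto:"]')])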

Or extract the link text:

print([a.text for a in soup.select('a[href^="mailto:"]')])
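The href values collected by the selector still carry the mailto: prefix, and a link may also include a ?subject= query or several comma-separated recipients. A minimal sketch of a helper that reduces an href to bare addresses follows; emails_from_mailto and the sample hrefs are hypothetical names and data used only for illustration, not part of BeautifulSoup:

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2.7


def emails_from_mailto(href):
    """Return the addresses in a mailto: href, without prefix or ?query."""
    prefix = "mailto:"
    if not href.lower().startswith(prefix):
        return []
    # Drop the scheme, then anything after "?" (subject=, body=, ...).
    address_part = href[len(prefix):].partition("?")[0]
    # A mailto: link may list several comma-separated recipients.
    return [unquote(a).strip() for a in address_part.split(",") if a.strip()]


# Hypothetical href values, standing in for what the selector returns.
hrefs = [
    "mailto:info@example.com",
    "mailto:a@example.com,b@example.com?subject=Hi",
]
emails = [e for h in hrefs for e in emails_from_mailto(h)]
print(emails)
```

Working from the href rather than the link text is probably the safer choice here, since the visible text of a mailto link is not guaranteed to be the address itself.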

Using find_all("a"), you would need a regex to achieve the same:

import re

soup.find_all("a", href=re.compile(r"^mailto:"))
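To compare the two approaches without requesting the live site, here is a small sketch that runs both on an inline snippet. The markup is made up for illustration and only assumes that a directory widget contains ordinary anchors; the real page structure may differ:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one directory widget.
html = """
<div class="widget">
  <a href="http://www.example.com">Website</a>
  <a href="mailto:info@example.com">info@example.com</a>
  <a href="mailto:sales@example.com">Contact sales</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: href begins with "mailto:".
via_select = [a["href"] for a in soup.select('a[href^="mailto:"]')]

# Same filter expressed as a regex on the href attribute.
via_find_all = [a["href"] for a in soup.find_all("a", href=re.compile(r"^mailto:"))]

print(via_select)
print(via_select == via_find_all)
```

Both lists should contain only the two mailto links, and the plain website link is skipped.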
