Python 2.7 BeautifulSoup, email scraping

Question

Hope you are all well. I'm new to Python and am using Python 2.7.

I'm trying to extract only the mailto links from this public business directory: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The emails I'm looking for are the ones listed in every widget, from A to Z, in the full directory. Unfortunately, this directory does not have an API. I'm using BeautifulSoup, but with no success so far.
Here is my code:

import urllib
from bs4 import BeautifulSoup
website = raw_input("Type website here:>\n")
html = urllib.urlopen('http://'+ website).read()
soup = BeautifulSoup(html)

tags = soup('a') 

for tag in tags:
    print tag.get('href', None)

What I get is just the site's own links, such as http://www.tecomdirectory.com and other hrefs, rather than the mailto links or websites in the widgets. I also tried replacing soup('a') with soup('target'), but no luck! Can anybody help me, please?

Answer 1

You cannot just find every anchor; you need to look specifically for "mailto:" in the href. You can use the CSS selector a[href^="mailto:"], which finds anchor tags whose href starts with mailto: (the value is quoted because a colon is not valid in an unquoted CSS attribute value):

import requests
from bs4 import BeautifulSoup

url = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# href of every anchor whose href starts with "mailto:"
print([a["href"] for a in soup.select('a[href^="mailto:"]')])

Or extract the text:

print([a.text for a in soup.select('a[href^="mailto:"]')])
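
To try the selector without hitting the live site, here is a minimal sketch that runs it against a small inline HTML snippet. The markup and addresses are made up for illustration and are not the directory's real structure, and `extract_emails` is a hypothetical helper name. It also strips the `mailto:` scheme and any `?subject=...` query, so only the bare address is left.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the directory page.
SAMPLE_HTML = """
<div class="widget">
  <a href="http://www.example.com">Example Co</a>
  <a href="mailto:info@example.com">info@example.com</a>
</div>
<div class="widget">
  <a href="mailto:sales@example.org?subject=Hello">Contact sales</a>
  <a href="/companies.php?page=2">Next</a>
</div>
"""


def extract_emails(html):
    """Return the addresses from every mailto: link, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    emails = []
    for a in soup.select('a[href^="mailto:"]'):
        # Drop the scheme, then any ?subject=... query that may follow.
        address = a["href"][len("mailto:"):].split("?", 1)[0]
        emails.append(address)
    return emails


emails = extract_emails(SAMPLE_HTML)
print(emails)
```

Taking the address from the href rather than from `a.text` may be the safer choice when the link text is a label such as "Contact sales" instead of the address itself.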

Using find_all("a"), you would need a regex to achieve the same:

import re

soup.find_all("a", href=re.compile(r"^mailto:"))
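
As a rough illustration of what the `href=re.compile(...)` filter does: BeautifulSoup checks each attribute value against the pattern with `search()`, so the `^` anchor is what restricts the match to values that begin with `mailto:`. The sketch below applies the same pattern to a hand-written list of href values (invented for this example) using only the standard library.

```python
import re

MAILTO = re.compile(r"^mailto:")

# Made-up href values of the kind an <a> tag might carry.
hrefs = [
    "http://www.example.com",
    "mailto:info@example.com",
    "/contact?via=mailto:someone",  # contains "mailto:" but does not start with it
    "mailto:sales@example.org",
]

# search() with a ^ anchor only succeeds when the value starts with "mailto:".
matches = [h for h in hrefs if MAILTO.search(h)]
print(matches)

# Removing the matched prefix leaves the bare address.
addresses = [MAILTO.sub("", h) for h in matches]
print(addresses)
```

Without the `^`, the third value would also match, which is why the anchored pattern is the closer equivalent of the `^=` CSS selector.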

Answer 1: accepted, 2016-09-23 13:34:31