Extract e-mails from multiple pages of a website and list them
I would like to extract the e-mails of exhibitors from an exhibition website using Python. The page contains hyperlinks to the exhibitors; when an exhibitor's name is clicked, you are taken to the exhibitor's profile, which includes its e-mail address.
You can find the website here:
https://www.medica-tradefair.com/cgi-bin/md_medica/lib/pub/tt.cgi/Exhibitor_index_A-Z.html?oid=80398&lang=2&ticket=g_u_e_s_t
How can I do this using Python? Thank you in advance.
You can grab all the links to the exhibitors, then iterate through those and pull the e-mail for each one:
import requests
import bs4

# Fetch the A-Z exhibitor index page
url = 'https://www.medica-tradefair.com/cgi-bin/md_medica/lib/pub/tt.cgi/Exhibitor_index_A-Z.html?oid=80398&lang=2&ticket=g_u_e_s_t'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')

# Collect every link that points to an exhibitor profile page
links = soup.find_all('a', href=True)
exhibitor_links = ['https://www.medica-tradefair.com' + link['href']
                   for link in links if 'vis/v1/en/exhibitors' in link['href']]
exhibitor_links = list(set(exhibitor_links))  # de-duplicate

# Visit each profile and pull out the name and e-mail
for link in exhibitor_links:
    response = requests.get(link)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    name = soup.find('h1', {'itemprop': 'name'}).text
    try:
        email = soup.find('a', {'itemprop': 'email'}).text
    except AttributeError:  # no e-mail listed on this profile
        email = 'N/A'
    print('Name: %s\tEmail: %s' % (name, email))
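If you want to keep the results rather than just print them, you could append each name/e-mail pair to a list inside the loop and write it out with the standard-library `csv` module. A minimal sketch (the two sample rows here are hypothetical stand-ins for whatever the scraping loop collects):

```python
import csv

# Hypothetical results; in practice, build this list inside the scraping loop.
rows = [
    ('ACME Medical GmbH', 'info@acme-medical.example'),
    ('Beta Diagnostics', 'N/A'),
]

# Write a header row followed by one row per exhibitor
with open('exhibitors.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Email'])
    writer.writerows(rows)
```

The resulting `exhibitors.csv` opens directly in a spreadsheet, which is usually easier to work with than console output.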