How to extract email, telephone, fax number and address from many different HTML links by writing a Python script?

I tried this code but it isn't working right (it doesn't extract from all of the sites, among many other issues). Need help!

from bs4 import BeautifulSoup
import re
import requests

allsite = ["https://www.ionixxtech.com/", "https://sumatosoft.com", "https://4irelabs.com/", "https://www.leewayhertz.com/",
           "https://stackoverflow.com", "https://www.vardot.com/en", "http://www.clickjordan.net/", "https://vtechbd.com/"]

emails = []
tels = []

for l in allsite:
    r = requests.get(l)
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.find_all('a', attrs={'href': re.compile("^mailto:")}):
        emails.append(link.get('href'))
    for tel in soup.find_all('a', attrs={'href': re.compile("^tel:")}):
        tels.append(tel.get('href'))

print(emails)
print(tels)

This is neither a regex nor an HTML parsing issue. Print out r.content and you will notice (e.g. for https://vtechbd.com/) that the HTML source you are parsing isn't the same as the one rendered by your browser when you access the site:

    <!-- Contact Page -->
<section class="content hide" id="contact">
    <h1>Contact</h1>
    <h5>Get in touch.</h5>
    <p>Email: <a href="/cdn-cgi/l/email-protection#44642d2a222b04323021272c26206a272b29"><span class="__cf_email__" data-cfemail="2e474048416e585a4b4d464c4a004d4143">[email&#160;protected]</span></a><br />

So I assume the information you are interested in is loaded dynamically by some JavaScript. Python's requests library is an HTTP client, not a web scraper: it fetches the raw response and does not execute any JavaScript on the page.
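For the vtechbd.com snippet in particular, the address does not seem to be fetched separately: it looks like it is stored in the data-cfemail attribute and decoded by a script in the browser. Below is a minimal sketch of decoding it in Python. It assumes the commonly seen Cloudflare-style obfuscation, where the first hex byte is an XOR key for the remaining bytes. The names decode_cfemail and find_protected_emails are made up for this example and are not part of requests or BeautifulSoup.

```python
import re

# Assumption: Cloudflare-style email obfuscation, where the first hex byte
# of data-cfemail is an XOR key applied to every following byte.
CFEMAIL_ATTR = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode_cfemail(encoded: str) -> str:
    """Decode one data-cfemail hex string into a plain address."""
    data = bytes.fromhex(encoded)
    key, payload = data[0], data[1:]
    return bytes(b ^ key for b in payload).decode("utf-8")


def find_protected_emails(html: str) -> list[str]:
    """Return every address hidden in a data-cfemail attribute."""
    return [decode_cfemail(value) for value in CFEMAIL_ATTR.findall(html)]


# The span from the page source quoted above.
snippet = (
    '<span class="__cf_email__" '
    'data-cfemail="2e474048416e585a4b4d464c4a004d4143">'
    '[email&#160;protected]</span>'
)
print(find_protected_emails(snippet))  # ['info@vtechbd.com']
```

In the loop from the question you could call find_protected_emails(r.text) next to the mailto: search. This only covers the obfuscation case; pages that really build their content with JavaScript would still need something that runs the scripts.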

...also, it's not cool to ask people to debug your code because it's 5pm, you want to get out of the office and you hope somebody will have solved your issue by tomorrow morning... I may be wrong, but the way your question is asked leaves me under the impression you spent like 2 minutes pasting your source code in...
