How to extract email, telephone, fax number and address from many different HTML links with a Python script?

I tried the code below, but it isn't working properly: it doesn't extract from all of the sites, among other issues. I need help!

from bs4 import BeautifulSoup
import re
import requests

allsite = ["https://www.ionixxtech.com/", "https://sumatosoft.com", "https://4irelabs.com/", "https://www.leewayhertz.com/",
           "https://stackoverflow.com", "https://www.vardot.com/en", "http://www.clickjordan.net/", "https://vtechbd.com/"]

emails = []
tels = []

for url in allsite:
    try:
        # A timeout keeps one slow or dead site from hanging the whole run
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except requests.RequestException as exc:
        print(f"skipping {url}: {exc}")
        continue
    soup = BeautifulSoup(r.content, "html.parser")
    # Only <a> tags whose href starts with mailto: or tel: are collected
    for link in soup.find_all('a', href=re.compile(r"^mailto:")):
        emails.append(link.get('href'))
    for tel in soup.find_all('a', href=re.compile(r"^tel:")):
        tels.append(tel.get('href'))

print(emails)
print(tels)

This is neither a regex nor an HTML parsing issue. Print out r.content and you will notice (e.g. for https://vtechbd.com/) that the HTML source you are parsing isn't the same as the one rendered by your browser when you access the site:

    <!-- Contact Page -->
<section class="content hide" id="contact">
    <h1>Contact</h1>
    <h5>Get in touch.</h5>
    <p>Email: <a href="/cdn-cgi/l/email-protection#44642d2a222b04323021272c26206a272b29"><span class="__cf_email__" data-cfemail="2e474048416e585a4b4d464c4a004d4143">[email&#160;protected]</span></a><br />

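So I assume the information you are interested in is filled in dynamically by some JavaScript: here the address sits behind a /cdn-cgi/l/email-protection link and reads "[email protected]" until a script decodes it. Python's requests library is an HTTP client, not a web scraper; it returns the response as the server sent it and never runs the page's scripts.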
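For the snippet above specifically, the `__cf_email__` class and the `/cdn-cgi/l/email-protection` path suggest Cloudflare's email obfuscation. Assuming that is what the site uses, the hex string in `data-cfemail` can be decoded without a browser: the first byte is an XOR key and the remaining bytes are the address XORed with it. The following is only a sketch under that assumption; `decode_cfemail` and `find_cfemails` are hypothetical helper names, and it does nothing for content that is genuinely loaded by other scripts.

```python
import re


def decode_cfemail(encoded):
    """Decode one data-cfemail hex string (assumed Cloudflare scheme)."""
    data = bytes.fromhex(encoded)
    key = data[0]  # the first byte is the XOR key
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def find_cfemails(html):
    """Return the decoded address for every data-cfemail attribute in html."""
    hex_strings = re.findall(r'data-cfemail="([0-9a-fA-F]+)"', html)
    return [decode_cfemail(h) for h in hex_strings]


# The attribute value is copied from the vtechbd.com snippet above.
snippet = '<span class="__cf_email__" data-cfemail="2e474048416e585a4b4d464c4a004d4143">'
print(find_cfemails(snippet))  # ['info@vtechbd.com']
```

In the original script the same decoder could be applied to each tag returned by `soup.select("[data-cfemail]")`, reading the hex string from `tag["data-cfemail"]`. Sites that build their contact details with other JavaScript would still need a tool that actually executes the page.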

Also, it's not cool to ask people to debug your code because it's 5pm, you want to get out of the office, and you hope somebody will have solved your issue by tomorrow morning. I may be wrong, but the way your question is asked leaves me with the impression that you spent about two minutes pasting your source code in.
