
Map a Website using only Top-Level Domain

I am working on a script that will extract emails from a given website. One complication I am running into is that oftentimes the email(s) I am looking for will be on "Contact Us" or "Our People" pages. So far, what I have written looks for an email on the main webpage, i.e. www.examplecompany.com, and if it doesn't find anything, it looks for emails in the pages linked from that page. See below:

import requests, bs4, re, sys, logging
logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - %(levelname)s - %(message)s')
print('Fetching Website...')
target_URL = 'https://www.exampleURL.com' # full URL goes here (requests needs the scheme)
res = requests.get(target_URL) 
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
my_list = []
for link in soup.find_all('a'):
    my_list.append(link.get('href'))

emailregex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # local part
    @
    [a-zA-Z0-9.-]+       # domain
    \.[a-zA-Z]{2,4}      # top-level domain
    )''', re.VERBOSE)

# Converts each item in list to string
myemail_list = list(map(str, my_list))
# Filters out items in the list that do not fit the regex criteria
newlist = list(filter(emailregex.search, myemail_list))

if len(newlist) < 1:
    new_site = []
    for i in range(len(my_list)):
        new_site.append(f'{target_URL}{(my_list[i])}')
    try:
        for site in range(len(new_site)):
            newthing = requests.get(new_site[site])
            newthing.raise_for_status()
            freshsoup = bs4.BeautifulSoup(newthing.text, 'lxml')
    except requests.exceptions.HTTPError as e:
        pass

    final_list = []
    for link in freshsoup.find_all('a'):
        final_list.append(link.get('href'))
    print(final_list)
else:
    print(newlist)

I think the biggest issue I am dealing with is that my method of assembling and searching related URLs is just wrong. It works on some sites but not others, and it is error-prone. Can anyone give me a better idea of how to do this?
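
For reference, the standard library's urllib.parse.urljoin handles exactly this kind of base-plus-href assembly, including hrefs that are already absolute. A minimal sketch (all URLs below are placeholders):

from urllib.parse import urljoin

base = 'https://www.examplecompany.com/index.html'

# Relative href: resolved against the base page's directory
print(urljoin(base, 'contact.html'))             # https://www.examplecompany.com/contact.html
# Root-relative href: resolved against the site root
print(urljoin(base, '/our-people'))              # https://www.examplecompany.com/our-people
# Already-absolute href: passed through unchanged
print(urljoin(base, 'https://other.example/x'))  # https://other.example/x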

By the way, if it looks like I have no idea what I am doing, you are right. I just started learning Python, and this is a personal project to help me better grasp the basics, so any help is appreciated.

Thank you for your help.

Try:

import requests
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

all_links = []
mails = []

# your url here

url = 'https://kore.ai/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = [a.attrs.get('href') for a in soup.select('a[href]')]

# Keep only links that look like contact/careers/about/services pages
keywords = ('contact', 'career', 'about', 'services')
for link in links:
    if any(keyword in link.lower() for keyword in keywords):
        all_links.append(link)
all_links = set(all_links)
def find_mails(soup):
    # Scan the visible text of every <a> tag for something shaped like an email address
    for name in soup.find_all('a'):
        if name is not None:
            email_text = name.text
            match = bool(re.match(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$', email_text))
            if '@' in email_text and match:
                email_text = email_text.replace(' ', '').replace('\r', '')
                email_text = email_text.replace('\n', '').replace('\t', '')
                if email_text not in mails:
                    print(email_text)
                mails.append(email_text)
for link in all_links:
    # urljoin resolves relative, root-relative, and already-absolute hrefs;
    # naive string concatenation (url + link) breaks on many of them
    full_url = urljoin(url, link)
    r = requests.get(full_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    find_mails(soup)

mails = set(mails)
if len(mails) == 0:
    print('NO MAILS FOUND')
