
Map a Website using only Top-Level Domain

I am working on a script that will extract emails from a given website. One complication I keep running into is that the email(s) I am looking for are often on "Contact Us" or "Our People" pages rather than the home page. So far, my script looks for an email on the main webpage, i.e. www.examplecompany.com, and if it doesn't find anything there, it looks for emails in the pages linked from that page. See below:

import requests, bs4, re, sys, logging
logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - %(levelname)s - %(message)s')
print('Fetching Website...')
target_URL = 'https://www.exampleURL.com' #URL goes here (requests needs the scheme prefix)
res = requests.get(target_URL) 
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Collect the href of every <a> tag on the page (may include None)
my_list = [link.get('href') for link in soup.find_all('a')]

emailregex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+   # local part
    @
    [a-zA-Z0-9.-]+      # domain
    \.[a-zA-Z]{2,}      # top-level domain
    )''', re.VERBOSE)

# Converts each item in the list to a string (hrefs can be None)
myemail_list = list(map(str, my_list))
# Filters out items in the list that do not fit the regex criteria
newlist = list(filter(emailregex.search, myemail_list))

if len(newlist) < 1:
    # No emails on the main page: build a URL for each linked page.
    # Plain concatenation is the error-prone step discussed below.
    new_site = [f'{target_URL}{link}' for link in my_list]
    final_list = []
    for site in new_site:
        try:
            newthing = requests.get(site)
            newthing.raise_for_status()
        except requests.exceptions.RequestException:
            continue  # skip links that fail instead of aborting the whole run
        freshsoup = bs4.BeautifulSoup(newthing.text, 'html.parser')
        # Collect the hrefs from each linked page
        for link in freshsoup.find_all('a'):
            final_list.append(link.get('href'))
    print(final_list)
else:
    print(newlist)

I think the biggest issue I am dealing with is that my method of putting together and searching for related URLs is just wrong. It works on some sites but not on others, and it is error prone. Can anyone give me a better idea of how to do this?
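
For reference, the standard-library way to combine a base URL with the hrefs collected above is urllib.parse.urljoin, which handles the cases that plain string concatenation gets wrong (absolute links, root-relative paths, missing slashes). A minimal sketch, using made-up example URLs:

from urllib.parse import urljoin

# urljoin resolves each href against the base URL, so absolute links,
# root-relative links ('/contact') and page-relative links ('about.html')
# all come out as complete, fetchable URLs.
base = 'https://www.examplecompany.com/'
for href in ['/contact', 'about.html', 'https://other.example/team']:
    print(urljoin(base, href))
# https://www.examplecompany.com/contact
# https://www.examplecompany.com/about.html
# https://other.example/team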

By the way, if it looks like I have no idea what I am doing, you are right. I just started learning Python, and this is a personal project to help me better grasp the basics, so any help is appreciated.

Thank you for your help.

Try:

import requests
import re
from bs4 import BeautifulSoup

all_links = []
mails = []

# your url here
url = 'https://kore.ai/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = [a.attrs.get('href') for a in soup.select('a[href]')]

# Keep only the pages most likely to list contact emails
for i in links:
    if any(key in i.lower() for key in ('contact', 'career', 'about', 'services')):
        all_links.append(i)
all_links = set(all_links)
def find_mails(soup):
    # Check the text of every <a> tag for something shaped like an email address
    for name in soup.find_all('a'):
        email_text = name.text.replace(' ', '').replace('\r', '')
        email_text = email_text.replace('\n', '').replace('\t', '')
        match = bool(re.match(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$', email_text))
        if '@' in email_text and match:
            if email_text not in mails:
                print(email_text)
            mails.append(email_text)
for link in all_links:
    if link.startswith('www'):
        link = 'https://' + link  # requests needs an explicit scheme
    elif not link.startswith('http'):
        link = url + link         # resolve a relative href against the base URL
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    find_mails(soup)

mails = set(mails)
if len(mails) == 0:
    print("NO MAILS FOUND")
