
Scraping E-mails from Websites

I have tried several iterations from other posts and nothing seems to be working for my needs.

I have a list of URLs that I want to loop through, pulling all associated URLs that contain email addresses. I then want to store the URLs and email addresses in a CSV file.

For example, if I went to 10torr.com, the program should find each of the pages within the main URL (e.g. 10torr.com/about) and pull any emails.

Below is a list of the 5 example websites that are in a data frame when run through my code. They are saved under the variable small_site.

A helpful answer will include the use of the user-defined function listed below called get_info(). Hard-coding the websites into the Spider itself is not a feasible option, as this will be used by many other people with different website list lengths.

    Website
    http://10torr.com/
    https://www.10000drops.com/
    https://www.11wells.com/
    https://117westspirits.com/
    https://www.onpointdistillery.com/
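
For reference, here is a minimal sketch of how such a data frame might be built. The variable name small_site and the column name Website match the code below; constructing it inline like this is just an illustration, since in practice the list comes from elsewhere.

import pandas as pd

# Illustrative only: the five example websites above, held in a data frame
# with a 'Website' column, as the code below expects.
small_site = pd.DataFrame({
    'Website': [
        'http://10torr.com/',
        'https://www.10000drops.com/',
        'https://www.11wells.com/',
        'https://117westspirits.com/',
        'https://www.onpointdistillery.com/',
    ]
})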

Below is the code that I am running. The spider seems to run, but there is no output in my CSV file.


import os
import pandas as pd
import re
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

small_site = site.head()


#%% Start Spider
class MailSpider(scrapy.Spider):

    name = 'email'

    def parse(self, response):

        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

    def parse_link(self, response):

        for word in self.reject:
            if word in str(response.url):
                return

        html_text = str(response.text)
        mail_list = re.findall('\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)

        df.to_csv(self.path, mode='a', header=False)


#%% Preps a CSV File
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False
def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return 

    with open(path, 'wb') as file: 
        file.close()


#%% Defines function that will extract emails and enter it into CSV
def get_info(url_list, path, reject=[]):

    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)


    print('Collecting Google urls...')
    google_urls = url_list


    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.start() 

    for i in small_site.Website.iteritems():
        print('Searching for emails...')
        process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
        ##process.start()

        print('Cleaning emails...')
        df = pd.read_csv(path, index_col=0)
        df.columns = ['email', 'link']
        df = df.drop_duplicates(subset='email')
        df = df.reset_index(drop=True)
        df.to_csv(path, mode='w', header=True)


    return df


url_list = small_site
path = 'email.csv'

df = get_info(url_list, path)

I am not certain where I am going wrong, as I am not getting any error messages. If you need additional information, please just ask. I have been trying to get this working for almost a month now, and I feel like I am just banging my head against the wall at this point.

The majority of this code was found in the article Web scraping to extract contact information — Part 1: Mailing Lists after a few weeks of searching. However, I have not been successful in expanding it to my needs. It worked without problems for one-off runs while using their Google search function to get the base URLs.

Thank you in advance for any assistance you are able to provide.

It took a while, but the answer finally came to me. The following is how the final solution came to be. This will work with a changing list, as the original question requires.

The change ended up being very minor. I needed to add the following user-defined function.

def get_urls(io, sheet_name):
    data = pd.read_excel(io, sheet_name)
    urls = data['Website'].to_list()
    return urls

From there, it was a simple change to the get_info() user-defined function. We needed to set google_urls in this function to the result of our get_urls function and pass in the list. The full code for this function is below.

def get_info(io, sheet_name, path, reject=[]):
    
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)
    
    print('Collecting Google urls...')
    google_urls = get_urls(io, sheet_name)
    
    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
    process.start()
    
    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    
    return df
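
A call to the updated function might then look like the following; the workbook path and sheet name here are placeholders rather than values from the original post.

io = 'websites.xlsx'    # hypothetical Excel workbook with a 'Website' column
sheet_name = 'Sheet1'   # hypothetical sheet name
path = 'email.csv'

df = get_info(io, sheet_name, path)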

No other changes were needed to get this to run. Hopefully this helps.

I modified some scripts and ran the following script via the shell, and it works. Maybe it will provide you with a starting point.

I advise you to use the shell, as it always shows errors and other messages during the scraping process.


import re

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class MailSpider(scrapy.Spider):

    name = 'email'
    start_urls = [
        'http://10torr.com/',
        'https://www.10000drops.com/',
        'https://www.11wells.com/',
        'https://117westspirits.com/',
        'https://www.onpointdistillery.com/',
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        # Collect every link on the page, plus the page itself
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        # Pull anything that looks like an email address out of the raw HTML
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}

        # One item is yielded per key, so each page is emitted twice;
        # a single yield would work just as well
        for key in dic.keys():
            yield {
                'email': dic['email'],
                'link': dic['link'],
            }

This yields the following output when run from the Anaconda shell with scrapy crawl email -o test.jl:

{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/"}
{"email": ["8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress", "bundle@3.2", "fetch@3.0", "bolt@2.3", "5oclock@11wells.com", "5oclock@11wells.com", "5oclock@11wells.com"], "link": "https://www.11wells.com"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=cart"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/home"}
{"email": ["8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress", "bundle@3.2", "fetch@3.0", "bolt@2.3", "5oclock@11wells.com", "5oclock@11wells.com", "5oclock@11wells.com"], "link": "https://www.11wells.com"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/home"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/117%C2%B0-west-spirits-1"}
...
...
...
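
If you would rather run this spider from a script (as in the original question) instead of the shell, a minimal sketch along these lines should also work. The FEEDS setting is my assumption, not part of the answer above, and it requires Scrapy 2.1 or later.

from scrapy.crawler import CrawlerProcess

# Run the same MailSpider from a script and write items to test.jl,
# mirroring the `scrapy crawl email -o test.jl` command above.
# FEEDS is available in Scrapy 2.1+; older versions use FEED_URI/FEED_FORMAT.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'test.jl': {'format': 'jsonlines'}},
})
process.crawl(MailSpider)
process.start()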

Refer to the Scrapy docs for more information.
