
Scraping multiple websites of different domains using python scrapy

I want to scrape two different websites' emails together with their corresponding links, but I get two different emails paired with the same link. Actually there are various websites to scrape, but for simplicity I have used only two URLs. The code is given below:

import scrapy
import re
import time

urls = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']

class TheknotSpider(scrapy.Spider):
    name = 'theknot'

    def start_requests(self):
        global url
        li = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']
        for url in range(len(li)):
            yield scrapy.Request(li[url], callback=self.parse)

    def parse(self, response):
        mail_link = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        html_text = response.xpath("//body").get()
        mail_without_link = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", html_text)
        contact = response.xpath("//a[contains(@href,'/contact')]/@href").get()
        if mail_link:
            x = mail_link[7:]
            yield {
                "Main Website": urls[url],
                "Email": x
            }
            time.sleep(1)
        elif mail_without_link:
            yield {
                "Main Website": urls[url],
                "Email": mail_without_link
            }
            time.sleep(1)
        elif contact:
            if contact == "/contact" or contact == "/contact/" or contact == "/contact-us" or contact == "/contact-us/":
                contact_main = urls[url] + contact
                yield scrapy.Request(contact_main, callback=self.parse_final)
            else:
                yield scrapy.Request(contact, callback=self.parse_final)
        else:
            pass

    def parse_final(self, response):
        mail = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        text = response.xpath("//body").get()
        mail_text = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
        if mail:
            x = mail[7:]
            yield {
                "Main Website": urls[url],
                "Email": x
            }
            time.sleep(1)
        elif mail_text:
            yield {
                "Main Website": urls[url],
                "Email": mail_text
            }
            time.sleep(1)
        else:
            pass

The problem in your code is that Scrapy does not execute synchronously. This means that your global url variable is not what you think it is: start_requests will (likely) run to completion before any of the parse callbacks are invoked, so url will be 1 inside all of your parse methods.
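This timing issue can be demonstrated without Scrapy at all. A minimal sketch (not the original spider): callbacks are scheduled inside a loop but only run after the loop has finished, so every one of them reads the loop variable's final value.

```python
# Minimal sketch of the scheduling problem: the callbacks run only
# after the loop has finished, just like Scrapy's parse callbacks
# run only after start_requests has completed.
li = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']

callbacks = []
for url in range(len(li)):
    # Scheduled now, executed later.
    callbacks.append(lambda: url)

# By the time any callback runs, url == len(li) - 1.
results = [cb() for cb in callbacks]
print(results)  # [1, 1]
```

Every callback reports index 1, which is why both emails end up attached to the same "Main Website" entry.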

Rather than using the global variable, access the URL of the response with response.url, and if you need the base URL, use urllib.parse (or similar) to derive it.
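A hedged sketch of what that could look like (the base_url helper is illustrative, not part of the original code): urlparse rebuilds the site's base URL from the page the response actually came from, and urljoin resolves relative contact links against it, so no global index is needed.

```python
from urllib.parse import urlparse, urljoin

def base_url(url):
    # Rebuild "scheme://netloc" from a full URL,
    # e.g. for the "Main Website" field in the yielded items.
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

# Inside parse(), response.url identifies the page this response came from:
#     yield {"Main Website": base_url(response.url), "Email": x}
# and relative links resolve safely whatever their form:
#     yield scrapy.Request(urljoin(response.url, contact), callback=self.parse_final)

print(base_url("http://www.manorhouseohio.com/contact/"))
print(urljoin("http://www.OtterCreekAL.com/about", "/contact-us/"))
```

Because each callback derives the site from its own response, the two spellings of the contact path ("/contact" vs. an absolute URL) no longer need separate branches.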
