Scraping multiple websites on different domains with Python Scrapy

Question

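I want to scrape email addresses and their corresponding links from two different websites, but I get two different emails with the same link. There are actually many websites to scrape, but for simplicity I am using only two URLs. The code is as follows: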

import scrapy
import re
import time
urls = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']
class TheknotSpider(scrapy.Spider):
    name = 'theknot'
    def start_requests(self):
        global url
        li = ['http://www.manorhouseohio.com','http://www.OtterCreekAL.com']
        for url in range(len(li)):
            yield scrapy.Request(li[url], callback=self.parse)
    def parse(self, response):
        mail_link = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        html_text = response.xpath("//body").get()
        mail_without_link = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", html_text)
        contact = response.xpath("//a[contains(@href,'/contact')]/@href").get()
        if mail_link:
            x = mail_link[7:]
            yield {
            "Main Website": urls[url],
            "Email": x
            }
            time.sleep(1)
        elif mail_without_link:
            yield {
            "Main Website": urls[url],
            "Email": mail_without_link
            }
            time.sleep(1)
        elif contact:
                if contact=="/contact" or contact=="/contact/" or contact=="/contact-us" or contact=="/contact-us/":
                    contact_main = urls[url] + contact
                    yield scrapy.Request(contact_main, callback=self.parse_final)       
                else:
                    yield scrapy.Request(contact, callback=self.parse_final)
        else:
            pass
    def parse_final(self, response):
        mail = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        text = response.xpath("//body").get()
        mail_text = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
        if mail:
            x = mail[7:]
            yield {
            "Main Website": urls[url],
            "Email": x
            }
            time.sleep(1)
        elif mail_text:
            yield {
            "Main Website": urls[url],
            "Email": mail_text
            }
            time.sleep(1)
        else: pass

Answer 1

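The problem in your code is that Scrapy does not execute synchronously. This means your global url variable is not what you think it is. start_requests will (probably) run to completion before any parse function is called, so url will be 1 in every parse method, and urls[url] will always resolve to the second site.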
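The ordering problem can be reproduced without Scrapy. The sketch below is only a simplified model, assuming all start requests are generated before any callback runs; Scrapy's real scheduler is asynchronous and more involved. The names `pending` and `results` are made up for this illustration, and plain strings stand in for requests and responses.

```python
# Simplified stand-in for the crawl: requests are produced first,
# and the callbacks run only afterwards.
li = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']

def start_requests():
    global url
    for url in range(len(li)):
        yield li[url]  # stands in for scrapy.Request(li[url], callback=parse)

def parse(response_url):
    # The global is read when the callback runs, not when the request was created.
    return {"Main Website": li[url], "Actual page": response_url}

pending = list(start_requests())       # the loop finishes here, leaving url == 1
results = [parse(u) for u in pending]  # every callback now sees url == 1

for item in results:
    print(item)
```

Both items report the second site as "Main Website", even though the first one came from a different page, which matches the symptom described in the question.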

Instead of using a global variable, use response.url to access the URL of the response, and if you need the base URL, parse it with urllib.parse.
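A minimal sketch of that suggestion, using only the standard library. The helper names `base_url` and `absolute_link` are hypothetical and not part of Scrapy; inside a callback you would pass `response.url` as `page_url`.

```python
from urllib.parse import urljoin, urlsplit

def base_url(page_url):
    """Return scheme and host of a URL, e.g. for the "Main Website" field."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"

def absolute_link(page_url, href):
    """Resolve a possibly relative href (such as '/contact') against the page URL."""
    return urljoin(page_url, href)

# Stand-in for response.url inside parse_final, where the response is the contact page.
page = "http://www.manorhouseohio.com/contact-us/"
print(base_url(page))
print(absolute_link("http://www.manorhouseohio.com/", "/contact"))
```

Because `urljoin` handles both relative and absolute hrefs, the chain of `contact == "/contact" or ...` comparisons in the question would no longer be needed. Scrapy also provides `response.urljoin(href)`, which resolves a link against the response's own URL. Note that `response.url` is the URL of the page actually fetched, so after a redirect it may differ from the URL in the original list.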
