How to scrape two different domains using Scrapy?

Hi, I want to scrape two different domains in one script. I tried using an if statement, but it doesn't seem to work. Any ideas?

Here is my code:

class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com']
    start_urls = [
         'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=9999',
         'https://www.forever21.com/us/shop/catalog/category/f21/sale',
     ]

    def parse_start_url(response):
        if (response.url == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=9999'):
            parse_1(response)
        if (response.url == 'https://www.forever21.com/us/shop/catalog/category/f21/sale'):
            parse_2(response)

    def parse_1(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'title': product_item.css('h3.item-heading a.link::text').extract_first(),
                'regular-price': product_item.css('strong.item-price span.price.regular::text').extract_first(),
                'sale-price': product_item.css('strong.item-price span.price.sale::text').extract_first(),
                'photo-url': product_item.css('.image-container img::attr(data-src)').extract_first(),
                'description-url': "https://www2.hm.com/" + product_item.css('h3.item-heading a::attr(href)').extract_first(),
            }
            yield item

    def parse_2(self, response):
        #Some code getting item on domain 2

Please help, thank you.

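Check your allowed_domains variable. You should either add the new domain, e.g. ['www2.hm.com', 'forever21.com'], or remove the variable entirely. Your spider also has no parse method, which is the default callback Scrapy uses for start_urls.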
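To see why the second domain needs its own entry, here is a rough sketch of the documented matching rule: a host passes if it equals an entry in allowed_domains or is a subdomain of one, and an empty list allows everything. This is a plain-Python approximation for illustration, not Scrapy's actual implementation, and is_allowed is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Approximate the allowed_domains rule (illustrative only)."""
    # An empty or missing list places no restriction on hosts.
    if not allowed_domains:
        return True
    host = (urlparse(url).hostname or "").lower()
    # A host matches an entry exactly or as one of its subdomains.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


f21 = "https://www.forever21.com/us/shop/catalog/category/f21/sale"
print(is_allowed(f21, ["www2.hm.com"]))                   # False
print(is_allowed(f21, ["www2.hm.com", "forever21.com"]))  # True
print(is_allowed(f21, []))                                # True
```

Note that 'forever21.com' covers www.forever21.com, while 'www2.hm.com' only covers that exact host and anything beneath it.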

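I would suggest dropping start_urls and the if checks and using start_requests instead, so that each URL is paired with its own callback. Your code will be more readable: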

import scrapy


class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'forever21.com']

    def start_requests(self):
        urls = (
            (self.parse_1, 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=9999'),
            (self.parse_2, 'https://www.forever21.com/us/shop/catalog/category/f21/sale'),
        )
        for cb, url in urls:
            yield scrapy.Request(url, callback=cb)

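    def parse_1(self, response):
        # Placeholder: handle the www2.hm.com response here.
        print(111111111)

    def parse_2(self, response):
        # Placeholder: handle the forever21.com response here.
        print(2222222222)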
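If you would rather keep a single dispatching callback, one likely reason the if version in the question produces nothing is that a callback containing yield is a generator: calling it and discarding the result runs none of its body. The sketch below shows this without Scrapy. The functions parse_hm, parse_f21, dispatch_broken and dispatch_fixed are hypothetical stand-ins that take a URL instead of a response, and matching on the host rather than the full URL is my assumption about what is more robust if the URL changes after a redirect.

```python
from urllib.parse import urlparse


def parse_hm(url):
    # Stand-in for a Scrapy callback: a generator that yields items.
    yield {"source": "hm", "url": url}


def parse_f21(url):
    yield {"source": "forever21", "url": url}


def dispatch_broken(url):
    # Mirrors the question: the generator object is created but never
    # iterated, so no item is produced and this function returns None.
    if urlparse(url).hostname == "www2.hm.com":
        parse_hm(url)


def dispatch_fixed(url):
    # Branch on the host and delegate with "yield from" so the items
    # produced by the chosen parser are passed on to the caller.
    host = urlparse(url).hostname or ""
    if host == "www2.hm.com":
        yield from parse_hm(url)
    elif host == "forever21.com" or host.endswith(".forever21.com"):
        yield from parse_f21(url)


hm_url = "https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html"
print(dispatch_broken(hm_url))       # None
print(list(dispatch_fixed(hm_url)))  # one item from parse_hm
```

In a real spider the same idea would read `yield from self.parse_1(response)` inside the dispatching callback, though the start_requests version above avoids the branching altogether.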
