
Use scrapy to get list of urls, and then scrape content inside those urls

I need a Scrapy spider to scrape the following page ( https://www.phidgets.com/?tier=1&catid=64&pcid=57 ) for each product URL (30 products, so 30 URLs), and then follow each of those URLs and scrape the data inside each product page.

I have the second part working exactly as I want:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        'https://www.phidgets.com/?tier=1&catid=64&pcid=57',
    ]

    def parse(self, response):
        for info in response.css('div.ph-product-container'):
            yield {
                'product_name': info.css('h2.ph-product-name::text').extract_first(),
                'product_image': info.css('div.ph-product-img-ctn a').xpath('@href').extract(),
                'sku': info.css('span.ph-pid').xpath('@prod-sku').extract_first(),
                'short_description': info.css('div.ph-product-summary::text').extract_first(),
                'price': info.css('h2.ph-product-price > span.price::text').extract_first(),
                'long_description': info.css('div#product_tab_1').extract_first(),
                'specs': info.css('div#product_tab_2').extract_first(),
            }

        # next_page = response.css('div.ph-summary-entry-ctn a::attr("href")').extract_first()
        # if next_page is not None:
        #     yield response.follow(next_page, self.parse)

But I don't know how to do the first part. As you will see, I have the main page ( https://www.phidgets.com/?tier=1&catid=64&pcid=57 ) set as the start_url. But how do I get it to populate the start_urls list with all 30 URLs I need crawled?

I am not able to test at this moment, so please let me know if this works for you so I can edit it should there be any bugs.

The idea here is that we find every product link on the first page and yield new scrapy requests, passing your product-parsing method as the callback:

import scrapy
from urllib.parse import urljoin

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        'https://www.phidgets.com/?tier=1&catid=64&pcid=57',
    ]

    def parse(self, response):
        # collect the relative href of every product link on the listing page
        products = response.xpath("//*[contains(@class, 'ph-summary-entry-ctn')]/a/@href").extract()
        for p in products:
            # hrefs are relative, so resolve them against the listing page URL
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # same field extraction as in the question, now applied to each product page
        for info in response.css('div.ph-product-container'):
            yield {
                'product_name': info.css('h2.ph-product-name::text').extract_first(),
                'product_image': info.css('div.ph-product-img-ctn a').xpath('@href').extract(),
                'sku': info.css('span.ph-pid').xpath('@prod-sku').extract_first(),
                'short_description': info.css('div.ph-product-summary::text').extract_first(),
                'price': info.css('h2.ph-product-price > span.price::text').extract_first(),
                'long_description': info.css('div#product_tab_1').extract_first(),
                'specs': info.css('div#product_tab_2').extract_first(),
            }
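The commented-out code in your question already hints at response.follow; if you are on Scrapy 1.4 or newer, it can replace the manual urljoin call, since it resolves relative hrefs itself. A minimal sketch of that variant of parse (untested here, same caveat as above):

    def parse(self, response):
        # response.follow builds an absolute URL from the relative href for us
        for href in response.xpath("//*[contains(@class, 'ph-summary-entry-ctn')]/a/@href").extract():
            yield response.follow(href, callback=self.parse_product)

Either way, you can run the spider and dump the scraped items with something like scrapy runspider your_spider_file.py -o products.json (the file name here is just a placeholder for wherever you saved the spider).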
