Scraping multiple pages with multiple start_urls

I want to scrape details from a JSON API using Scrapy. There are multiple start_urls, and each start_url has multiple pages to scrape. I just can't work out the logic of how to do this.

import scrapy
from scrapy.http import Request

BASE_URL = ["https://www.change.org/api-proxy/-/tags/animals-19/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/civic/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/human-rights-en-in/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/child-rights-2/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/health-9/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/environment-18/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/education-en-in/petitions?offset={}&limit=8&show_promoted_cards=true",
        "https://www.change.org/api-proxy/-/tags/women-s-rights-13/petitions?offset={}&limit=8&show_promoted_cards=true"
        ]

class ChangeSpider(scrapy.Spider):
    name = 'change'

    def start_requests(self):
        for i in range(len(BASE_URL)):
            yield Request(BASE_URL[i], callback = self.parse)

    pageNumber = 11

    def parse(self, response):
        data = response.json()
        for item in range(len(data['items'])):
            yield {
                "petition_id": data['items'][item]['petition']['id'],
            }

        next_page = "https://www.change.org/api-proxy/-/tags/animals-19/petitions?offset=" + str(ChangeSpider.pageNumber) + "&limit=8&show_promoted_cards=true"       
        if data['last_page'] == False:
            ChangeSpider.pageNumber += 1
            yield response.follow(next_page, callback=self.parse) 

Try it like this:

import scrapy
from scrapy.http import Request


class ChangeSpider(scrapy.Spider):
    name = 'change'

    # One tag slug per listing; the offset is filled in per request,
    # so each tag paginates independently instead of sharing one counter.
    URL_TEMPLATE = "https://www.change.org/api-proxy/-/tags/{}/petitions?offset={}&limit=8&show_promoted_cards=true"
    TAGS = ["animals-19", "civic", "human-rights-en-in", "child-rights-2",
            "health-9", "environment-18", "education-en-in", "women-s-rights-13"]
    PAGE_SIZE = 8

    def start_requests(self):
        for tag in self.TAGS:
            yield Request(self.URL_TEMPLATE.format(tag, 0),
                          callback=self.parse,
                          cb_kwargs={'tag': tag, 'offset': 0})

    def parse(self, response, tag, offset):
        data = response.json()
        for item in data['items']:
            yield {"petition_id": item['petition']['id']}

        # Keep following pages for *this* tag until the API reports the last one.
        if not data['last_page']:
            next_offset = offset + self.PAGE_SIZE
            yield Request(self.URL_TEMPLATE.format(tag, next_offset),
                          callback=self.parse,
                          cb_kwargs={'tag': tag, 'offset': next_offset})

Note the two fixes over your version: the class-level `pageNumber` was shared by every tag, and the `next_page` URL was hardcoded to `animals-19`, so the other seven listings never paginated. Passing the tag and offset along with each request via `cb_kwargs` keeps the pagination state per URL.
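If you prefer not to carry the offset through `cb_kwargs`, each response can also derive its next URL from `response.url` itself. A minimal sketch of that idea, testable without running the spider (the `next_page_url` helper is my own name, not part of Scrapy):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def next_page_url(current_url, step=8):
    """Return current_url with its 'offset' query parameter advanced by step."""
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    offset = int(query.get('offset', ['0'])[0])
    query['offset'] = [str(offset + step)]
    # Rebuild the URL with the updated query string, leaving path and tag intact.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

Inside `parse` you would then `yield response.follow(next_page_url(response.url), callback=self.parse)`, and no shared counter is needed.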
