
Scrapy | How to make a request and get all links

I have a function that gets all the links on the first page.
How can I make another function that sends a request for each link in that list and gets all the links from each second-page response?

import scrapy

# module-level list used to deduplicate the collected links
list = []

class SuperSpider(scrapy.Spider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        # collect the href of every <a> tag on the first page
        links = response.xpath('//a/@href').extract()

        for link in links:
            link = str(link).strip()
            if link not in list:
                list.append(link)

Your use case is a perfect fit for Scrapy's CrawlSpider. Note that the allowed_domains attribute is very important here, as it defines which domains will be crawled. If you remove it, your spider will go crazy crawling every link it finds on each page. See the sample below.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NytimesSpider(CrawlSpider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    # a browser-like user agent so the site does not reject the requests
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }

    # an empty `allow` pattern matches every link; follow=True tells the
    # spider to keep extracting and following links from each crawled page
    rules = (
        Rule(LinkExtractor(allow=r''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # yield the URL of every page the spider visits
        yield {
            "url": response.url
        }
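
If you would rather keep a plain scrapy.Spider and chain the requests yourself, as in the original question, a minimal sketch could look like the one below. The parse_second_page callback name and the seen set are illustrative, not part of the answer above; Scrapy's built-in duplicate filter and offsite middleware would also handle much of this for you.

import scrapy


class SuperSpider(scrapy.Spider):
    name = 'nytimes_follow'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # keep track of links already scheduled to avoid duplicates
        self.seen = set()

    def parse(self, response):
        # first page: extract every link and schedule a request for it
        for link in response.xpath('//a/@href').getall():
            link = link.strip()
            if link and link not in self.seen:
                self.seen.add(link)
                # response.follow resolves relative URLs and sends the
                # request to the second-level callback
                yield response.follow(link, callback=self.parse_second_page)

    def parse_second_page(self, response):
        # second page: collect all links found in that response
        yield {
            'url': response.url,
            'links': response.xpath('//a/@href').getall(),
        }

Either spider can be run with the standard Scrapy CLI, for example scrapy crawl nytimes -o links.json, to write the yielded items to a file.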
