
Scrapy | How to make a request and get all links

I have a function that gets all the links on the first page.
How can I make another function that sends a request for each link in that list and gets all the links from each second-page response?

import scrapy

# module-level list used to deduplicate the collected links
list = []

class SuperSpider(scrapy.Spider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        # collect the href of every <a> tag on the first page
        links = response.xpath('//a/@href').extract()

        for link in links:
            link = str(link).strip()
            if link not in list:
                list.append(link)

Your use case is a perfect fit for Scrapy's CrawlSpider. Note that the allowed_domains attribute is very important here, as it defines which domains will be crawled. If you remove it, your spider will go crazy crawling every link it finds on each page. See the sample below.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NytimesSpider(CrawlSpider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    # a browser-like user agent so the site does not reject the requests
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }

    # an empty `allow` pattern matches every link; follow=True tells the
    # spider to keep extracting and following links from each crawled page
    rules = (
        Rule(LinkExtractor(allow=r''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # yield the URL of every page the spider visits
        yield {
            "url": response.url
        }
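
If you would rather keep a plain scrapy.Spider and chain the requests yourself, as in the original question, a minimal sketch could look like the one below. The parse_second_page callback name and the seen set are illustrative, not part of the answer above; Scrapy's built-in duplicate filter and offsite middleware would also handle much of this for you.

import scrapy


class SuperSpider(scrapy.Spider):
    name = 'nytimes_follow'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # keep track of links already scheduled to avoid duplicates
        self.seen = set()

    def parse(self, response):
        # first page: extract every link and schedule a request for it
        for link in response.xpath('//a/@href').getall():
            link = link.strip()
            if link and link not in self.seen:
                self.seen.add(link)
                # response.follow resolves relative URLs and sends the
                # request to the second-level callback
                yield response.follow(link, callback=self.parse_second_page)

    def parse_second_page(self, response):
        # second page: collect all links found in that response
        yield {
            'url': response.url,
            'links': response.xpath('//a/@href').getall(),
        }

Either spider can be run with the standard Scrapy CLI, for example scrapy crawl nytimes -o links.json, to write the yielded items to a file.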
