
Scrapy | How to make a request and get all links

I have a function that gets all the links on the first page.
How can I write another function that makes a request for each link in that list and extracts all the links from each second-page response?

import scrapy

all_links = []  # links collected from the first page (renamed from `list` to avoid shadowing the builtin)

class SuperSpider(scrapy.Spider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']
 
    def parse(self, response):
        # Collect the href attribute of every <a> element on the first page
        links = response.xpath('//a/@href').extract()

        for link in links:
            link = str(link).strip()
            if link not in all_links:
                all_links.append(link)
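For reference, the follow-up request for each link is usually made by yielding response.follow() from parse with a second callback. The sketch below is illustrative only; the spider name, class name, and the callback parse_second_page are not from the original code:

import scrapy


class TwoLevelSpider(scrapy.Spider):
    name = 'nytimes_two_level'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        # First page: schedule a request for every link found
        for link in response.xpath('//a/@href').getall():
            link = link.strip()
            if link:
                yield response.follow(link, callback=self.parse_second_page)

    def parse_second_page(self, response):
        # Second page: extract all links from the followed page's response
        for link in response.xpath('//a/@href').getall():
            yield {'source': response.url, 'link': link.strip()}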

Your use case is a perfect fit for a Scrapy CrawlSpider. Note that the allowed_domains setting is very important here, as it defines the domains that will be crawled. If you remove it, your spider will go crazy crawling every link it finds on each page. See the sample below.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NytimesSpider(CrawlSpider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    custom_settings = {
        # Send a browser-like User-Agent; the default Scrapy one is often blocked
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }

    rules = (
        # Follow every link found (allow=r'' matches all URLs) and pass each
        # crawled page to parse_item
        Rule(LinkExtractor(allow=r''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Yield the URL of every page the crawler visits
        yield {
            "url": response.url
        }
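With allowed_domains set, Scrapy's offsite middleware drops any extracted link that points outside nytimes.com, which is what keeps the crawl contained. Assuming the spider lives inside a Scrapy project, running something like scrapy crawl nytimes -o links.json writes every URL yielded by parse_item to a JSON file; the allow=r'' pattern, which matches every URL, can be narrowed to a regular expression if only certain paths should be followed.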
