
How to understand callback function in scrapy.Request?

I am reading Web Scraping with Python, 2nd Edition, and want to use the Scrapy module to crawl information from web pages.

I got the following information from the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html

callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
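For context, here is a minimal sketch (with hypothetical spider and handler names, not taken from the book) of what this describes: a Request whose callback receives the downloaded response as its first parameter, and whose errback is called on failure.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # The downloaded response for this URL is passed to self.parse_page;
        # if the request fails, self.handle_error is called instead.
        yield scrapy.Request(
            url='https://example.com',
            callback=self.parse_page,
            errback=self.handle_error,
        )

    def parse_page(self, response):
        self.logger.info('Downloaded %s', response.url)

    def handle_error(self, failure):
        self.logger.error('Request failed: %r', failure)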

My understanding is that:

  1. pass in the url and get resp, like we do with the requests module

    resp = requests.get(url)

  2. pass resp in for data parsing

    parse(resp)

The problem is:

  1. I don't see where resp is passed in
  2. Why do we need to put the self keyword before parse in the parameter list?
  3. The self keyword is never used in the parse function, so why bother putting it as the first parameter?
  4. Can we extract the url from the response parameter like this: url = response.url, or should it be url = self.url?
import scrapy


class ArticleSpider(scrapy.Spider):
    name='article'
    
    def start_requests(self):
        urls = [
        'http://en.wikipedia.org/wiki/Python_'
        '%28programming_language%29',
        'https://en.wikipedia.org/wiki/Functional_programming',
        'https://en.wikipedia.org/wiki/Monty_Python']

        return [scrapy.Request(url=url, callback=self.parse) for url in urls]
    

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))

You can find information about self here: https://docs.python.org/3/tutorial/classes.html


About this question:

can we extract the url from the response parameter like this: url = response.url, or should it be url = self.url

You should use response.url to get the URL of the page you are currently crawling/parsing.

It seems like you are missing a few concepts related to Python classes and OOP. It would be a good idea to read through the Python docs, or at the very least this question.

Here is how Scrapy works: you instantiate a Request object and yield it to the Scrapy scheduler.

yield scrapy.Request(url=url) #or use return like you did

Scrapy will handle the request, download the HTML, and pass everything it got back for that request to a callback function. If you didn't set a callback function in your request (like in my example above), it will call a default method called parse.
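For example, a spider that never sets callback= still gets its responses delivered to parse (a minimal sketch, with a hypothetical spider name):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        # No callback is given here, so Scrapy falls back to self.parse.
        yield scrapy.Request(url='https://quotes.toscrape.com')

    def parse(self, response):
        # Called automatically with the downloaded Response object.
        self.logger.info('Parsed %s', response.url)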

parse is a method (aka function) of your object. You wrote it in your code above, and even if you hadn't, it would still be there, since your class inherits all methods from its parent class:

class ArticleSpider(scrapy.Spider): # <<<<<<<< here
    name='article'
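You can see the same mechanics in plain Python: a method defined on the parent class is available on the child, and Python fills in self with the instance automatically (a simplified sketch, not Scrapy code):

class Spider:                       # stand-in for scrapy.Spider
    def parse(self, response):
        # 'self' is the instance the method is called on.
        print('default parse on', self.name, 'got', response)

class ArticleSpider(Spider):
    name = 'article'
    # No parse() defined here, yet ArticleSpider still has one: it is inherited.

spider = ArticleSpider()
spider.parse('fake response')       # Python passes 'spider' as 'self'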

So a TL;DR of your questions:

1 - You didn't see it because it happens in the parent class.

2 - You need to use self. so Python knows you are referencing a method of the spider instance.

3 - The self parameter is the instance itself, and it is used by Python.

4 - The response is an independent object that your parse method receives as an argument, so you can access its attributes like response.url or response.headers, as sketched below.
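A short sketch of that last point, accessing Response attributes inside parse (the selector shown is just an example):

def parse(self, response):
    # 'response' is the downloaded page for the request that triggered
    # this callback; its attributes describe that page, not the spider.
    url = response.url                               # page that was crawled
    status = response.status                         # HTTP status code
    content_type = response.headers.get('Content-Type')
    title = response.css('h1::text').extract_first()
    self.logger.info('%s (%s) -> %s', url, status, title)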
