
How to catch scrapy response from spider to pipeline?

I need the full scrapy response, with the settings, pipelines, URLs and everything else, inside the pipeline where I create model objects. Is there any way of catching it?

pipeline.py


import datetime

from celery.utils.log import get_task_logger  # assuming Celery's task logger

from .models import Mail, User  # assumed app-local imports


class ScraperPipeline(object):
    def process_item(self, item, spider):
        logger = get_task_logger("logs")
        logger.info("Pipeline activated.")
        # get_or_create returns an (object, created) tuple
        user, _ = User.objects.get_or_create(id=item['id'],
                                             user=item['user'])
        Mail.objects.create(user=user, text=item['text'],
                            date=datetime.date.today())
        logger.info("Pipeline deactivated.")
spider.py

import scrapy
from scrapy.spiders import CrawlSpider

from .items import MailItem  # assumed project import


class Spider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['xxx.com']

    def start_requests(self):
        urls = [
            'https://xxx.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                 headers={'User-Agent':
                                              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                                              'like Gecko) Chrome/107.0.0.0 Safari/537.36'})

    def parse(self, response):
        for row in response.xpath('xpath here'):
            # build a fresh item per row; the trailing commas in the original
            # turned each field into a one-element tuple
            item = MailItem()
            item['id'] = row.xpath('td[1]//text()').extract_first()
            item['user'] = row.xpath('td[2]//text()').extract_first()
            item['text'] = row.xpath('td[3]//text()').extract_first()
            yield item

I've tried to get the response from the pipeline, but all I have there is the item, and the fields on the created object are not enough for me. Any ideas?

You can pass the full response along with the item in your callback methods if you need access to the response or request in your pipeline.

For example:

class SpiderClass(scrapy.Spider):
    ...
    ...

    def parse(self, response):
        for i in response.xpath(...):
            field1 = ...
            yield {'field1': field1, 'response': response}

Then in your pipeline you will have access to the response as a field of the item in the process_item method. You can also access the settings from this method by using the crawler attribute of the spider argument.

For example:

class MyPipeline:

    def process_item(self, item, spider):
        response = item['response']
        request = response.request
        settings = spider.crawler.settings  # e.g. settings.get('USER_AGENT')
        ...  # do something with the response, request or settings
        del item['response']  # drop the response before the item is exported
        return item

Then you just need to activate the pipeline in your settings.
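A minimal sketch of that last step, assuming the pipeline class above lives in a module myproject/pipelines.py (the dotted path is a placeholder for your own project layout):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}

The integer is the pipeline's order value (0-1000); items pass through the enabled pipelines in ascending order of this value.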
