
How to catch scrapy response from spider to pipeline?

I need the full scrapy response, with the settings, pipelines, URLs and everything else, inside the pipeline where I create model objects. Is there any way of catching it?

pipeline.py


import datetime

from celery.utils.log import get_task_logger  # assuming Celery's task logger

from .models import Mail, User  # assumed app-local imports


class ScraperPipeline(object):
    def process_item(self, item, spider):
        logger = get_task_logger("logs")
        logger.info("Pipeline activated.")
        # get_or_create returns an (object, created) tuple
        user, _ = User.objects.get_or_create(id=item['id'],
                                             user=item['user'])
        Mail.objects.create(user=user, text=item['text'],
                            date=datetime.date.today())
        logger.info("Pipeline deactivated.")
spider.py

import scrapy
from scrapy.spiders import CrawlSpider

from .items import MailItem  # assumed project import


class Spider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['xxx.com']

    def start_requests(self):
        urls = [
            'https://xxx.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                 headers={'User-Agent':
                                              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                                              'like Gecko) Chrome/107.0.0.0 Safari/537.36'})

    def parse(self, response):
        for row in response.xpath('xpath here'):
            # build a fresh item per row; the trailing commas in the original
            # turned each field into a one-element tuple
            item = MailItem()
            item['id'] = row.xpath('td[1]//text()').extract_first()
            item['user'] = row.xpath('td[2]//text()').extract_first()
            item['text'] = row.xpath('td[3]//text()').extract_first()
            yield item

I've tried to get the response from the pipeline, but all I have there is the item, and the fields on the created object are not enough for me. Any ideas?

You can pass the full response along with the item in your callback methods if you need access to the response or request in your pipeline.

For example:

class SpiderClass(scrapy.Spider):
    ...
    ...

    def parse(self, response):
        for i in response.xpath(...):
            field1 = ...
            yield {'field1': field1, 'response': response}

Then in your pipeline you will have access to the response as a field of the item in the process_item method. You can also access the settings from this method by using the crawler attribute of the spider argument.

For example:

class MyPipeline:

    def process_item(self, item, spider):
        response = item['response']
        request = response.request
        settings = spider.crawler.settings  # e.g. settings.get('USER_AGENT')
        ...  # do something with the response, request or settings
        del item['response']  # drop the response before the item is exported
        return item

Then you just need to activate the pipeline in your settings.
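A minimal sketch of that last step, assuming the pipeline class above lives in a module myproject/pipelines.py (the dotted path is a placeholder for your own project layout):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}

The integer is the pipeline's order value (0-1000); items pass through the enabled pipelines in ascending order of this value.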
