How to catch the Scrapy response from the spider in the pipeline?

I need the full Scrapy response, with settings, pipelines, URLs and everything else, inside the pipeline where I create my model objects. Is there any way of catching it?

pipeline.py


from datetime import date

from celery.utils.log import get_task_logger

# Mail and User are the project's Django models, imported from its models module.


class ScraperPipeline(object):
    def process_item(self, item, spider):
        logger = get_task_logger("logs")
        logger.info("Pipeline activated.")
        mail_id = item['id']
        user = item['user']
        text = item['text']
        # get_or_create() returns an (object, created) tuple, so unpack it.
        owner, _ = User.objects.get_or_create(id=mail_id, user=user)
        Mail.objects.create(user=owner, text=text, date=date.today())
        logger.info("Pipeline deactivated.")
        return item

spider.py
import scrapy
from scrapy.spiders import CrawlSpider

# MailItem is the project's Item subclass, imported from its items module.


class Spider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['xxx.com']

    def start_requests(self):
        # The URL needs a scheme; Scrapy rejects requests without one.
        urls = [
            'https://xxx.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                 headers={'User-Agent':
                                              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                                              'like Gecko) Chrome/107.0.0.0 Safari/537.36'})

    def parse(self, response):
        for row in response.xpath('xpath things'):
            # Create a fresh item per row; the original assigned to an
            # undefined name `ip`, and the trailing commas turned every
            # field into a one-element tuple.
            item = MailItem()
            item['id'] = row.xpath('td[1]//text()').extract_first()
            item['user'] = row.xpath('td[2]//text()').extract_first()
            item['text'] = row.xpath('td[3]//text()').extract_first()
            yield item

I've tried to access the response from the pipeline, but I only have the item. Also, what I get from the created object is not enough for me. Any ideas?

You can pass the full response along with the item in your callback methods if you need access to the response or request in your pipeline.

For example:

class SpiderClass(scrapy.Spider):
    ...
    ...

    def parse(self, response):
        for i in response.xpath(...):
            field1 = ...
            # Attach the response itself as an extra field of the item.
            yield {'field1': field1, 'response': response}

Then in your pipeline you will have access to the response as a field of the item in the process_item method. You can also access the settings from this method by using the crawler attribute of the spider argument.

For example:

class MyPipeline:

    def process_item(self, item, spider):
        response = item['response']
        request = response.request
        settings = spider.crawler.settings
        # ... do something with the response, request or settings ...
        # Drop the response before returning so it isn't exported with the item.
        del item['response']
        return item

Then you just need to activate the pipeline in your settings.
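
For example, in settings.py (the module path and priority value here are assumptions; use your own project's path):

ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}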
