How to get the request object in a Scrapy pipeline

I know that by the time the pipelines are called, the request has already finished, and that a pipeline normally just validates or persists the extracted item, so at first glance there seems to be no reason to access the request in a pipeline.

However, I have found that it can be useful in certain situations. In my application I use two pipelines: FilesPipeline and MysqlStorePipeline.

When an item is extracted, the FilesPipeline tries to send a request to fetch the item's image, and the data is saved to the db once that completes.

At the same time, I use a downloader middleware, RandomProxy, which picks a random proxy record from the database and sets it in the request meta. But a proxy is not guaranteed to work all the time.

So the following may happen:

When the item's page is requested, a proxy, say http://proxy1, is used, but it does not work. Thanks to the retry middleware, Scrapy tries again with another proxy, http://proxy2, fetched from the db. If that one works, an item is generated. The FilesPipeline then tries to download the image for the item by sending an image request, which is assigned yet another proxy, say http://proxy3. If proxy3 does not work, Scrapy retries this request too, but there is a chance of getting a bad proxy on every retry. In that case the item is dropped, because no image was fetched for it and the image field MUST NOT be empty.

Furthermore, the image request does not carry a Referer header, so it may sometimes be blocked by the server.

So I wonder whether the original request used to extract an item can be accessed from the pipeline.

Is this possible, or is there another suggestion?

Here are two approaches:

  1. Add a dummy field to the item and store whatever you need in it from the spider code. Later, retrieve the value (and pop the field) in the item pipeline.

  2. Instead of using an item pipeline, use a spider middleware. In its process_spider_output method you can access both the response and the spider output.
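
The two approaches can also be combined. The following is a minimal sketch, not a definitive implementation: a spider middleware copies the URL and proxy of the request that actually succeeded into a dummy field on each item, and a later item pipeline pops that field again. The names `ORIGIN_FIELD`, `AttachOriginMiddleware` and `PopOriginPipeline` are hypothetical, and the sketch assumes that items are dicts or `scrapy.Item` objects and that the proxy middleware stores the proxy under `meta["proxy"]`.

```python
import logging
from collections.abc import MutableMapping

logger = logging.getLogger(__name__)

# Hypothetical name for the dummy field; it is not part of Scrapy.
# A scrapy.Item subclass would have to declare this field explicitly.
ORIGIN_FIELD = "_origin"


class AttachOriginMiddleware:
    """Spider middleware sketch: record where each item came from."""

    def process_spider_output(self, response, result, spider):
        origin = {
            "url": response.url,
            # The proxy stored on the request that finally succeeded,
            # or None if no proxy was set (assumed meta key: "proxy").
            "proxy": response.meta.get("proxy"),
        }
        for obj in result:
            # Only mapping-like items (dict, scrapy.Item) are annotated;
            # Request objects and anything else pass through unchanged.
            if isinstance(obj, MutableMapping):
                obj[ORIGIN_FIELD] = dict(origin)
            yield obj


class PopOriginPipeline:
    """Item pipeline sketch: read the dummy field, then remove it."""

    def process_item(self, item, spider):
        origin = item.pop(ORIGIN_FIELD, None)
        if origin is not None:
            logger.debug("item from %s via proxy %s",
                         origin["url"], origin["proxy"])
        return item
```

These classes would be enabled through the `SPIDER_MIDDLEWARES` and `ITEM_PIPELINES` settings, with the popping pipeline ordered after any pipeline that still needs the field. For the image case in the question, a FilesPipeline subclass could read the field in its `get_media_requests` method and build the image requests with the stored URL as the Referer header and the stored proxy in the request meta. That only helps if RandomProxy leaves an already-set proxy untouched, which is an assumption about that middleware.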
