
Access items yielded by my spider when running Scrapy from script

I am calling a Scrapy spider from a Python script. I would like to access the items that the spider yields from within my script, but I have no idea how to do it.

The script works fine: the spider is called and it yields the right items, but I don't know how to access those items from my script.

This is the code for the script:

import json

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class UASpider(scrapy.Spider):
    name = 'uaspider'
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # Decode the JSON body and yield it as an item.
        payload = json.loads(response.body.decode(response.encoding))
        yield {'ua': payload}


def main():
    process = CrawlerProcess(get_project_settings())
    process.crawl(UASpider)
    process.start()  # the script will block here until the crawling is finished


if __name__ == '__main__':
    main()

And this is the part of the log showing that the spider works fine and yields the items:

2020-02-18 20:44:10 [scrapy.core.engine] INFO: Spider opened
2020-02-18 20:44:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-18 20:44:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-18 20:44:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: None)
2020-02-18 20:44:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/user-agent>
{'ua': {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}}
2020-02-18 20:44:10 [scrapy.core.engine] INFO: Closing spider (finished)

Thanks very much for your help!!


One option I can think of would be to create a pipeline that stores the items and then to access them from that storage:

  • For that to work, the pipeline would need to be configured within the script (not in the project settings).
  • Also, it would be ideal to store the items in a variable rather than in files (I'm doing this to automate tests, and speed is important); a signal-based sketch that meets both points follows below.
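
For completeness, Scrapy's item_scraped signal offers another way to meet both points: connect a handler before starting the crawl and append each scraped item to an in-memory list, without touching the project pipelines at all. This is only a minimal sketch, reusing the UASpider class from above; the collect_items handler and the items list are illustrative names of mine, not Scrapy API:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

items = []  # filled by the signal handler below

def collect_items(item, response, spider):
    # Receives every item the spider yields, as it is scraped.
    items.append(item)

def main():
    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(UASpider)  # UASpider as defined above
    crawler.signals.connect(collect_items, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks here; once it returns, items holds every scraped item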

I have managed to make this work following @Gallaecio's suggestion, thanks!

This solution uses a pipeline that stores the value in a global variable. The settings are read from the Scrapy project, and the extra pipeline is added in the script so that the project-wide settings stay unchanged.

Here is the code that makes it work:

import json

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

user_agent = ''


class UASpider(scrapy.Spider):
    name = 'uaspider'
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        payload = json.loads(response.body.decode(response.encoding))
        yield {'ua': payload}


class TempStoragePipeline(object):
    def process_item(self, item, spider):
        # Write to the module-level variable; without the global
        # statement this assignment would only create a local one.
        global user_agent
        user_agent = item.get('ua').get('user-agent')
        return item


def main():
    settings = get_project_settings()
    settings.set('ITEM_PIPELINES', {
        '__main__.TempStoragePipeline': 100,
    })

    # Pass the modified settings here (not get_project_settings() again),
    # otherwise the extra pipeline is never registered.
    process = CrawlerProcess(settings)
    process.crawl(UASpider)
    process.start()  # the script will block here until the crawling is finished


if __name__ == '__main__':
    main()
    print(f'>>> {user_agent}')
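
Note that this works because process.start() blocks until the crawl is finished: by the time main() returns, the pipeline has already run for every scraped item, so it is safe to read the user_agent global afterwards, which is why the print comes after main().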
