I am calling a Scrapy spider from a Python script. I would like to have access to the items that the spider yields from within my script, but I have no idea how to do it.
The script works fine: the spider is called and it yields the right items, but I don't know how to access those items from my script.
This is the code for the script:
import json

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class UASpider(scrapy.Spider):
    name = 'uaspider'
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        payload = json.loads(response.body.decode(response.encoding))
        yield {'ua': payload}


def main():
    process = CrawlerProcess(get_project_settings())
    process.crawl(UASpider)
    process.start()  # the script will block here until the crawling is finished


if __name__ == '__main__':
    main()
And this is the part of the log that shows the spider runs fine and yields the items:
2020-02-18 20:44:10 [scrapy.core.engine] INFO: Spider opened
2020-02-18 20:44:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-18 20:44:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-18 20:44:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: None)
2020-02-18 20:44:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/user-agent>
{'ua': {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}}
2020-02-18 20:44:10 [scrapy.core.engine] INFO: Closing spider (finished)
Thanks very much for your help!!
One option I can think of would be to create a pipeline that stores the item, and then access the items from that storage:
I have managed to make this work following @Gallaecio's suggestion, thanks!
This solution uses a pipeline that stores the value in a global variable. The settings are read from the Scrapy project, and the extra pipeline is added in the script so that the project-wide settings are left unchanged.
Here is the code that makes it work:
import json

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

user_agent = ''


class UASpider(scrapy.Spider):
    name = 'uaspider'
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        payload = json.loads(response.body.decode(response.encoding))
        yield {'ua': payload}


class TempStoragePipeline(object):
    def process_item(self, item, spider):
        global user_agent  # without this, the assignment only creates a local variable
        user_agent = item.get('ua').get('user-agent')
        return item


def main():
    settings = get_project_settings()
    settings.set('ITEM_PIPELINES', {
        '__main__.TempStoragePipeline': 100,
    })
    process = CrawlerProcess(settings)  # pass the modified settings, not a fresh copy
    process.crawl(UASpider)
    process.start()  # the script will block here until the crawling is finished


if __name__ == '__main__':
    main()
    print(f'>>> {user_agent}')  # the item is only available after the crawl finishes