Accessing the items yielded by my spider when running Scrapy from a script

Question

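I am calling a Scrapy spider from a Python script, and I want to access the items the spider yields from within that script.

The script works fine: the spider is called and yields the correct items. What I cannot figure out is how to get at those items from the script.

Here is the code of the script: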

import json

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class UASpider(scrapy.Spider):
    name = 'uaspider'
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes the request's User-Agent header back as JSON
        payload = json.loads(response.body.decode(response.encoding))
        yield {'ua': payload}

def main():
    process = CrawlerProcess(get_project_settings())
    process.crawl(UASpider)
    process.start()  # the script will block here until the crawling is finished

if __name__ == '__main__':
    main()

Here is part of the log, showing that the spider works and yields the item:

2020-02-18 20:44:10 [scrapy.core.engine] INFO: Spider opened
2020-02-18 20:44:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-18 20:44:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-18 20:44:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/user-agent> (referer: None)
2020-02-18 20:44:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://httpbin.org/user-agent>
{'ua': {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}}
2020-02-18 20:44:10 [scrapy.core.engine] INFO: Closing spider (finished)

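Many thanks for your help!

One option I can think of is to create a pipeline that stores the items, and then read the items back from that storage.

For this I would need to configure the pipeline in the script (rather than in the project settings).
Ideally the items would also be stored in a variable rather than in a file (I am doing this for automated tests, so speed matters).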

Answer 1

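I managed to get this working by following @Gallaecio's suggestion, thanks!

This solution uses a pipeline that stores the value in a global variable. The settings are read from the Scrapy project, and the extra pipeline is added in the script so that the project-wide settings do not have to change.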
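
One detail matters for the global-variable approach: in Python, assigning to a name inside a method binds a local name unless that name is declared `global`. The sketch below illustrates this independently of Scrapy. The two pipeline class names and the sample item are made up for illustration; only the `process_item(self, item, spider)` shape follows the pipeline interface used in this answer.

```python
# Module-level variable that the pipeline is meant to fill in.
user_agent = ''


class LocalAssignPipeline:
    """Assigns without `global`, so only a local name is bound."""

    def process_item(self, item, spider):
        user_agent = item['ua']['user-agent']  # local; module variable untouched
        return item


class GlobalAssignPipeline:
    """Declares the name `global`, so the module-level variable is rebound."""

    def process_item(self, item, spider):
        global user_agent
        user_agent = item['ua']['user-agent']
        return item


item = {'ua': {'user-agent': 'example-agent/1.0'}}  # made-up sample item

LocalAssignPipeline().process_item(item, spider=None)
after_local = user_agent   # still the empty string

GlobalAssignPipeline().process_item(item, spider=None)
after_global = user_agent  # now holds the value taken from the item
```

Only the second variant makes the value visible to the rest of the script after the crawl has finished.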

Here is the code that makes it work:

import json

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

user_agent = ''

class UASpider(scrapy.Spider):
    name = 'uaspider'
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        payload = json.loads(response.body.decode(response.encoding))
        yield {'ua': payload}

class TempStoragePipeline:
    def process_item(self, item, spider):
        global user_agent  # rebind the module-level variable, not a local one
        user_agent = item.get('ua').get('user-agent')
        return item

def main():
    settings = get_project_settings()
    # Register the pipeline from the script so the project settings stay unchanged
    settings.set('ITEM_PIPELINES', {
        '__main__.TempStoragePipeline': 100
    })

    process = CrawlerProcess(settings)  # pass the modified settings object
    process.crawl(UASpider)
    process.start()  # the script will block here until the crawling is finished

if __name__ == '__main__':
    main()
    print(f'>>> {user_agent}')  # print only after the crawl has finished
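
As a possible alternative to a pipeline, the items can also be collected by connecting a handler to Scrapy's `item_scraped` signal. This is a different technique from the one in the answer above, not what the author used. The following is a minimal sketch under some assumptions: the helper names `make_collector` and `run_and_collect` are invented for illustration, and the Scrapy imports sit inside the function only so that the collector itself can be tried without Scrapy installed.

```python
import json


def make_collector():
    """Return (items, handler); the handler appends each scraped item to items."""
    items = []

    def on_item_scraped(item, response, spider):
        items.append(item)

    return items, on_item_scraped


def run_and_collect():
    """Run the spider once and return the list of items it yielded."""
    # Imported here so make_collector() can be exercised without Scrapy.
    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    class UASpider(scrapy.Spider):
        name = 'uaspider'
        start_urls = ['http://httpbin.org/user-agent']

        def parse(self, response):
            yield {'ua': json.loads(response.text)}

    items, on_item_scraped = make_collector()
    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(UASpider)
    # The local name on_item_scraped keeps the handler alive during the crawl.
    crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl has finished
    return items
```

Calling `run_and_collect()` from a script or test returns every item in an in-memory list, without touching `ITEM_PIPELINES` and without a global variable. Because `process.start()` runs the Twisted reactor, which cannot be restarted, this should be treated as a once-per-process call.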
