
Process all yield items in Scrapy

Currently I have a Scrapy Spider yielding various items in the parse method. Is there any way to get all items that have been yielded, regardless of how many times the parse method has been called?
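For illustration, the setup described in the question might look like the following minimal spider (the spider name, URL, selectors and item fields are all hypothetical):

import scrapy


class MySpider(scrapy.Spider):
    # parse() runs once per response and yields several items each time;
    # the question asks how to get hold of every item yielded across all calls.
    name = "my_spider"
    start_urls = ["https://example.com/page/1"]  # placeholder URL

    def parse(self, response):
        for row in response.css("div.product"):  # placeholder selector
            yield {
                "title": row.css("h2::text").get(),
                "price": row.css("span.price::text").get(),
            }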

Using a pipeline you'll be able to accumulate all items in an array-like structure (in process_item of your pipeline):

self.items.append(item)  # self.items is a list attribute defined on your pipeline class

and process all of them in spider_closed (in a pipeline, the close_spider method).
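A minimal sketch of such a pipeline, assuming a plain list attribute and the standard close_spider hook (which runs when the spider_closed signal fires); the class name is a placeholder and the pipeline must be enabled through the ITEM_PIPELINES setting:

# pipelines.py -- collects every yielded item and processes the full list at the end
class CollectItemsPipeline:
    def open_spider(self, spider):
        self.items = []  # accumulates every item the spider yields

    def process_item(self, item, spider):
        self.items.append(item)  # called once per yielded item
        return item

    def close_spider(self, spider):
        # All items are available here, no matter how many times parse() was called.
        spider.logger.info("Collected %d items", len(self.items))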

I am unsure about what you mean by get the items. If you want to export them into a file, you can use the feed exports, by executing the spider like:

scrapy crawl my_spider -o my_data.csv

It supports other extensions, check the link for those.
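In recent Scrapy versions the same export can also be configured in settings.py through the FEEDS setting instead of the -o flag; the file names below are placeholders:

# settings.py -- sketch of configuring feed exports in the project settings
FEEDS = {
    "my_data.csv": {"format": "csv"},      # equivalent to: scrapy crawl my_spider -o my_data.csv
    # "my_data.json": {"format": "json"},  # other formats work the same way
}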

From your title it seems you want to process the yielded items; in that case you need an ItemPipeline. From the docs:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

...

Typical uses of item pipelines are:

  • cleansing HTML data
  • validating scraped data (checking that the items contain certain fields)
  • checking for duplicates (and dropping them)
  • storing the scraped item in a database

You can also see some pipeline examples here.
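As a sketch of the typical uses listed above (the id field is hypothetical), a pipeline that validates a required field and drops duplicates could look like this; like any pipeline, it is enabled via the ITEM_PIPELINES setting:

# pipelines.py -- validates scraped data and drops duplicate items
from scrapy.exceptions import DropItem


class ValidateAndDedupePipeline:
    def open_spider(self, spider):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if not item.get("id"):  # hypothetical required field
            raise DropItem(f"Missing id in {item!r}")
        if item["id"] in self.seen_ids:
            raise DropItem(f"Duplicate item: {item['id']}")
        self.seen_ids.add(item["id"])
        return item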

Both methods operate independently of how many times the parse method has been called.

There are generally two ways to do this.

Firstly, you can simply save the output to a JSON file using the command scrapy crawl my_spider -o my_data.json. Secondly, you can write a pipeline and store the output in any database, with whatever structure you want.
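For the second option, a minimal sketch of a pipeline that stores items in a SQLite database (the file name, table and columns are hypothetical):

# pipelines.py -- sketch of storing scraped items in SQLite
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("my_data.db")  # placeholder database file
        self.conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, price TEXT)")

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO items (title, price) VALUES (?, ?)",
            (item.get("title"), item.get("price")),
        )
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()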
