Scrapy: Use Feed Exports after custom Item Pipeline without custom Feed Exporter class?
My Spider looks like this:
class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,},
        'FEEDS': {
            'feeds/example/tags.csv': {
                'format': 'csv',
                'fields': ["tag_id", "url", "title"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleTagItem],
                'overwrite': False
            },
            'feeds/example/galleries.csv': {
                'format': 'csv',
                'fields': ["id", "url", "tag_ids"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleGalleryItem],
                'overwrite': False,
            }
        }
    }
This is the img_clear.pipelines.DuplicatesPipeline:
class DuplicatesPipeline():
    def open_spider(self, spider):
        if spider.name == "example":
            with open("feeds/example/galleries.csv", "r") as rf:
                csv = rf.readlines()
            self.ids_seen = set([str(line.split(",")[0]) for line in csv])
            with open("feeds/example/tags.csv", "r") as rf:
                tags_csv = rf.readlines()
            self.tag_ids_seen = set([str(line.split(",")[0]) for line in tags_csv])

    def process_item(self, item, spider):
        if isinstance(item, ExampleTagItem):
            self.process_example_tag_item(item, spider)
        elif isinstance(item, ExampleGalleryItem):
            self.process_example_gallery_item(item, spider)

    def process_example_tag_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
            return item

    def process_example_gallery_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item
With the item pipeline activated, it drops some items (logging: [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {'tag_id': '4',...) and returns others (logging: [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/10232335/>), but nothing is written to the files. Somehow the returned items don't seem to reach the feed exports extension. What am I missing?
Without 'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,}, in the custom_settings, items are saved in the right CSV files.
scrapy crawl example -o test.csv also creates an empty CSV when the pipeline is activated.
So it seems that the issue is with the pipeline.
I have come across this same issue many times as well. It has something to do with the path used in the FEEDS section of your custom settings. There are a few things that you can try to fix the problem.
On Linux/Mac:
'FEEDS': {
    '/absolute/path/to/feeds/example/galleries.csv': {
        ...
    }
On Windows you can try it a few ways:
'FEEDS': {
    r'C:\Users\path\to\feeds\example\galleries.csv': {
        ...
    }
or, also on Windows:
'FEEDS': {
    '/Users/path/to/feeds/example/galleries.csv': {
        ...
    }
A full URI ('file:///') path always seems to work, even though the documentation mentions that it isn't necessary when exporting to the local filesystem:
'FEEDS': {
    'file:///absolute/path/to/feeds/example/galleries.csv': {
        ...
    }
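If hard-coding an absolute path is inconvenient, one option is to derive the file URI from a relative path. This is only a sketch: the helper name feed_uri is made up for illustration and is not part of Scrapy, and it assumes the relative path should be resolved against the directory the crawl is started from.

```python
from pathlib import Path


def feed_uri(relative_path: str) -> str:
    """Return a file:// URI for a path relative to the current working directory.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(relative_path).resolve().as_uri()


# The resulting string could then be used as a key in the FEEDS setting.
print(feed_uri("feeds/example/galleries.csv"))
```

Because the URI is computed when the settings are evaluated, the result depends on the working directory at that moment.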
Thanks for the response. I'm not sure this would actually have fixed it, since the feed was working perfectly with relative paths when the pipeline was deactivated. I might test that anyway some time.
However, I figured out another mistake in my code, and fixing it solved the problem without changing the paths: the docs state that the process_item function must return an item object, return a Twisted Deferred, or raise a DropItem exception. My code was derived from here, but I missed the return statements in the lines calling the process_..._item functions.
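The effect of a missing return can be reproduced without Scrapy. The toy classes below are stand-ins invented for this illustration (BrokenPipeline, FixedPipeline, and _check are not names from the original code); they only show that a method which calls a helper without returning its result hands None back to its caller.

```python
class BrokenPipeline:
    def process_item(self, item, spider):
        # The helper's result is discarded, so this method implicitly returns None.
        self._check(item)

    def _check(self, item):
        # Stand-in for the per-type helper, which does return the item.
        return item


class FixedPipeline(BrokenPipeline):
    def process_item(self, item, spider):
        # Returning the helper's result passes the item on to the caller.
        return self._check(item)


item = {"tag_id": "4"}
print(BrokenPipeline().process_item(item, spider=None))  # None
print(FixedPipeline().process_item(item, spider=None))   # {'tag_id': '4'}
```

Applied to the pipeline in the question, this corresponds to writing return in front of the two self.process_example_..._item(item, spider) calls.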
Tbh, I discovered the solution by accident while trying to replicate my issue in a less complex spider. I wrote up something like this and it worked:
def process_item(self, item, spider):
    if isinstance(item, ExampleTagItem):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
            return item
    elif isinstance(item, ExampleGalleryItem):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item
Since I'm very new to coding: any suggestions on how to reduce the repetition in this code? I could use "id" in both Item objects, but I would still need to differentiate between the two sets, so I have no idea how to do this...
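One possible way to cut the repetition, offered as a sketch rather than a definitive answer: keep a mapping from each item class to its key field, plus one set of seen keys per class. Several things here are assumptions: the items are written as plain dataclasses, the KEY_FIELDS name is made up, and the try/except fallbacks are stand-ins so the snippet also runs where Scrapy is not installed. Loading previously exported IDs in open_spider is left out for brevity.

```python
from dataclasses import dataclass

try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in used only when Scrapy is not installed."""

try:
    from itemadapter import ItemAdapter
except ImportError:
    class ItemAdapter:
        """Minimal stand-in: key-style access to an item's attributes."""

        def __init__(self, item):
            self._item = item

        def __getitem__(self, key):
            return getattr(self._item, key)


@dataclass
class ExampleTagItem:  # assumed shape, based on the 'fields' list in FEEDS
    tag_id: str
    url: str
    title: str


@dataclass
class ExampleGalleryItem:  # assumed shape, based on the 'fields' list in FEEDS
    id: str
    url: str
    tag_ids: str


class DuplicatesPipeline:
    # Hypothetical mapping: which field identifies each item class.
    KEY_FIELDS = {ExampleTagItem: "tag_id", ExampleGalleryItem: "id"}

    def open_spider(self, spider):
        # One independent set of seen keys per item class.
        self.seen = {item_class: set() for item_class in self.KEY_FIELDS}

    def process_item(self, item, spider):
        key_field = self.KEY_FIELDS.get(type(item))
        if key_field is None:
            return item  # not a class this pipeline deduplicates
        key = ItemAdapter(item)[key_field]
        seen = self.seen[type(item)]
        if key in seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        seen.add(key)
        return item
```

Note that this looks items up by their exact type, whereas the isinstance checks above also match subclasses; if the item classes are subclassed, the lookup would need to walk the mapping with isinstance instead.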