
Scrapy: Use Feed Exports after custom Item Pipeline without custom Feed Exporter class?

My Spider looks like this:

import scrapy

# assuming the item classes live in the project's items module
from img_clear.items import ExampleTagItem, ExampleGalleryItem


class ExampleSpider(scrapy.Spider):
    name = 'example'

    custom_settings = {
        'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,},
        'FEEDS': {
            'feeds/example/tags.csv': {
                'format': 'csv',
                'fields': ["tag_id", "url", "title"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleTagItem],
                'overwrite': False
            },
            'feeds/example/galleries.csv': {
                'format': 'csv',
                'fields': ["id", "url", "tag_ids"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleGalleryItem],
                'overwrite': False,
            }
        }
    }

This is the img_clear.pipelines.DuplicatesPipeline:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

from img_clear.items import ExampleTagItem, ExampleGalleryItem  # assumed items module


class DuplicatesPipeline:
    def open_spider(self, spider):
        if spider.name == "example":
            # seed the seen-ID sets from rows exported in earlier runs
            # (the feeds use 'overwrite': False, so the files accumulate)
            with open("feeds/example/galleries.csv", "r") as rf:
                csv = rf.readlines()
            self.ids_seen = set([str(line.split(",")[0]) for line in csv])

            with open("feeds/example/tags.csv", "r") as rf:
                tags_csv = rf.readlines()
            self.tag_ids_seen = set([str(line.split(",")[0]) for line in tags_csv])

    def process_item(self, item, spider):
        if isinstance(item, ExampleTagItem):
            self.process_example_tag_item(item, spider)    
        elif isinstance(item, ExampleGalleryItem):
            self.process_example_gallery_item(item, spider)

    def process_example_tag_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
            return item

    def process_example_gallery_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item

With the item pipeline activated it will drop some items (logging: [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {'tag_id': '4',... ) and return others (logging: [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/10232335/> ), but nothing is written to the files. Somehow the returned items don't seem to reach the feed exports extension. What am I missing?

  • When commenting out the 'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,} in the custom_settings, items are saved to the correct csv files.
  • Using scrapy crawl example -o test.csv also creates an empty csv when the pipeline is activated, so the issue seems to be with the pipeline.
  • Printing the items right before they should be returned did print the correct item information.
  • The pipeline is derived from the Scrapy docs.

I have come across this same issue many times as well. It has something to do with the path used in the FEEDS section of your custom settings. There are a few things that you can try to fix the problem.

  1. Using absolute paths.

On Linux/macOS:

'FEEDS': {
    '/absolute/path/to/feeds/example/galleries.csv': {
    ...
}

On Windows you can try it a few ways:

'FEEDS': {
    r'C:\Users\path\to\feeds\example\galleries.csv': {
        ...
}

or also on Windows:

'FEEDS': {
    '/Users/path/to/feeds/example/galleries.csv': {
        ...
}
  2. Adding a scheme to the FEEDS path ('file:///') always seems to work, even though the documentation mentions it isn't necessary when exporting to the local filesystem.
'FEEDS': {
    'file:///absolute/path/to/feeds/example/galleries.csv': {
        ...
}

Thanks for the response, but I'm not sure this would actually have fixed it, since the feed was working perfectly with relative paths when the pipeline was deactivated. I might test that anyway some time.

However, I found another mistake in my code, and fixing it solved the problem without changing the paths: the docs state that the process_item function must return an item object, return a Twisted Deferred, or raise a DropItem exception. My code was derived from the duplicates-filter example in the docs, but I was missing the return statements in the lines calling the process_..._item functions.
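
In other words, the minimal fix is to return the result of those calls, so the item each helper returns is passed on to the feed exports:

def process_item(self, item, spider):
    if isinstance(item, ExampleTagItem):
        return self.process_example_tag_item(item, spider)
    elif isinstance(item, ExampleGalleryItem):
        return self.process_example_gallery_item(item, spider)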

Tbh, I discovered the solution by accident while trying to replicate my issue in a less complex spider; I wrote up something like this and it worked:

def process_item(self, item, spider):
    if isinstance(item, ExampleTagItem):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
        return item
    elif isinstance(item, ExampleGalleryItem):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
        return item

Since I'm very new to coding: any suggestions on how to reduce the repetition in this code? I could use "id" in both Item objects, but I would still need to differentiate between the two seen-sets, so I have no idea how to do this...
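
One possible refactoring (an illustrative sketch, not from the thread; it assumes the item classes are importable as above): map each item class to its key field and its own seen-set, so a single code path handles both item types while the sets stay separate.

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

from img_clear.items import ExampleTagItem, ExampleGalleryItem  # assumed items module


class DuplicatesPipeline:
    def __init__(self):
        # one (key field, seen-set) pair per item class
        self.seen = {
            ExampleTagItem: ("tag_id", set()),
            ExampleGalleryItem: ("id", set()),
        }

    def process_item(self, item, spider):
        key_field, seen = self.seen[type(item)]
        value = ItemAdapter(item)[key_field]
        if value in seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        seen.add(value)
        return item

open_spider could then seed each set from its csv file the same way as before.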
