
Scrapy: Use Feed Exports after custom Item Pipeline without custom Feed Exporter class?

My Spider looks like this:

import scrapy

# assuming the item classes live in the project's items module
from img_clear.items import ExampleTagItem, ExampleGalleryItem


class ExampleSpider(scrapy.Spider):
    name = 'example'

    custom_settings = {
        'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,},
        'FEEDS': {
            'feeds/example/tags.csv': {
                'format': 'csv',
                'fields': ["tag_id", "url", "title"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleTagItem],
                'overwrite': False
            },
            'feeds/example/galleries.csv': {
                'format': 'csv',
                'fields': ["id", "url", "tag_ids"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleGalleryItem],
                'overwrite': False,
            }
        }
    }

This is the img_clear.pipelines.DuplicatesPipeline:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

from img_clear.items import ExampleTagItem, ExampleGalleryItem  # assumed items module


class DuplicatesPipeline:
    def open_spider(self, spider):
        if spider.name == "example":
            # seed the seen-ID sets from rows exported in earlier runs
            # (the feeds use 'overwrite': False, so the files accumulate)
            with open("feeds/example/galleries.csv", "r") as rf:
                csv = rf.readlines()
            self.ids_seen = set([str(line.split(",")[0]) for line in csv])

            with open("feeds/example/tags.csv", "r") as rf:
                tags_csv = rf.readlines()
            self.tag_ids_seen = set([str(line.split(",")[0]) for line in tags_csv])

    def process_item(self, item, spider):
        if isinstance(item, ExampleTagItem):
            self.process_example_tag_item(item, spider)    
        elif isinstance(item, ExampleGalleryItem):
            self.process_example_gallery_item(item, spider)

    def process_example_tag_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
            return item

    def process_example_gallery_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item

With the item pipeline activated it will drop some items (logging: [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {'tag_id': '4',... ) and return others (logging: [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/10232335/> ), but nothing is written to the files. Somehow the returned items don't seem to reach the feed exports extension. What am I missing?

  • When commenting out the 'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,} in the custom_settings, items are saved to the correct csv files.
  • Using scrapy crawl example -o test.csv also creates an empty csv when the pipeline is activated, so the issue seems to be with the pipeline.
  • Printing the items right before they should be returned did print the correct item information.
  • The pipeline is derived from the Scrapy docs.

I have come across this same issue many times as well. It has something to do with the path used in the FEEDS section of your custom settings. There are a few things that you can try to fix the problem.

  1. Using absolute paths.

On Linux/macOS:

'FEEDS': {
    '/absolute/path/to/feeds/example/galleries.csv': {
    ...
}

On Windows you can try it a few ways:

'FEEDS': {
    r'C:\Users\path\to\feeds\example\galleries.csv': {
        ...
}

or also on Windows:

'FEEDS': {
    '/Users/path/to/feeds/example/galleries.csv': {
        ...
}
  2. Adding a scheme to the FEEDS path ('file:///') always seems to work, even though the documentation mentions it isn't necessary when exporting to the local filesystem.
'FEEDS': {
    'file:///absolute/path/to/feeds/example/galleries.csv': {
        ...
}

Thanks for the response, but I'm not sure this would actually have fixed it, since the feed was working perfectly with relative paths when the pipeline was deactivated. I might test that anyway some time.

However, I found another mistake in my code, and fixing it solved the problem without changing the paths: the docs state that the process_item function must return an item object, return a Twisted Deferred, or raise a DropItem exception. My code was derived from the duplicates-filter example in the docs, but I was missing the return statements in the lines calling the process_..._item functions.
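
In other words, the minimal fix is to return the result of those calls, so the item each helper returns is passed on to the feed exports:

def process_item(self, item, spider):
    if isinstance(item, ExampleTagItem):
        return self.process_example_tag_item(item, spider)
    elif isinstance(item, ExampleGalleryItem):
        return self.process_example_gallery_item(item, spider)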

Tbh, I discovered the solution by accident while trying to replicate my issue in a less complex spider; I wrote up something like this and it worked:

def process_item(self, item, spider):
    if isinstance(item, ExampleTagItem):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
        return item
    elif isinstance(item, ExampleGalleryItem):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
        return item

Since I'm very new to coding: any suggestions on how to reduce the repetition in this code? I could use "id" in both Item objects, but I would still need to differentiate between the two seen-sets, so I have no idea how to do this...
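
One possible refactoring (an illustrative sketch, not from the thread; it assumes the item classes are importable as above): map each item class to its key field and its own seen-set, so a single code path handles both item types while the sets stay separate.

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

from img_clear.items import ExampleTagItem, ExampleGalleryItem  # assumed items module


class DuplicatesPipeline:
    def __init__(self):
        # one (key field, seen-set) pair per item class
        self.seen = {
            ExampleTagItem: ("tag_id", set()),
            ExampleGalleryItem: ("id", set()),
        }

    def process_item(self, item, spider):
        key_field, seen = self.seen[type(item)]
        value = ItemAdapter(item)[key_field]
        if value in seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        seen.add(value)
        return item

open_spider could then seed each set from its csv file the same way as before.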
