
Scrapy exports empty feed with FEED_STORE_EMPTY == False (default)

I recently ran a spider that exported an empty feed even though I never changed the FEED_STORE_EMPTY setting (False by default). This is my story.

My spider looks, in part, like this. Its extraction and parsing work as expected and are not shown.

import scrapy
from scrapy.utils.project import get_project_settings
from my_project.spiders import MySpider
from my_project.items import MyItem


class SpamSpider(MySpider):
    name = 'spam'
    feed_format = 'xml'
    proj_xml_path = get_project_settings()['OUTPUT_XML_PATH']

    custom_settings = {
        # Raw string avoids the invalid '\%' escape-sequence warning;
        # Scrapy fills in the %(name)s placeholder with the spider name.
        'FEED_URI': proj_xml_path + r'\%(name)s.xml',
        'FEED_FORMAT': feed_format,
    }

And my settings.py contains the following relevant lines:

# When FEED_FORMAT is <key>, use <value> as the exporter
FEED_EXPORTERS = {
    'xml': 'my_project.exporters.XmlMyItemExporter'
}

# Output path (a UNC path; the raw string avoids invalid escape sequences)
OUTPUT_XML_PATH = r'\\this\path\works'

I've defined a custom exporter in exporters.py. It lives alongside settings.py, pipelines.py, and the others in the my_project directory. It is very simple; its purpose is mainly to provide custom names for the XML nodes.

from scrapy.exporters import XmlItemExporter


class XmlMyItemExporter(XmlItemExporter):
    # Rename the root and item nodes, keep empty item fields in the
    # output, and pretty-print with a two-space indent.
    def __init__(self, file, **kwargs):
        super().__init__(file, item_element='my_item', root_element='my_items',
                         export_empty_fields=True, indent=2)

Nowhere do I change FEED_STORE_EMPTY, the setting that, when True, allows an empty feed to be exported. The kwarg export_empty_fields is not the same thing; it doesn't apply to the feed as a whole, only to empty fields within individual items. I looked here and here to see whether the two can affect each other at any point, and I don't see how they can.
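
To make the distinction concrete, here is a minimal, runnable sketch of what export_empty_fields actually controls. The Product item here is hypothetical, purely for illustration:

from io import BytesIO

import scrapy
from scrapy.exporters import XmlItemExporter


class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()


buf = BytesIO()
exporter = XmlItemExporter(buf, export_empty_fields=True, indent=2)
exporter.start_exporting()
exporter.export_item(Product(name='spam'))  # 'price' is declared but never set
exporter.finish_exporting()
print(buf.getvalue().decode())

With export_empty_fields=True the unset price field comes out as an empty <price></price> node inside the item; with the default False it is omitted. Either way this is per-item behavior, while FEED_STORE_EMPTY governs whether the feed file itself survives a crawl that scrapes zero items.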

Further, if I log self.logger.debug(self.settings['FEED_STORE_EMPTY']) from within my spider, it shows False.

Nevertheless, if my crawl (scrapy crawl spam) gets a 500 on the start_requests URL, the spider closes after 2 retries and I'm left with an empty feed. I can't really reproduce the 500 on demand, but I have this spider scheduled hourly precisely to uncover issues like this. I don't want an empty feed because it has negative consequences for my data pipeline, and I don't feel I should have to handle an empty feed downstream when I should be able to prevent it from being created in the first place.

I've run this spider 100 other times and it's performed as expected.

Thanks in advance for any insight you can lend.

In the case where a spider crawls and retrieves no items, it appears that the difference in behavior between FEED_STORE_EMPTY set to True vs False is simply that the former gives you an XML feed that looks like this

<?xml version="1.0" encoding="utf-8"?>
<my_items>
</my_items>

whereas the latter will just give you an empty file.

In retrospect this makes sense: the file is created up front so items can be pushed into it during the crawl. I had been assuming that this empty file would be cleaned up / deleted if no items were scraped. With that in mind, I can explore putting that functionality into an extension, sketched below.
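
For example, something along these lines might do it. This is only a sketch: the DeleteEmptyFeed name and the %(name)s handling are mine, it assumes FEED_URI is a plain local path as above, and it deletes on engine_stopped rather than spider_closed so it can't race the built-in feed exporter, which closes the file from its own spider_closed handler:

import os

from scrapy import signals
from scrapy.exceptions import NotConfigured


class DeleteEmptyFeed:
    # Count scraped items; if the crawl ends with zero, delete the
    # (empty) feed file that Scrapy created up front.

    def __init__(self, feed_uri):
        self.feed_uri = feed_uri
        self.item_count = 0
        self.resolved_path = None

    @classmethod
    def from_crawler(cls, crawler):
        uri = crawler.settings.get('FEED_URI')
        if not uri:
            raise NotConfigured
        ext = cls(uri)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.engine_stopped, signal=signals.engine_stopped)
        return ext

    def item_scraped(self, item, spider):
        self.item_count += 1

    def spider_closed(self, spider):
        # Resolve the %(name)s placeholder the same way FEED_URI does;
        # only this simple case is handled here.
        self.resolved_path = self.feed_uri % {'name': spider.name}

    def engine_stopped(self):
        # engine_stopped fires after the spider_closed handlers have run,
        # so the feed exporter has already flushed and closed the file.
        if self.item_count == 0 and self.resolved_path and os.path.isfile(self.resolved_path):
            os.remove(self.resolved_path)

The extension would then be enabled through EXTENSIONS in settings.py, e.g. {'my_project.extensions.DeleteEmptyFeed': 500}.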
