
Close open csv file in scrapy CSV Export Pipeline

I'm attempting to scrape articles on 100 companies, and I want to save the content from the multiple articles to a separate csv file for each company. I have the scraper and a csv export pipeline built, and they work fine; however, the spider opens a new csv file for each company (as it should) without ever closing the file opened for the previous company.

The csv files are only closed after the spider closes, but because of the amount of data I am scraping for each company, the file sizes are significant and put a strain on my machine's memory. This also cannot realistically scale: if I increase the number of companies (something I eventually want to do), I will eventually hit an error for having too many files open at a time. Below is my csv exporter pipeline. I would like to find a way to close the csv file for the current company before moving on to the next company within the same spider:

I guess, theoretically, I could open the file for each article, write the content to new rows, then close it and reopen it for the next article, but that would slow the spider down significantly. I'd like to keep the file open for a given company while the spider is still making its way through that company's articles, then close it when the spider moves on to the next company.

I'm sure there is a solution but I have not been able to figure one out. Would greatly appreciate help solving this.

from scrapy.exporters import CsvItemExporter


class PerTickerCsvExportPipeline:
    """Distribute items across multiple CSV files according to their 'ticker' field"""

    def open_spider(self, spider):
        self.ticker_to_exporter = {}

    def close_spider(self, spider):
        # finish_exporting() is called here, but the underlying csv files are
        # never explicitly closed, so they all stay open until the process exits
        for exporter in self.ticker_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        ticker = item['ticker']
        if ticker not in self.ticker_to_exporter:
            f = open('{}_article_content.csv'.format(ticker), 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.ticker_to_exporter[ticker] = exporter
        return self.ticker_to_exporter[ticker]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item
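For completeness, the pipeline is assumed to be registered in the project's settings through the standard ITEM_PIPELINES setting (the dotted module path below is illustrative, not from the original post):

# settings.py -- enable the pipeline; adjust the path to your own project layout
ITEM_PIPELINES = {
    'myproject.pipelines.PerTickerCsvExportPipeline': 300,
}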

The problem probably is that you keep all the ItemExporters and files open until the spider closes. I suggest that you try to close the CsvItemExporter and the corresponding file for the previous company before you open a new one.

def open_spider(self, spider):
    self.ticker_to_exporter = {}
    self.files = []

def close_exporters(self):
    # iterate over the values and clear afterwards; deleting entries while
    # iterating over the dict would raise a RuntimeError
    for exporter in self.ticker_to_exporter.values():
        exporter.finish_exporting()
    self.ticker_to_exporter.clear()

def close_files(self):
    # close every open file, then drop the references
    for f in self.files:
        f.close()
    self.files.clear()

def close_spider(self, spider):
    self.close_exporters()
    self.close_files()

def _exporter_for_item(self, item):
    ticker = item['ticker']
    if ticker not in self.ticker_to_exporter:
        # close the previous company's exporter and file before opening a new one
        self.close_exporters()
        self.close_files()
        # CsvItemExporter needs a binary-mode file; append mode keeps earlier
        # rows if the same ticker shows up again later
        f = open('{}_article_content.csv'.format(ticker), 'ab')
        self.files.append(f)
        exporter = CsvItemExporter(f)
        exporter.start_exporting()
        self.ticker_to_exporter[ticker] = exporter
    return self.ticker_to_exporter[ticker]
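If it helps, the same idea can be consolidated so that the open file always sits right next to its exporter and the two are closed together. This is only a minimal sketch, assuming items arrive grouped by ticker; the helper method names are illustrative and not part of the original code:

from scrapy.exporters import CsvItemExporter


class PerTickerCsvExportPipeline:
    """Write each ticker's items to its own csv file, keeping at most one file open."""

    def open_spider(self, spider):
        self.current_ticker = None
        self.file = None
        self.exporter = None

    def _close_current(self):
        # finish the exporter and close its file in one place
        if self.exporter is not None:
            self.exporter.finish_exporting()
            self.file.close()
            self.exporter = None
            self.file = None

    def _open_for(self, ticker):
        self._close_current()
        # binary append mode: earlier rows survive if a ticker reappears,
        # although CsvItemExporter will then write a second header row
        self.file = open('{}_article_content.csv'.format(ticker), 'ab')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()
        self.current_ticker = ticker

    def process_item(self, item, spider):
        if item['ticker'] != self.current_ticker:
            self._open_for(item['ticker'])
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self._close_current()

Note that Scrapy can interleave items from concurrent requests, so if items for different tickers are not strictly grouped, the append mode above only prevents data loss; it does not avoid the extra open/close cycles.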
