簡體   English   中英

分割scrapy的大型CSV文件

[英]Split scrapy's large CSV file

是否可以對每個不超過5000行的CSV文件進行草率寫入? 如何給它一個自定義的命名方案? 我應該修改CsvItemExporter嗎?

您在使用Linux嗎?

split命令在這種情況下非常有用。

split -l 5000  -d --additional-suffix .csv items.csv items-

有關選項,請參見split --help

嘗試以下管道:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import CsvItemExporter

import datetime

class MyPipeline(object):

    def __init__(self, stats):
        self.stats = stats
        self.base_filename = "result/amazon_{}.csv"
        self.next_split = self.split_limit = 50000 # assuming you want to split 50000 items/csv
        self.create_exporter()  

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def create_exporter(self):
        now = datetime.datetime.now()
        datetime_stamp = now.strftime("%Y%m%d%H%M")
        self.file = open(self.base_filename.format(datetime_stamp),'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()       

    def process_item(self, item, spider):
        if (self.stats.get_stats()['item_scraped_count'] >= self.next_split):
            self.next_split += self.split_limit
            self.exporter.finish_exporting()
            self.file.close()
            self.create_exporter
        self.exporter.export_item(item)
        return item

不要忘記將管道添加到您的設置中:

ITEM_PIPELINES = {
   'myproject.pipelines.MyPipeline': 300,   
}

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM