
How can I split a large CSV file (7GB) in Python?

I have a 7GB CSV file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?

You don't need Python to split a CSV file. Using your shell:

$ split -l 100 data.csv

This would split data.csv into chunks of 100 lines.

I had to do a similar task, and used the pandas package:

import pandas as pd

for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)

Maybe something like this?

#!/usr/local/cpython-3.3/bin/python

import csv

divisor = 10  # rows per output file

outfileno = 1
outfile = None

with open('big.csv', 'r', newline='') as infile:
    for index, row in enumerate(csv.reader(infile)):
        if index % divisor == 0:
            # start a new output file every `divisor` rows
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)

if outfile is not None:
    outfile.close()  # close the last output file

See the Python documentation on file objects (the objects returned by open(filename)): you can either read a specified number of bytes, or use readline to work through one line at a time.
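For example, a minimal sketch of both options (the file name and sizes are placeholders for illustration):

# grab a fixed-size slice of the file (read(n) counts characters in text mode)
with open('big.csv', 'r') as f:
    header = f.readline()               # keep the header row
    sample = f.read(250 * 1024 * 1024)  # roughly the next 250MB; the last line may be cut off mid-row

# or walk the file one line at a time without loading it all into memory
with open('big.csv', 'r') as f:
    for line in f:
        pass  # process each line here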

Here is a little Python script I used to split a file data.csv into several CSV part files. The number of part files can be controlled with chunk_size (the number of lines per part file).

The header line (column names) of the original file is copied into every part CSV file.

It works for big files because it reads one line at a time with readline() instead of loading the complete file into memory at once.

#!/usr/bin/env python3

def main():
    chunk_size = 9998  # lines

    def write_chunk(part, lines):
        with open('data_part_'+ str(part) +'.csv', 'w') as f_out:
            f_out.write(header)
            f_out.writelines(lines)

    with open('data.csv', 'r') as f:
        count = 0
        header = f.readline()
        lines = []
        for line in f:
            count += 1
            lines.append(line)
            if count % chunk_size == 0:
                write_chunk(count // chunk_size, lines)
                lines = []
        # write remainder
        if len(lines) > 0:
            write_chunk((count // chunk_size) + 1, lines)

if __name__ == '__main__':
    main()

I agree with @jonrsharpe: readline should be able to read one line at a time, even for big files.

If you are dealing with big CSV files, might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.

Hope it helps.
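As a rough sketch of that idea applied to the question (the file name and nrows value are assumptions; roughly a million rows stands in for the ~250MB sample the asker wants):

import pandas as pd

# pull just the first million rows into a DataFrame instead of reading all 7GB
sample = pd.read_csv('data.csv', nrows=1_000_000)
sample.to_csv('data_sample.csv', index=False)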

This graph shows the runtime difference of the different approaches outlined by other posters (on an 8-core machine, splitting a 2.9 GB file with 11.8 million rows of data into ~290 files).

[chart: runtime comparison of the shell, Python, Pandas and Dask approaches]

The shell approach is from Thomas Orozco, the Python approach is from Roberto, the Pandas approach is from Quentin Febvre, and here's the Dask snippet:

import dask.dataframe as dd

# dtypes is a dict mapping column names to dtypes, defined earlier for this dataset
ddf = dd.read_csv("../nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv", blocksize=10000000, dtype=dtypes)
ddf.to_csv("../tmp/split_csv_dask")

I'd recommend Dask for splitting files, even though it's not the fastest, because it's the most flexible solution (you can write out different file formats, perform processing operations before writing, easily modify compression formats, etc.). The Pandas approach is almost as flexible, but cannot perform processing on the entire dataset (like sorting the entire dataset before writing).

Bash / native Python filesystem operations are clearly quicker, but that's not what I'm typically looking for when I have a large CSV. I'm typically interested in splitting large CSVs into smaller Parquet files for performant, production data analyses. I don't usually care if the actual splitting takes a couple of minutes more; I'm more interested in splitting accurately.
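For that CSV-to-Parquet workflow, a minimal Dask sketch (the paths, blocksize and the pyarrow dependency are assumptions, not part of the benchmark above):

import dask.dataframe as dd

# lazily read the CSV in ~100MB blocks; each block becomes one partition
ddf = dd.read_csv("big.csv", blocksize=100_000_000)

# write one Parquet file per partition (needs pyarrow or fastparquet installed)
ddf.to_parquet("big_parquet/")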

I wrote a blog post that discusses this in more detail. You can probably Google around and find the post.

For the case where you want to split on rough byte boundaries, the newest datapoints are the bottom-most ones, and you want to put the newest datapoints in the first file:

from pathlib import Path
    
TEN_MB = 10000000
FIVE_MB = 5000000

def split_file_into_chunks(path, chunk_size=TEN_MB):
    path = str(path)
    output_prefix = path.rpartition('.')[0]
    output_ext = path.rpartition('.')[-1]

    with open(path, 'rb') as f:
        # first pass: record the byte offset at the end of every line
        seek_positions = []
        for x, line in enumerate(f):
            if not x:
                header = line
            seek_positions.append(f.tell())

        # second pass: walk the offsets backwards so the newest (bottom-most) rows
        # land in part 0, emitting a chunk whenever roughly chunk_size bytes accumulate
        part = 0
        last_seek_pos = seek_positions[-1]
        for seek_pos in reversed(seek_positions):
            if last_seek_pos-seek_pos >= chunk_size:
                with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
                    f.seek(seek_pos)
                    f_out.write(header)
                    f_out.write(f.read(last_seek_pos-seek_pos))

                last_seek_pos = seek_pos
                part += 1

        # whatever remains at the top of the file (header included) becomes the last part
        with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
            f.seek(0)
            f_out.write(f.read(last_seek_pos))

    # swap the newest chunk in place of the original file, then drop the backup
    Path(path).rename(path+'~')
    Path(f'{output_prefix}.arch.0.{output_ext}').rename(path)
    Path(path+'~').unlink()

Here is my code, which might help:

import os
import pandas as pd
import uuid


class FileSettings(object):
    def __init__(self, file_name, row_size=100):
        self.file_name = file_name
        self.row_size = row_size


class FileSplitter(object):

    def __init__(self, file_settings):
        self.file_settings = file_settings

        if not isinstance(self.file_settings, FileSettings):
            raise Exception("Please pass a FileSettings instance")

        # read the source file lazily, row_size rows at a time
        self.df = pd.read_csv(self.file_settings.file_name,
                              chunksize=self.file_settings.row_size)

    def run(self, directory="temp"):
        os.makedirs(directory, exist_ok=True)

        base_name = self.file_settings.file_name.split(".")[0]
        for counter, chunk in enumerate(self.df):
            file_name = "{}/{}_{}_row_{}_{}.csv".format(
                directory, base_name, counter,
                self.file_settings.row_size, uuid.uuid4()
            )
            chunk.to_csv(file_name)

        return True


def main():
    helper =  FileSplitter(FileSettings(
        file_name='sample1.csv',
        row_size=10
    ))
    helper.run()

if __name__ == '__main__':
    main()
