Split big csv file by the value of a column in python

I have a large CSV file that I cannot handle in memory with Python. I am splitting it into multiple chunks after grouping by the value of a specific column, using the following logic:

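import csv
from itertools import groupby

def splitDataFile(self, data_file):

    self.list_of_chunk_names = []
    csv_reader = csv.reader(open(data_file, "rb"), delimiter="|")
    columns = csv_reader.next()  # header row

    for key,rows in groupby(csv_reader, lambda row: (row[1])):
        file_name = "data_chunk"+str(key)+".csv"
        self.list_of_chunk_names.append(file_name)

        with open(file_name, "w") as output:
            # repeat the header in every chunk, then write the rows of this group
            output.write("|".join(columns)+"\n")
            for row in rows:
                output.write("|".join(row)+"\n")

    print "message: list of chunks ", self.list_of_chunk_names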


The logic works, but it is slow. How can I optimize this, for instance with pandas?

Edit

Further explanation: I am not looking for a simple split into chunks of equal size (such as 1000 rows each). I want to split by the value of a column, which is why I am using groupby.
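One thing worth checking with groupby: itertools.groupby only groups consecutive rows that share a key. If the file is not sorted by the split column, the same key can come back in a later group, and opening its chunk file with "w" again would overwrite the earlier rows. A minimal sketch with made-up rows (`rows` and `grouped_keys` are illustrative names, not from the question):

```python
from itertools import groupby

# Hypothetical sample rows; the key is the second field, as in the question.
rows = [
    ["1", "a"],
    ["2", "a"],
    ["3", "b"],
    ["4", "a"],  # "a" appears again after "b"
]

# groupby starts a new group every time the key changes,
# so "a" is reported twice here.
grouped_keys = [
    (key, len(list(group)))
    for key, group in groupby(rows, key=lambda row: row[1])
]
print(grouped_keys)
```

If the input is already sorted by that column, each key appears in exactly one group and this is not a concern.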

Use this Python 3 program:

 #!/usr/bin/env python3
 import binascii
 import csv
 import os.path
 import sys
 from tkinter.filedialog import askopenfilename, askdirectory
 from tkinter.simpledialog import askinteger

 def split_csv_file(f, dst_dir, keyfunc):
     csv_reader = csv.reader(f)
     csv_writers = {}
     for row in csv_reader:
         k = keyfunc(row)
         if k not in csv_writers:
             csv_writers[k] = csv.writer(open(os.path.join(dst_dir, k),
                                              mode='w', newline=''))
         # write the row to the file that belongs to its key
         csv_writers[k].writerow(row)

 def get_args_from_cli():
     input_filename = sys.argv[1]
     column = int(sys.argv[2])
     dst_dir = sys.argv[3]
     return (input_filename, column, dst_dir)

 def get_args_from_gui():
     input_filename = askopenfilename(
         filetypes=(('CSV', '.csv'),),
         title='Select CSV Input File')
     column = askinteger('Choose Table Column', 'Table column')
     dst_dir = askdirectory(title='Select Destination Directory')
     return (input_filename, column, dst_dir)

 if __name__ == '__main__':
     if len(sys.argv) == 1:
         input_filename, column, dst_dir = get_args_from_gui()
     elif len(sys.argv) == 4:
         input_filename, column, dst_dir = get_args_from_cli()
     else:
         raise Exception("Invalid number of arguments")
     with open(input_filename, mode='r', newline='') as f:
         split_csv_file(f, dst_dir, lambda r: r[column-1]+'.csv')
         # if the column has funky values resulting in invalid filenames
         # replace the line from above with:
         # split_csv_file(f, dst_dir, lambda r: binascii.b2a_hex(r[column-1].encode('utf-8')).decode('utf-8')+'.csv')

Save it as split-csv.py and run it from Explorer or from the command line.

For example, to split data.csv based on column 1 and write the output files under dstdir, use:

 python split-csv.py data.csv 1 dstdir

If you run it without arguments, a Tkinter-based GUI will prompt you to choose the input file, the column (1-based index) and the destination directory.
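The commented-out alternative in the script hex-encodes the column value so that any value yields a valid file name. A small sketch of that encoding and how to reverse it later; `key_to_filename` and `filename_to_key` are hypothetical helper names, not part of the script:

```python
import binascii

def key_to_filename(key):
    """Hex-encode a column value so it is safe to use as a file name."""
    return binascii.b2a_hex(key.encode('utf-8')).decode('utf-8') + '.csv'

def filename_to_key(filename):
    """Recover the original column value from a hex-encoded file name."""
    hex_part = filename[:-len('.csv')]
    return binascii.a2b_hex(hex_part).decode('utf-8')

name = key_to_filename('a/b')   # '/' would not be allowed in a file name
print(name)                     # 612f62.csv
print(filename_to_key(name))    # a/b
```

The file names are no longer readable at a glance, so this is only worth it when the column really contains characters such as slashes or colons.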


I am going with something like the following, where I iterate over the unique values of the column to split by and filter the data chunks on each value.

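import pandas as pd

def splitWithPandas(data_file, split_by_column):
    # read only the split column to collect its unique values
    values_to_split_by = pd.read_csv(data_file, delimiter="|", usecols=[split_by_column])
    values_to_split_by = pd.unique(values_to_split_by.values.ravel())

    for i in values_to_split_by:
        # re-read the whole file in chunks, keeping only the rows that match i
        iter_csv = pd.read_csv(data_file, delimiter="|", chunksize=100000)
        df = pd.concat([chunk[chunk[split_by_column] == i] for chunk in iter_csv])
        # str() so that numeric values can be used in the file name
        df.to_csv("data_chunk_"+str(i), sep="|", index=False)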

You will probably get the best performance by using the built-in chunking features of pandas (the chunksize keyword argument to read_csv):

http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.read_csv.html

For example,

reader = pd.read_table('my_data.csv', chunksize=4)
for chunk in reader:
    print(chunk)  # each chunk is a DataFrame with at most 4 rows
EDIT:

This might get you somewhere:

import pandas as pd

group_col_indx = 1
group_col = pd.read_csv('test.csv', usecols=[group_col_indx])
keys = group_col.iloc[:,0].unique()

for key in keys:
    df_list = []
    reader = pd.read_csv('test.csv', chunksize=2)
    for chunk in reader:
        good_rows = chunk[chunk.iloc[:,group_col_indx] == key]
        # collect the matching rows from every chunk
        df_list.append(good_rows)
    df_key = pd.concat(df_list)

I suspect that your biggest bottleneck is opening and closing a file handle every time you process a new block of rows. A better approach, as long as the number of files you write to is not too large, is to keep all the files open. Here's an outline:

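def splitDataFile(self, data_file):
    open_files = dict()
    input_file = open(data_file, "rb")
    try:
        csv_reader = csv.reader(input_file, ...)
        for key, rows in groupby(csv_reader, lambda row: (row[1])):
            try:
                output = open_files[key]
            except KeyError:
                file_name = "data_chunk" + str(key) + ".csv"
                output = open(file_name, "w")
                open_files[key] = output  # remember the handle for later groups
            # ... write rows to output ...
    finally:
        for open_file in open_files.itervalues():
            open_file.close()
        input_file.close()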

Of course, if you only have one group with any given key, this will not help. (Actually it may make things worse, because you wind up holding a bunch of files open unnecessarily.) The more often you wind up writing to the same file, the more benefit you'll get from this change.
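For reference, the keep-the-files-open idea could look like this in Python 3. This is only a sketch under some assumptions: the input has a header row, the delimiter is "|" as in the question, the key values are valid file names, and the number of distinct keys stays below the operating system's open-file limit. `split_by_column` is a hypothetical name, and the sketch replaces groupby with a dictionary lookup per row, so rows with the same key do not need to be adjacent.

```python
import csv
import os
from contextlib import ExitStack

def split_by_column(input_path, column_index, dst_dir, delimiter="|"):
    """Write each row to dst_dir/data_chunk<key>.csv, opening each file once."""
    writers = {}
    with ExitStack() as stack:
        input_file = stack.enter_context(open(input_path, newline=""))
        reader = csv.reader(input_file, delimiter=delimiter)
        header = next(reader)
        for row in reader:
            key = row[column_index]
            writer = writers.get(key)
            if writer is None:
                # First row for this key: open its file and write the header.
                path = os.path.join(dst_dir, "data_chunk" + key + ".csv")
                output = stack.enter_context(open(path, "w", newline=""))
                writer = writers[key] = csv.writer(output, delimiter=delimiter)
                writer.writerow(header)
            writer.writerow(row)
    # Leaving the with block closes the input and every output file.
    return sorted(writers)
```

Calling `split_by_column("data.csv", 1, "dstdir")` would split on the second column; both paths here are placeholders.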

You can combine this with pandas, if you want, and use the chunking features of read_csv or read_table to handle the input processing.
