Pandas Combine Multiple CSV's and Output as One Large File

So I currently have a directory, we'll call it /mydir, that contains 36 CSV files, each 2.1 GB and with the same dimensions. I want to read them into pandas, concatenate them together side-by-side (so the number of rows stays the same), and then output the resulting dataframe as one large CSV. The code I have for this works for combining a few of them, but hits a memory error after a certain point. I was wondering if there is a more efficient way to do this than what I have.

df = pd.DataFrame()
for file in os.listdir('/mydir'):
    df = pd.concat([df, pd.read_csv('/mydir/' + file, dtype='float')], axis=1)
df.to_csv('mydir/file.csv')

It was suggested to me to break the work into smaller pieces: combine the files in groups of 6, then combine those results in turn. But I don't know whether this is a valid solution that will avoid the memory error problem.

EDIT: view of the directory:

-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv

Chunk Them All!

from glob import glob
import os

import pandas as pd

# grab files
files = glob('./[0-9][0-9][0-9][0-9].csv')

# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
    return pd.read_csv(f, index_col=0, chunksize=n)

# simplify the concatenation
def concat(lot):
    return pd.concat(lot, axis=1)

# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
    if os.path.exists(f):
        mode = 'a'
        header = False
    else:
        mode = 'w'
        header = True
    df.to_csv(f, mode=mode, header=header)

# Fun stuff! zip will take the next element of the generator
# for each generator created for each file
# concat one chunk at a time and write
for lot in zip(*[read_csv(f, n=10) for f in files]):
    to_csv('out.csv', concat(lot))

Assuming the answer to MaxU's question is that all the files have the same number of rows, and assuming further that minor CSV differences like quoting are done the same way in all the files, you don't need to do this with Pandas. Regular file `readline` calls will give you strings that you can concatenate and write out. This also assumes you can supply the number of rows. Something like this code:

    numrows = 999  # whatever; probably pass as an argument or on the cmdline
    outfile = open('myout.csv', 'w')
    infile_names = ['file01.csv',
                    'file02.csv',
                    # ...
                    'file36.csv']

    # open all the input files
    infiles = []
    for fname in infile_names:
        infiles.append(open(fname))

    for i in range(numrows):
        # read a line from each input file and build the output row
        out_csv = ''
        for infile2read in infiles:
            out_csv += infile2read.readline().strip() + ','
        # replace the final comma with a newline
        # (strings are immutable, so rebuild instead of assigning to out_csv[-1])
        out_csv = out_csv[:-1] + '\n'

        # write this row's data out to the output file
        outfile.write(out_csv)

    # close the files
    for f in infiles:
        f.close()
    outfile.close()
