
Concat huge csv files using Dask

I'm trying to concat three csv files (8G, 4G, 6G respectively) into one csv file, and my memory is 16G. Is there a way for me to concat these csv files on columns without getting a memory error?

My datasets are like

A  B  C             D   E   F           G    H    I
1  2  3             4   5   6           7    8    9

My target is to merge them into

A  B  C  D  E  F  G  H  I 
  ...

My code is like

import gc
import dask.dataframe as dd

def combine_features(raw_feature_dir, connect_feature, time_feature_dir, feature_set):
    df1 = dd.read_csv(raw_feature_dir)
    df2 = dd.read_csv(connect_feature)
    # df3 = dd.read_csv(time_feature_dir)

    gc.collect()
    df4 = df1.merge(df2)

    df4.to_csv(feature_set)

I'm planning to merge two files first and then merge the next one, but it still shows a memory error.

Is there a way to merge huge csv files using Dask, or some other tool?

For example, to compress the csv files and then concat them? Or to use a generator-like read-and-write handler that takes a chunk of data each time?

Thank you!

I think you don't want to use merge but concat as stated in your question.

Find below a simple example:

import pandas as pd
import dask.dataframe as dd

df1 = dd.from_pandas(pd.DataFrame({'A':[1,2,3],
                                   'B':[1,4,3], 
                                   'C':[1,2,5]}), 
                                    npartitions=10)
df2 = dd.from_pandas(pd.DataFrame({'D':[0,2,3], 
                                   'E':[1,9,3], 
                                   'F':[1,6,5]}), 
                                    npartitions=10)

dd.concat([df1,df2], axis=1).head(5, npartitions=2)

Output:

   A  B  C  D  E  F
0  1  1  1  0  1  1
1  2  4  2  2  9  6
2  3  3  5  3  3  5

A CSV is row-like storage, so it's not easy to append whole columns. One option, as mentioned in a comment, is to split your data into smaller chunks, add columns to the chunks of your CSV, and then append that chunk to a local CSV file you are building (on disk, not in memory).

You could use the skiprows and nrows options of the pandas read_csv method to read in a specific range of indices from your 3 files, combine them into one dataframe in memory (representing a chunk of your desired CSV), and then append to the CSV you are building on disk.
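
A rough sketch of that chunked approach (the file names are placeholders, and it assumes the three files have the same number of rows):

import pandas as pd

files = ["raw_features.csv", "connect_features.csv", "time_features.csv"]  # placeholder paths
out_path = "feature_set.csv"
chunk_size = 100_000  # rows per chunk; tune so three chunks fit comfortably in memory

start = 0
first_chunk = True
while True:
    # Skip the rows already processed (row 0 is the header) and read the next chunk from each file.
    parts = [pd.read_csv(f, skiprows=range(1, start + 1), nrows=chunk_size) for f in files]
    if any(len(p) == 0 for p in parts):
        break
    # Column-wise concat of the three chunks, then append to the output CSV on disk.
    combined = pd.concat([p.reset_index(drop=True) for p in parts], axis=1)
    combined.to_csv(out_path, mode="w" if first_chunk else "a", header=first_chunk, index=False)
    first_chunk = False
    start += chunk_size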

Another option is to use a different storage format that might allow appending columns more efficiently. Dask seems to have a few options.
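
For example, one way to start is to convert each CSV to Parquet with Dask, which keeps memory bounded and leaves you with a columnar format to work with (a sketch, assuming pyarrow or fastparquet is installed and the paths are placeholders):

import dask.dataframe as dd

# Placeholder paths; Parquet is columnar, so later steps only read the columns they need.
for src, dst in [("raw_features.csv", "raw_features.parquet"),
                 ("connect_features.csv", "connect_features.parquet"),
                 ("time_features.csv", "time_features.parquet")]:
    dd.read_csv(src).to_parquet(dst)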

Dask also has a single_file option for its to_csv method, but I don't think it will help in your case since you need to append columns.

I will assume that you have standard csv files. The least memory-consuming way is to use only the csv module. That way you will process one line at a time:

import csv

def combine_features(raw_feature_dir, connect_feature, time_feature_dir, feature_set):
    with open(raw_feature_dir) as fd1, open(connect_feature) as fd2, \
            open(time_feature_dir) as fd3, open(feature_set, "w", newline="") as fdout:
        fds = [fd1, fd2, fd3]
        readers = [csv.reader(fdi) for fdi in fds]
        writer = csv.writer(fdout)
        try:
            while True:
                # Take the next row from each input file and chain their fields side by side.
                row = [field for reader in readers for field in next(reader)]
                writer.writerow(row)
        except StopIteration:
            pass
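
Called with the question's (placeholder) paths, for example:

combine_features("raw_features.csv", "connect_features.csv",
                 "time_features.csv", "feature_set.csv")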

Beware: the above code assumes that:

  • all the rows in all the input csv files are correct (no row with a different number of fields than the header of the same file)
  • all the csv files have the same length

If those assumptions might not hold, the code should:

  • store the length of the first line of each file
  • pad each row with empty fields if it is too short, or truncate it if it is too long, so that every row has the same length
  • wait for the end of the longest file instead of the shortest one.

Not shown here, because it would add a good deal of complexity while in most use cases the assumptions are reasonable...
