Concat huge csv files using Dask
I'm trying to concatenate three csv files (8G, 4G, 6G respectively) into one csv file, and my machine has 16G of memory. Is there a way for me to concat these csv files on columns without getting a memory error?
My datasets look like:
A B C D E F G H I
1 2 3 4 5 6 7 8 9
My target is to merge them into:
A B C D E F G H I
...
My code is:
import gc
import dask.dataframe as dd

def combine_features(raw_feature_dir, connect_feature, time_feature_dir, feature_set):
    df1 = dd.read_csv(raw_feature_dir)
    df2 = dd.read_csv(connect_feature)
    # df3 = dd.read_csv(time_feature_dir)
    gc.collect()
    df4 = df1.merge(df2)
    df4.to_csv(feature_set)
I was planning to merge two files first and then merge in the next one, but it still shows a memory error.
Is there a way to merge huge csv files using Dask, or other tools? For example, to compress the csv files and then concat? Or to use a generator-like read/write handler that takes a chunk of data each time?
Thank you!
I think you don't want to use merge but concat, as stated in your question. Find below a simple example:
import pandas as pd
import dask.dataframe as dd

df1 = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3],
                                   'B': [1, 4, 3],
                                   'C': [1, 2, 5]}),
                     npartitions=10)
df2 = dd.from_pandas(pd.DataFrame({'D': [0, 2, 3],
                                   'E': [1, 9, 3],
                                   'F': [1, 6, 5]}),
                     npartitions=10)

dd.concat([df1, df2], axis=1).head(5, npartitions=2)
Output:
A B C D E F
0 1 1 1 0 1 1
1 2 4 2 2 9 6
2 3 3 5 3 3 5
A CSV is row-oriented storage, so it's not easy to append whole columns. One option, as mentioned in a comment, is to split your data into smaller chunks, add columns to each chunk, and then append that chunk to a local CSV file you are building (on disk, not in memory).
You could use the skiprows and nrows options of the pandas read_csv method to read in a specific range of indices from your 3 files, combine them into one dataframe in memory (representing a chunk of your desired CSV), and then append that to the CSV you are building on disk.
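A minimal sketch of that chunked approach (the file names, column names, and chunk size are made up for the demo; pandas' chunksize option plays the role of tracking skiprows/nrows by hand):

```python
import pandas as pd

# Tiny demo inputs standing in for the real 8G/4G/6G files.
files = ["part1.csv", "part2.csv", "part3.csv"]
pd.DataFrame({"A": range(5), "B": range(5)}).to_csv(files[0], index=False)
pd.DataFrame({"C": range(5), "D": range(5)}).to_csv(files[1], index=False)
pd.DataFrame({"E": range(5), "F": range(5)}).to_csv(files[2], index=False)

out_path = "feature_set.csv"
chunksize = 2  # use something like 100_000 for multi-gigabyte files

# One chunked reader per file; each yields DataFrames of `chunksize` rows.
readers = [pd.read_csv(f, chunksize=chunksize) for f in files]

first = True
for chunks in zip(*readers):
    # Reset indices so the column-wise concat aligns rows positionally.
    chunks = [c.reset_index(drop=True) for c in chunks]
    combined = pd.concat(chunks, axis=1)
    # Write the header only once, then append each following chunk on disk.
    combined.to_csv(out_path, mode="w" if first else "a",
                    header=first, index=False)
    first = False
```

Only one chunk from each file is ever held in memory at a time; the growing output lives on disk.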
Another option is to use a different storage format that might allow appending columns more efficiently. Dask seems to have a few options.
Dask also has a single_file option for its to_csv method, but I don't think it will help in your case since you need to append columns.
I will assume that you have standard csv files. The least memory-consuming way is to use only the csv module. That way you will process one line at a time:
import csv

def combine_features(raw_feature_dir, connect_feature, time_feature_dir, feature_set):
    with open(raw_feature_dir) as fd1, open(connect_feature) as fd2, \
         open(time_feature_dir) as fd3, open(feature_set, "w", newline="") as fdout:
        readers = [csv.reader(fd) for fd in (fd1, fd2, fd3)]
        writer = csv.writer(fdout)
        try:
            while True:
                # Take the next row from each file and join them side by side.
                row = [field for reader in readers for field in next(reader)]
                writer.writerow(row)
        except StopIteration:
            # The shortest file is exhausted; stop writing.
            pass
Beware: the above code assumes that all three files have the same number of rows, and that row i of each file describes the same record (the files are in the same order). If those assumptions may be wrong, the code should verify the row counts and align rows on a key instead of by position. That is not shown here, because it would add a good deal of complexity, while in most use cases the assumptions are reasonable...
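As a self-contained variant of the same row-at-a-time idea (demo file names invented), zip can replace the explicit StopIteration handling, stopping automatically at the shortest file:

```python
import csv

# Tiny demo files standing in for the real multi-gigabyte inputs.
with open("cols_abc.csv", "w", newline="") as f:
    csv.writer(f).writerows([["A", "B", "C"], ["1", "2", "3"], ["4", "5", "6"]])
with open("cols_def.csv", "w", newline="") as f:
    csv.writer(f).writerows([["D", "E", "F"], ["7", "8", "9"], ["10", "11", "12"]])

# Stream one row from each file at a time and write the joined row out.
with open("cols_abc.csv", newline="") as f1, \
     open("cols_def.csv", newline="") as f2, \
     open("combined_rows.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    for r1, r2 in zip(csv.reader(f1), csv.reader(f2)):
        writer.writerow(r1 + r2)
```

Memory use is bounded by a single line per input file, regardless of how large the inputs are.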