
How to efficiently concat csv files in dask horizontally, then vertically?

Given 3 csv files with the same number of rows, like these:

fx.csv:

7.23,4.41,0.17453,0.12
6.63,3.21,0.3453,0.32
2.27,2.21,0.3953,0.83

f0.csv:

1.23,3.21,0.123,0.12
8.23,9.21,0.183,0.32
7.23,6.21,0.123,0.12

and f1.csv:

6.23,3.21,0.153,0.123
2.23,2.26,0.182,0.22
9.23,9.21,0.183,0.135

The f0.csv and f1.csv files come with corresponding labels of 0s and 1s respectively.

The goal is to read the concatenated values into a dask.DataFrame, such that we get:

  1. fx.csv concatenated horizontally with f0.csv and 0s
  2. fx.csv concatenated horizontally with f1.csv and 1s
  3. (1) and (2) concatenated vertically, as sketched below
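
For the 3-row example above, the combined result should therefore look like this (each fx row, followed by the matching f0/f1 row, followed by the label):

7.23,4.41,0.17453,0.12,1.23,3.21,0.123,0.12,0
6.63,3.21,0.3453,0.32,8.23,9.21,0.183,0.32,0
2.27,2.21,0.3953,0.83,7.23,6.21,0.123,0.12,0
7.23,4.41,0.17453,0.12,6.23,3.21,0.153,0.123,1
6.63,3.21,0.3453,0.32,2.23,2.26,0.182,0.22,1
2.27,2.21,0.3953,0.83,9.23,9.21,0.183,0.135,1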

I have tried the following to read them into dask and save the result into an HDF store:

import dask.dataframe as dd
import dask.array as da
import numpy as np

fx = dd.read_csv('fx.csv', header=None)
f0 = dd.read_csv('f0.csv', header=None)
f1 = dd.read_csv('f1.csv', header=None)

# label columns: 0s for f0 and 1s for f1
l0 = dd.from_array(np.array([0] * len(fx)))
l1 = dd.from_array(np.array([1] * len(fx)))

da.to_npy_stack('data/',
  da.concatenate( [
    dd.concat([fx.compute(), f0.compute(), l0.compute()], axis=1),
    dd.concat([fx.compute(), f1.compute(), l1.compute()], axis=1)
    ], axis=0, allow_unknown_chunksizes=True),
  axis=0)

I can also do these operations in unix before reading the result into dask, like this:

# Create the label files.
$ wc -l fx.csv
3 fx.csv

$ seq 3 | sed "c 0" > l0.csv
$ seq 3 | sed "c 1" > l1.csv

# Concat horizontally
$ paste fx.csv f0.csv l0.csv -d"," > x0.csv
$ paste fx.csv f1.csv l1.csv -d"," > x1.csv

$ cat x0.csv x1.csv > data.csv

The actual dataset has 256 columns in each f*.csv file and 22,000,000 rows, so running the dask Python code above is not easy.

My questions (in parts) are:

  1. Is the dask method in the Python code the easiest/most memory-efficient way to read the data and output it into an hdf5 store?

  2. Is there any other method that is more efficient than the unix approach described above?

The code below is a modified version of your snippet.

When reading csv, the allocation of rows across partitions is based on a chunk size, so basic concat operations are not guaranteed to work out of the box because the partitions might not be aligned. To resolve this, index the data.
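
As a quick check (not part of the original answer), you can compare the partition layout of two dataframes before concatenating; freshly read csv dataframes typically have unknown divisions, which show up as None:

import dask.dataframe as dd

fx = dd.read_csv('fx.csv', header=None)
f0 = dd.read_csv('f0.csv', header=None)

# if the divisions are unknown (all None) or differ, a positional concat is not safe
print(fx.npartitions, f0.npartitions)
print(fx.divisions)
print(f0.divisions)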

Next, creating the columns of 0s/1s can be done using the .assign method (it works the same as in pandas). Before saving the array, you might also want to rechunk as described in this answer, but that's optional.

import dask.dataframe as dd
import dask.array as da

def _index_ddf(df):
    """Generate a unique row-based index. See also https://stackoverflow.com/a/65839787/10693596"""
    df['new_index'] = 1
    df['new_index'] = df['new_index'].cumsum()
    df = df.set_index('new_index', sorted=True)
    return df

fx = dd.read_csv('fx.csv', header=None)
fx = _index_ddf(fx)

f0 = dd.read_csv('f0.csv', header=None)
f0 = _index_ddf(f0)

f1 = dd.read_csv('f1.csv', header=None)
f1 = _index_ddf(f1)

# columns of 0/1 can be created by assignment
A1 = dd.concat([fx, f0], axis=1).assign(zeros=0).to_dask_array(lengths=True)
A2 = dd.concat([fx, f1], axis=1).assign(ones=1).to_dask_array(lengths=True)

# stack
A = da.concatenate([A1, A2], axis=0)

# save
da.to_npy_stack('data/', A, axis=0)

# optional: to have even-sized chunks, you can rechunk the data, see https://stackoverflow.com/a/73218995/10693596
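
For reference, here is a minimal sketch of that optional rechunk and of loading the saved stack back, assuming the data/ directory written above (the row chunk size of 1_000_000 is only an illustrative choice):

import dask.array as da

# optional: even out the chunks before saving (the chunk size here is an assumption)
A = A.rechunk((1_000_000, A.shape[1]))
da.to_npy_stack('data/', A, axis=0)

# later, the stack can be loaded back lazily
A_loaded = da.from_npy_stack('data/')
print(A_loaded.shape, A_loaded.chunks)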

You can read the files line by line and build new.csv from them, instead of loading all of the data into your RAM at once. The code below does it for you:

FILE_PATHS = [
    '/home/amir/data/1.csv',
    '/home/amir/data/2.csv',
    '/home/amir/data/3.csv',
]
NEW_FILE_PATH = '/home/amir/data/new.csv'

with open(NEW_FILE_PATH, 'w') as fout:
    for file_path in FILE_PATHS:
        with open(file_path, 'r') as fin:
            for line in fin:
                fout.write(line)
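
If you also need the horizontal step from the question (pasting fx.csv with f0.csv / f1.csv and appending a label column) without loading everything into RAM, a similar line-by-line sketch could look like this (the file paths and label values are assumptions taken from the question):

# paste fx.csv with each f*.csv line by line and append a label column
PAIRS = [
    ('fx.csv', 'f0.csv', '0'),
    ('fx.csv', 'f1.csv', '1'),
]

with open('data.csv', 'w') as fout:
    for left_path, right_path, label in PAIRS:
        with open(left_path) as fleft, open(right_path) as fright:
            for left, right in zip(fleft, fright):
                fout.write(left.rstrip('\n') + ',' + right.rstrip('\n') + ',' + label + '\n')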

About your questions:

  1. As long as you read the files line by line, it is efficient no matter what language you use.
  2. You really should try pyspark. It reads, transforms, and writes data in parallel in a very clever way :) (see the sketch below)
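
A minimal pyspark sketch under stated assumptions: it assumes the paste step above has already produced horizontally pasted files xf0.csv (fx pasted with f0) and xf1.csv (fx pasted with f1) without a label column (hypothetical file names); pyspark then adds the labels, stacks the two frames, and writes the result in parallel:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("concat-csv").getOrCreate()

# read the already-pasted files (hypothetical names) and add the label columns
x0 = spark.read.csv('xf0.csv', header=False, inferSchema=True).withColumn('label', lit(0))
x1 = spark.read.csv('xf1.csv', header=False, inferSchema=True).withColumn('label', lit(1))

# vertical concat, then write out in parallel
x0.union(x1).write.mode('overwrite').csv('data_out')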
