Running bash tasks in parallel from Python to extract csv columns

Question

I have a csv file with 7,221,032 columns and 37 rows. I need to write each column to a separate file, preferably from a Python script. My attempt so far:

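import os
import numpy as np

num_features = 7221032
binary_dir = "data_binary"

# awk prints column i; the field separator also swallows optional double quotes around each comma
command_template = 'awk -F "\\"*,\\"*" \'{print $%s}\' %s/images_binary.txt > %s/feature_files/pixel_%s.vector'

batch_size = 100

batch_indexes = np.arange(1, num_features, batch_size)

# Trial run on the first four batches only (columns 1 to 400)
for batch_index in batch_indexes[1:5]:

    indexes = range(batch_index - batch_size, batch_index)

    commands = [command_template % (str(i), binary_dir, binary_dir, str(i)) for i in indexes]
    # Explicit loop: in Python 3 a bare map() is lazy and would run nothing
    for command in commands:
        os.system(command)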

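However, this is rather slow: every awk call re-reads the whole input file to extract a single column. Any suggestions on how to speed it up?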

Answer 1

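Revised solution - using Perl

Run: perl prog.pl < /path/to/images_binary.txt

Run time for 100,000 items is 10 seconds. At that rate the full data set (7,221,032 columns x 37 rows, roughly 267 million values) would take about 7 hours. I am not sure that running in parallel would help, because the bottleneck is opening and closing the files. The best option for improving performance is to reduce the number of times files are opened, for example by first rewriting the input in column order.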

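#! /usr/bin/perl

while ( my $x = <> ) {
    chomp $x ;
    my @v = split(',', $x) ;
    # $i starts at 0, so the files are pixel_0, pixel_1, ... (awk's $1 becomes pixel_0)
    foreach my $i (0..$#v) {
        # Append (>>) so each input row adds one line; ">" would truncate the file on every row.
        # Remove old output files before a re-run.
        open OF, ">>data_binary/feature_files/pixel_$i.vector" or die "cannot open pixel_$i.vector: $!" ;
        print OF $v[$i], "\n" ;
        close OF ;
    } ;
} ;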
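
The suggestion above of putting the input into column order first can also be sketched in Python, which is what the question asked for. This is only a rough illustration, not the answer author's method. The function name write_column_files is made up for the example, and it assumes the whole table fits in memory and that every row has the same number of fields. Each output file is then opened once instead of once per row.

```python
import csv
import os


def write_column_files(csv_path, out_dir):
    """Write column i of csv_path to out_dir/pixel_<i>.vector, one value per line.

    Hypothetical helper: loads the whole table into memory, which is an
    assumption that may not hold for the full 7,221,032-column file.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))  # csv.reader also strips optional double quotes
    n_columns = 0
    # zip(*rows) transposes the table: each tuple holds one column's values.
    # It stops at the shortest row, hence the equal-length assumption.
    for i, column in enumerate(zip(*rows), start=1):  # 1-based, like awk's $1
        path = os.path.join(out_dir, "pixel_%d.vector" % i)
        with open(path, "w") as out:  # each file is opened exactly once
            out.write("\n".join(column) + "\n")
        n_columns = i
    return n_columns
```

With the paths from the question, a call would look like write_column_files("data_binary/images_binary.txt", "data_binary/feature_files"). If memory is too tight, the same idea could be applied to one range of columns at a time, at the cost of re-reading the input for each range.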

Solution 1
0 2019-11-16 20:36:17
