Optimizing a paste loop

Question

I have 1000 files in /myfolder. Each file is ~8 MB and has 500K rows and 2 columns, like this:

file1.txt
Col1 Col2
a 0.1
b 0.3
c 0.2
...

file2.txt
Col1 Col2
a 0.8
b 0.9
c 0.4
...

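I need to remove the first column, Col1, from every file and paste all the files side by side. The order of the files does not matter.

I am running the code below, and it has already been running for 4 hours... Is there any way to speed it up?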

for i in /myfolder/*; do \
paste all.txt <(cut -f2 ${i}) > temp.txt; \
mv temp.txt all.txt; \
done

Expected output:

all.txt
Col2 Col2 ...
0.1 0.8 ... 
0.3 0.9 ...
0.2 0.4 ...
... ... ...

Answer 1

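I think this task would be much easier if you accessed the files in parallel. For each line of each file, you just need to chop off the first part and then print the concatenation of the results.

In Python, that would be something like: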

import glob

# Open all *.txt files in parallel
files = [open(fn, 'r') for fn in glob.glob('*.txt')]
while True:
    # Try reading one line from each file, collecting into 'allLines'
    try:
        allLines = [next(f).strip() for f in files]
    except StopIteration:
        break

    # Chop off everything up to (including) the first space for each line
    secondColumns = (l[l.find(' ') + 1:] for l in allLines)

    # Print the columns, interspersing space characters
    print(' '.join(secondColumns))

# Close all input files once the shortest one is exhausted
for f in files:
    f.close()

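Note: allLines has to be a list. Making it a generator expression does not seem to work: for some reason, the StopIteration raised by next() then never reaches the except clause.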
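For Python 3, the same lockstep idea can be sketched with zip() and contextlib.ExitStack. This is only a sketch under some assumptions: the function name paste_second_columns is made up for illustration, the columns are separated by whitespace, every line has at least two fields, and the per-process open-file limit (ulimit -n) allows all inputs to be open at once.

```python
from contextlib import ExitStack


def paste_second_columns(paths, out):
    """Write the second whitespace-separated field of each line, side by side.

    Hypothetical helper: assumes every line of every file has at least two fields.
    """
    with ExitStack() as stack:
        # Open every input at once; ExitStack closes them all on exit.
        files = [stack.enter_context(open(p)) for p in paths]
        # zip() reads one line from each file per step and stops
        # as soon as the shortest file runs out.
        for lines in zip(*files):
            out.write(' '.join(line.split()[1] for line in lines) + '\n')
```

It could be called as paste_second_columns(sorted(glob.glob('/myfolder/*')), sys.stdout), with the output redirected to all.txt. Each input is read exactly once, whereas the loop in the question rewrites the growing all.txt on every iteration.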

Answer 2

This is not a complete answer, but something along these lines might work. For example, to merge 4 files on their first column:

join -1 1 -2 1 temp1 temp2 | join - temp3|join - temp4

So you could write a script that first builds the full command covering all the files and then executes it. Hope this helps.
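As a rough sketch of such a script, the pipeline can be assembled as a string before it is run. The helper name build_join_command is hypothetical, and the sketch only builds the command; it does not execute it. Keep in mind that join expects its inputs to be sorted on the join field and keeps the key column in its output, so the first column would still have to be cut off afterwards.

```python
import shlex


def build_join_command(paths):
    """Return a shell pipeline that joins all files on their first column."""
    if len(paths) < 2:
        raise ValueError('need at least two files to join')
    # Quote each name so unusual file names survive the shell.
    quoted = [shlex.quote(p) for p in paths]
    # The first stage joins two named files on field 1 of each.
    stages = ['join -1 1 -2 1 {} {}'.format(quoted[0], quoted[1])]
    # Each later stage joins the previous output ("-") with one more file.
    stages += ['join - {}'.format(q) for q in quoted[2:]]
    return ' | '.join(stages)


print(build_join_command(['temp1', 'temp2', 'temp3', 'temp4']))
# prints: join -1 1 -2 1 temp1 temp2 | join - temp3 | join - temp4
```

With 1000 inputs this yields a pipeline of 999 join processes, so it may be worth trying on a handful of files first.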

Answer 1: 1 (accepted), 2014-03-04 09:41:50

Answer 2: 0, 2014-03-05 13:28:35