Processing large files - suggestions for Python or the command line?

Question

Given two files, one containing entries of the form:

label1 label2 name1
label1 label3 name2

and another of the form:

label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

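Suppose you want to extract from file two those lines whose first three elements appear (order matters) in some line of file one. Any suggestions on how to do this quickly?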

Given the sample data above, the output file of any such script would be:

label1 label2 name1 0.1 1000

I toyed around with Python:

inp = open('file1.txt', 'r')
look_up = [i.split() for i in inp.readlines()]
inp.close()

inp = open('file2', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if [line[0], line[1], line[2]] in look_up:
        holder.append(line)
    line = inp.readline()

However, this seems to take a while. The files are quite large.

Thanks!

Answer 1

Your Python version is rather inefficient because you are testing membership in a list rather than in a set or a dict (i.e. O(n) lookup time instead of O(1)).

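Try using a set of tuples or a set of strings instead. Tuples are the better choice, since the two files might be split on different delimiters, but I don't think you'll see a particularly large performance difference. tuple('something'.split()) is relatively fast compared to testing membership in a very long list.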
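To get a feel for the difference, here is a small sketch that times the two kinds of lookup. The keys are made up for the illustration (5000 hypothetical triples), and the absolute numbers will depend on your machine, so treat it as a rough comparison rather than a benchmark.

```python
import timeit

# Hypothetical keys, built only for this illustration.
keys = [("label%d" % i, "label%d" % (i + 1), "name%d" % i) for i in range(5000)]

as_list = [list(k) for k in keys]  # list of lists, as the question's code builds
as_set = set(keys)                 # set of tuples, as suggested here

# The last key is the worst case for the list: every entry gets compared.
probe = ("label4999", "label5000", "name4999")

list_time = timeit.timeit(lambda: list(probe) in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print("list membership: %.6f s" % list_time)
print("set membership:  %.6f s" % set_time)
```

The list lookup has to walk the list element by element, while the set lookup hashes the tuple once, which is why the gap grows with the number of keys.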

Also, there's no need to call inp.readlines(). In other words, you could just do

look_up = set(tuple(line.split()) for line in inp)

Beyond that, if you use tuple(line[:3]) rather than [line[0], line[1], line[2]], you should see a significant speedup without having to change any other part of your code.

Actually, grep and bash are pretty well suited to this... (Untested, but it should work.)

while read line
do
    grep "$line" "file2.txt"
done < "file1.txt"

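To see which one is faster, we can generate some test data (~4500 keys in file1.txt and 1000000 lines in file2.txt) and benchmark a simple Python version of roughly the same thing. (Roughly, because the lines will be printed in a different order than with the grep version.)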

with open('file1.txt', 'r') as keyfile:
    lookup = set(tuple(line.split()) for line in keyfile)

with open('file2.txt', 'r') as datafile:
    for line in datafile:
        if tuple(line.split()[:3]) in lookup:
            print(line, end='')

The Python version turns out to be about 70x faster:

jofer@cornbread:~/so> time sh so_temp149.sh > a

real    1m47.617s
user    0m51.199s
sys     0m54.391s

vs.

jofer@cornbread:~/so> time python so_temp149.py > b

real    0m1.631s
user    0m1.558s
sys     0m0.071s
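For reference, test data of roughly that size could be produced with something like the sketch below. This is not the script used for the timings above. The function name make_test_data, the label/name ranges and the 10% match rate are all assumptions picked for illustration.

```python
import random

def make_test_data(key_path, data_path, n_keys=4500, n_lines=1000000, seed=0):
    """Write n_keys key lines and n_lines data lines (hypothetical layout)."""
    rng = random.Random(seed)

    def triple():
        # Three space-separated fields, mimicking "label1 label2 name1".
        return "label%d label%d name%d" % (
            rng.randrange(1000), rng.randrange(1000), rng.randrange(1000))

    keys = [triple() for _ in range(n_keys)]

    with open(key_path, "w") as keyfile:
        for key in keys:
            keyfile.write(key + "\n")

    with open(data_path, "w") as datafile:
        for _ in range(n_lines):
            # Roughly one line in ten reuses a known key, so there are matches.
            prefix = rng.choice(keys) if rng.random() < 0.1 else triple()
            datafile.write("%s %.1f %d\n" % (prefix, rng.random(), rng.randrange(1000)))

# Example call (writes about a million lines, so it takes a moment):
# make_test_data("file1.txt", "file2.txt")
```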

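Of course, the two examples approach the problem in entirely different ways. We're really comparing two algorithms, not two implementations. For example, if file1 only contains a couple of key lines, the bash/grep solution easily wins.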

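(Does bash have a builtin container of some sort with O(1) membership lookup? I think bash 4 might have a hash table, but I don't know anything about it... It would be interesting to try implementing an algorithm similar to the Python example above in bash as well...)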

Answer 2

Hacky bash/sort/Perl solution:

$ cat > 1
label1 label2 name1
label1 label3 name2

$ cat > 2
label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

$ (cat 1; cat 2; ) | sort | perl -ne 'INIT{$pattern_re="(?:\\S+) (?:\\S+) (?:\\S+)"; $current_pattern="";} if(/^($pattern_re)$/o){$current_pattern=$1} else {if(/^($pattern_re)/o) { print if $1 eq $current_pattern} }'
label1 label2 name1 0.1 1000

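It merges the two files into one list and sorts it, so that each line from file 1 ends up directly ahead of the chunk of data lines sharing its key. A special Perl one-liner then keeps only the well-formed lines that are preceded by such a "header" line from file 1.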
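If the Perl is hard to read, the same idea can be sketched in Python. This is a rendering of the approach, not a translation of the one-liner. The name sort_merge_filter is made up, and it assumes fields are separated by exactly one space and that key lines have exactly three fields.

```python
def sort_merge_filter(key_lines, data_lines):
    """Keep data lines whose first three fields match a preceding key line."""
    merged = [line.strip() for line in list(key_lines) + list(data_lines)]
    merged = sorted(line for line in merged if line)

    current_key = None
    kept = []
    for line in merged:
        fields = line.split(" ")
        if len(fields) == 3:
            # A three-field line is a "header" coming from the key file.
            current_key = line
        elif len(fields) > 3 and " ".join(fields[:3]) == current_key:
            # A data line sorts right after its header, since the header is a prefix.
            kept.append(line)
    return kept

keys = ["label1 label2 name1", "label1 label3 name2"]
data = ["label1 label2 name1 0.1 1000", "label9 label6 name7 0.8 0.5"]
print(sort_merge_filter(keys, data))
```

Like the shell version, this returns the matches in sorted order rather than in the order they appear in file 2, and the sort makes it O(n log n) overall.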

Answer 3

You could try using the string "label1 label2 name1" as the key, rather than a triple of values.

Answer 4

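I'd use a hash to store the values from the first file. It's not very error-resilient (it assumes exactly one space between items), but you'll get the general idea...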

#!/usr/bin/env python

labels={}
with open('log') as fd:
    for line in fd:
        line=line.strip()
        labels[line]=True

with open('log2') as fd:
    for line in fd:
        if " ".join(line.split()[0:3]) in labels:
            print(line, end='')
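If the spacing in the files is not reliable, one possible variant is to normalize whitespace on both sides before comparing. The sketch below assumes that is acceptable for your data. The function name filter_by_keys and the output-file argument are hypothetical additions, not part of the code above.

```python
def filter_by_keys(key_path, data_path, out_path):
    """Copy lines of data_path whose first three fields form a key in key_path."""
    with open(key_path) as keyfile:
        # Collapse any run of whitespace to a single space; skip blank lines.
        keys = {" ".join(line.split()) for line in keyfile if line.strip()}

    kept = 0
    with open(data_path) as datafile, open(out_path, "w") as outfile:
        for line in datafile:
            if " ".join(line.split()[:3]) in keys:
                outfile.write(line)
                kept += 1
    return kept

# Example call, using the file names from the code above plus a made-up output name:
# filter_by_keys("log", "log2", "matches.txt")
```

Writing matches straight to the output file, rather than collecting them in a list first, keeps memory use down to the set of keys.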

4 answers

Answer 1
8 accepted 2011-09-06 20:57:08

Answer 2
3 2011-09-06 21:00:44

Answer 3
1 2011-09-06 20:52:34

Answer 4
1 2011-09-06 20:56:53
