Python: iterating over files quickly

Question

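I need to iterate over two files many millions of times, counting the number of occurrences of word pairs throughout the files (in order to build a contingency table for two words and calculate a Fisher's exact test score).

I am currently using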

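from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

# Read both files fully into memory, one line per tuple element
src=tuple(open('src.txt','r'))
tgt=tuple(open('tgt.txt','r'))
w1count=0
w2count=0
w1='someword'
w2='anotherword'
# Walk the two files in parallel, line by line
for x,y in izip(src,tgt):
    if w1 in x:  # substring test, so 'someword' also matches 'somewords'
        w1count+=1
    if w2 in y:
        w2count+=1
    .....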

This works reasonably well, but I would like to know whether there is a faster way to iterate over the two files.

Any help is much appreciated.

Answer 1

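I still don't quite understand what exactly you are trying to do, but here is some example code that might point you in the right direction.

We can use a dictionary or, for example, a collections.Counter to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.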

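# Python 2 code; on Python 3, use the built-in zip() and print(...)
import collections
import itertools
import re

def find_words(line):
    # Raw string so that \w is passed to the regex engine unchanged
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()       # word counts in src.txt
counts2 = collections.Counter()       # word counts in tgt.txt
counts_pairs = collections.Counter()  # (src word, tgt word) counts per aligned line

with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        # Every src word on this line paired with every tgt word on the same line
        counts_pairs.update(itertools.product(words1, words2))

# After the single pass, every lookup is an in-memory dictionary access
print counts1["someword"]
print counts2["anotherword"]
print counts_pairs["someword", "anotherword"]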
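
To connect these counters back to the contingency table mentioned in the question, here is a minimal sketch written for Python 3. It assumes the table is counted per aligned line pair (a word counts once per line however often it repeats) and that words are matched whole, which differs from the substring test `w1 in x` in the question. The names `unique_words`, `count_lines`, `contingency_table` and the sample sentences are hypothetical, chosen only for illustration.

```python
import collections
import itertools
import re


def unique_words(line):
    """Return the set of lowercased words on one line."""
    return {w.lower() for w in re.findall(r"\w+", line)}


def count_lines(src_lines, tgt_lines):
    """Single pass: count in how many lines each word and each pair appears."""
    src_counts = collections.Counter()
    tgt_counts = collections.Counter()
    pair_counts = collections.Counter()
    total = 0
    for line1, line2 in zip(src_lines, tgt_lines):
        words1 = unique_words(line1)
        words2 = unique_words(line2)
        src_counts.update(words1)
        tgt_counts.update(words2)
        pair_counts.update(itertools.product(words1, words2))
        total += 1
    return src_counts, tgt_counts, pair_counts, total


def contingency_table(counts, w1, w2):
    """Derive the 2x2 table for (w1, w2) from the precomputed counts."""
    src_counts, tgt_counts, pair_counts, total = counts
    both = pair_counts[w1, w2]
    only_w1 = src_counts[w1] - both   # w1 in src line, w2 absent from tgt line
    only_w2 = tgt_counts[w2] - both   # w2 in tgt line, w1 absent from src line
    neither = total - both - only_w1 - only_w2
    return [[both, only_w1], [only_w2, neither]]


# Hypothetical sample data standing in for src.txt and tgt.txt
sample_src = ["the cat sat", "a dog barked", "the cat ran", "birds sing"]
sample_tgt = ["le chat", "le chien", "le chat court", "les oiseaux"]

counts = count_lines(sample_src, sample_tgt)
table = contingency_table(counts, "cat", "le")
print(table)
```

Once `count_lines` has run, any number of word pairs can be looked up without touching the files again. The pair counter can grow large for long lines and big vocabularies, so this trades memory for speed.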

Answer 2

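Generally, if your data is small enough to fit into memory, your best bet is to:

Preprocess the data into memory
Iterate over the in-memory structures

If the files are large, you may be able to preprocess them into data structures (for example, a compressed form of your data) and save those to a separate file in a format such as pickle, which is much faster to load and work with, and then do your processing on that.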
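
The preprocess-once, reload-later idea can be sketched as follows in Python 3. This is only an illustration under assumptions: the preprocessed structure is taken to be a `collections.Counter` of words, and the helper names (`build_counts`, `save_counts`, `load_counts`), the sample lines, and the cache file name are all hypothetical.

```python
import collections
import os
import pickle
import re
import tempfile


def build_counts(lines):
    """Preprocess raw lines into an in-memory word counter."""
    counts = collections.Counter()
    for line in lines:
        counts.update(w.lower() for w in re.findall(r"\w+", line))
    return counts


def save_counts(counts, path):
    """Write the preprocessed structure to disk in binary pickle format."""
    with open(path, "wb") as f:
        pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_counts(path):
    """Load a previously saved structure instead of re-reading the raw text."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical sample data and a temporary cache location
sample_lines = ["the cat sat", "The dog sat"]
cache_path = os.path.join(tempfile.mkdtemp(), "src_counts.pkl")

save_counts(build_counts(sample_lines), cache_path)
restored = load_counts(cache_path)
print(restored["the"], restored["sat"])
```

Later runs would call only `load_counts`, skipping the text parsing entirely. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.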

Answer 3

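Just as an out-of-the-box idea: have you tried turning your files into Pandas data frames? That is, I assume you already build a word list from your input (by removing punctuation such as . and ,) using input.split(' ') or something similar. You could then put those lists into DataFrames, perform a word count, and then do a Cartesian join?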

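import pandas as pd

# src and tgt are assumed here to be flat lists of words, not lists of lines
# Count how often each word occurs in src
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()

# Same word count for tgt
df_2 = pd.DataFrame(tgt, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()

# A constant key on both sides turns the merge into a Cartesian join
df_1['link'] = 1
df_2['link'] = 1

result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']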

I have used something like this for basket analysis, and it works really well.
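
To show what the joined frame looks like, here is a small sketch of the same constant-key join on toy data (Python 3, requires pandas). The word lists and the helper name `word_counts` are hypothetical. Note that the result pairs every source word with every target word and carries each word's overall count; it does not by itself say how often the two words occur on the same line.

```python
import pandas as pd


def word_counts(words, word_col, count_col):
    """Return one row per distinct word with its number of occurrences."""
    df = pd.DataFrame({word_col: list(words)})
    df[count_col] = 1
    return df.groupby(word_col, as_index=False)[count_col].sum()


# Hypothetical word lists standing in for the tokenized files
src_words = "the cat sat on the mat".split()
tgt_words = "le chat le tapis".split()

df_1 = word_counts(src_words, "word_1", "count_1")
df_2 = word_counts(tgt_words, "word_2", "count_2")

# Constant 'link' key on both sides: every row of df_1 meets every row of df_2
result_df = pd.merge(df_1.assign(link=1), df_2.assign(link=1), on="link")
result_df = result_df.drop(columns="link")

print(result_df)
```

With 5 distinct source words and 3 distinct target words the result has 15 rows, so for real vocabularies the joined frame grows as the product of the two vocabulary sizes.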

Answer 1
1, accepted, 2013-10-17 11:03:24

Answer 2
0, 2013-10-17 10:02:47

Answer 3
0, 2013-10-17 10:18:40
