Python: fast iteration through file

I need to iterate through two files many million times, counting the number of appearances of word pairs throughout the files (in order to build a contingency table of two words to calculate a Fisher's Exact Test score).

I'm currently using

from itertools import izip

# read both files fully into memory as tuples of lines
src = tuple(open('src.txt', 'r'))
tgt = tuple(open('tgt.txt', 'r'))

w1count = 0
w2count = 0
w1 = 'someword'
w2 = 'anotherword'

# walk both files line by line in parallel
for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
    .....

While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.

I appreciate your help in advance.

I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.

We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.

import collections
import itertools
import re

def find_words(line):
    # yield lower-cased word tokens from one line
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

# word counts per file, plus counts of aligned (src word, tgt word) pairs
counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()

# stream both files in parallel, one line pair at a time
with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        counts_pairs.update(itertools.product(words1, words2))

print counts1["someword"]
print counts1["anotherword"]
print counts_pairs["someword", "anotherword"]
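
Since the stated goal in the question is a Fisher's Exact Test score, here is a minimal sketch of how a 2x2 contingency table could be assembled in one pass and handed to scipy.stats.fisher_exact. This assumes scipy is available and that co-occurrence is counted at the line-pair level, the same way the question's own loop does; the function and variable names are only illustrative.

# a minimal sketch, not part of the original answer: scipy is assumed to be installed
import itertools
from scipy.stats import fisher_exact

def contingency_table(src_path, tgt_path, w1, w2):
    # count line pairs where w1 occurs in the source line and/or w2 occurs
    # in the target line, using the same "w in line" test as the question
    n11 = n1 = n2 = n = 0
    with open(src_path) as f1, open(tgt_path) as f2:
        for x, y in itertools.izip(f1, f2):
            has1 = w1 in x
            has2 = w2 in y
            n1 += has1            # source line contains w1
            n2 += has2            # target line contains w2
            n11 += has1 and has2  # both at once
            n += 1                # total number of line pairs
    return [[n11, n1 - n11],
            [n2 - n11, n - n1 - n2 + n11]]

table = contingency_table("src.txt", "tgt.txt", "someword", "anotherword")
odds_ratio, p_value = fisher_exact(table)
print p_value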

In general, if your data is small enough to fit into memory, then your best bet is to:

  1. Pre-process the data into memory

  2. Iterate over the in-memory structures

If the files are large, you may be able to pre-process them into data structures (such as your zipped data) and save the result to a separate file in a format such as pickle, which is much faster to load and work with, and then process that.
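
For example, a rough sketch of that pickle idea, assuming the pre-processed structure is simply the zipped list of line pairs (the file names here are just illustrative):

import itertools
import pickle

# one-off pre-processing: zip the two files into a list of line pairs and
# dump it as a binary pickle, which is much faster to reload than re-reading
# and re-zipping the raw text on every run
with open("src.txt") as f1, open("tgt.txt") as f2:
    pairs = list(itertools.izip(f1, f2))

with open("pairs.pkl", "wb") as out:
    pickle.dump(pairs, out, pickle.HIGHEST_PROTOCOL)

# every later run just loads the ready-made structure and iterates it
with open("pairs.pkl", "rb") as inp:
    pairs = pickle.load(inp)

for x, y in pairs:
    pass  # do the counting here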

Just as an out-of-the-box thinking solution: have you tried making the files into Pandas data frames? I.e. I assume you already make a word list out of the input (by removing punctuation such as . and ,) using input.split(' ') or something similar. You can then turn those lists into DataFrames, perform a word count, and then make a cartesian join:

import pandas as pd

# src and tgt are assumed to already be flat word lists, as described above
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
# collapse to one row per distinct word with its total count
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()

df_2 = pd.DataFrame(tgt, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()

# constant key so the merge becomes a full cartesian (cross) join
df_1['link'] = 1
df_2['link'] = 1

result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']

I use stuff like this for basket analysis, works really well.
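
If it helps, pulling the marginal counts for one specific pair out of the joined frame could then look like this (a small sketch reusing the column names from the code above):

# select the row for one (word_1, word_2) combination
pair = result_df[(result_df['word_1'] == 'someword') &
                 (result_df['word_2'] == 'anotherword')]
print pair[['count_1', 'count_2']]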
