
Python: fast iteration through file

I need to iterate through two files many millions of times, counting the number of appearances of word pairs throughout the files (in order to build a contingency table for two words and calculate a Fisher's Exact Test score).

I'm currently using

from itertools import izip
src=tuple(open('src.txt','r'))
tgt=tuple(open('tgt.txt','r'))
w1count=0
w2count=0
w1='someword'
w2='anotherword'
for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
    .....

While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.

I appreciate your help in advance.

I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.

We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.

import collections
import itertools
import re

def find_words(line):
    # Yield every word in the line, lower-cased.
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()       # word counts for src.txt
counts2 = collections.Counter()       # word counts for tgt.txt
counts_pairs = collections.Counter()  # counts of (src word, tgt word) pairs

with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        # Count every combination of a word from line1 with a word from line2.
        counts_pairs.update(itertools.product(words1, words2))

print counts1["someword"]
print counts2["anotherword"]
print counts_pairs["someword", "anotherword"]
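
Once those counters are filled, you can also build the 2x2 contingency table for the Fisher's Exact Test mentioned in the question directly from them. A minimal sketch, assuming scipy is available; it counts word-token pairs (which is what counts_pairs holds), so adjust the cell definitions if you need sentence-level counts instead:

import scipy.stats

w1, w2 = "someword", "anotherword"

n11 = counts_pairs[w1, w2]                                     # pairs containing both words
n1_ = sum(c for (a, _), c in counts_pairs.items() if a == w1)  # pairs whose first word is w1
n_1 = sum(c for (_, b), c in counts_pairs.items() if b == w2)  # pairs whose second word is w2
n = sum(counts_pairs.values())                                 # total number of pairs seen

table = [[n11, n1_ - n11],
         [n_1 - n11, n - n1_ - n_1 + n11]]
odds_ratio, p_value = scipy.stats.fisher_exact(table)
print p_value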

In general, if your data is small enough to fit into memory, then your best bet is to:

  1. Pre-process data into memory

  2. Iterate from memory structures

If the files are large, you may be able to pre-process them into data structures, such as your zipped data, and save them in a format such as pickle that is much faster to load and work with; then process that in a separate step, as sketched below.
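
A rough sketch of that idea (the file name pairs.pkl and the list-of-word-lists structure are just illustrative choices, not something from the question):

import cPickle as pickle
from itertools import izip

# One-off pre-processing step: read the two files in parallel once and
# save the paired, already-split lines so later runs can skip the parsing.
with open('src.txt') as f1, open('tgt.txt') as f2:
    pairs = [(x.split(), y.split()) for x, y in izip(f1, f2)]

with open('pairs.pkl', 'wb') as out:
    pickle.dump(pairs, out, pickle.HIGHEST_PROTOCOL)

# Later, or in a separate script: load the pre-processed structure,
# which is much faster than re-reading and re-splitting the raw text.
with open('pairs.pkl', 'rb') as inp:
    pairs = pickle.load(inp)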

Just as an out-of-the-box idea: have you tried making the files into Pandas DataFrames? I.e. I assume you already build a word list out of the input (by removing punctuation such as . and ,) using input.split(' ') or something similar. You can then turn those lists into DataFrames, perform a word count and then do a Cartesian join:

import pandas as pd

# src and tgt are assumed to be word lists built from the two files.
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()   # count occurrences of each word
df_1 = df_1.reset_index()

df_2 = pd.DataFrame(tgt, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()

# A constant 'link' column gives a Cartesian join when merging on it.
df_1['link'] = 1
df_2['link'] = 1

result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']

I use stuff like this for basket analysis; it works really well.
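
Note that this join gives you every combination of a word from the first frame with a word from the second, along with their individual counts (not joint co-occurrence counts). For example, to look up one specific pair in the joined frame (column names as defined above):

pair = result_df[(result_df['word_1'] == 'someword') &
                 (result_df['word_2'] == 'anotherword')]
print pair[['count_1', 'count_2']]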
