python text file reading is slow

How can I make this Python program read a big text file faster? My code takes almost five minutes to read the text file, but I need it to be much faster. I think my algorithm is not O(n).

Some sample data (the actual data is 470K+ rows):

Aarika
Aaron
aaron
Aaronic
aaronic
Aaronical
Aaronite
Aaronitic
Aaron's-beard
Aaronsburg
Aaronson

My code:

import string
import re


WORDLIST_FILENAME = "words.txt"

def load_words():
    wordlist = []
    print("Loading word list from file...")
    with open(WORDLIST_FILENAME, 'r') as f:
        for line in f:
            wordlist = wordlist + str.split(line)
    print("  ", len(wordlist), "words loaded.")
    return wordlist

def find_words(uletters):
    wordlist = load_words()
    foundList = []

    for word in wordlist:
        wordl = list(word)
        letters = list(uletters)
        count = 0
        if len(word) == 7:
            for letter in wordl[:]:
                if letter in letters:
                    wordl.remove(letter)
                    # print("word left" + str(wordl))
                    letters.remove(letter)
                    # print(letters)
                    count = count + 1
                    # print(count)
                    if count == 7:
                        print("Matched:" + word)
                        foundList = foundList + str.split(word)
    foundList.sort()
    result = ''
    for items in foundList:
        result = result + items + ','
    print(result[:-1])


#Test cases
find_words("eabauea" "iveabdi")  # adjacent literals concatenate to "eabaueaiveabdi"
#pattern = "asa" " qlocved"
#print("letters to look for: " + pattern)
#find_words(pattern)

Read the single-column file into a list with splitlines():

def load_words():
    with open("words.txt", 'r') as f:
        wordlist = f.read().splitlines()
    return wordlist
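
This is fast because it avoids the real bottleneck in the original loop: wordlist = wordlist + str.split(line) builds a brand-new list and copies every word read so far on each iteration, so loading the file is quadratic in the number of lines rather than O(n). Any in-place append is linear; here is a minimal sketch that keeps the original line-by-line structure (the name load_words_linear is just illustrative):

def load_words_linear():
    wordlist = []
    with open("words.txt", 'r') as f:
        for line in f:
            # extend() appends in place instead of rebuilding the list,
            # so the whole loop is linear in the total number of words
            wordlist.extend(line.split())
    return wordlist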

You can benchmark it with timeit:

from timeit import timeit

# setup makes load_words visible to the timed statement;
# this assumes load_words is defined in the running script
setup = "from __main__ import load_words"

timeit('load_words()', setup=setup, number=3)
# Output: 0.1708553659846075 seconds

As for how to implement what looks like a fuzzy matching algorithm, you might try fuzzywuzzy:

# pip install fuzzywuzzy[speedup]

from fuzzywuzzy import process

wordlist = load_words()
process.extract("eabauea", wordlist, limit=10)

Output:

[('-a', 90), ('A', 90), ('A.', 90), ('a', 90), ("a'", 90),
 ('a-', 90), ('a.', 90), ('AB', 90), ('Ab', 90), ('ab', 90)]

The results are more interesting if you filter for the longer matches:

results = process.extract("eabauea", wordlist, limit=100)
[x for x in results if len(x[0]) > 4]

Output:

[('abaue', 83),
 ('Ababua', 77),
 ('Abatua', 77),
 ('Bauera', 77),
 ('baulea', 77),
 ('abattue', 71),
 ('abature', 71),
 ('ablaqueate', 71),
 ('bauleah', 71),
 ('ebauche', 71),
 ('habaera', 71),
 ('reabuse', 71),
 ('Sabaean', 71),
 ('sabaean', 71),
 ('Zabaean', 71),
 ('-acea', 68)]

But with 470K+ rows it does take a while:

# reuse the names defined above; again assumes a __main__ script
setup = "from __main__ import process, wordlist"

timeit('process.extract("eabauea", wordlist, limit=3)', setup=setup, number=3)
# Output: 384.97334043699084 seconds
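
If what find_words is really after is exact letter-multiset matching (can the word be built from the given letters?) rather than fuzzy scoring, a single linear pass with collections.Counter avoids both the quadratic loading and the repeated list.remove calls. This is a sketch of that alternative, not part of the answer above, and the name find_words_fast is illustrative:

from collections import Counter

def find_words_fast(uletters, filename="words.txt"):
    # Letters available to build words from, with multiplicity
    available = Counter(uletters)
    with open(filename, 'r') as f:
        words = f.read().split()
    found = []
    for word in words:
        # Keep 7-letter words that never need more copies of a
        # letter than uletters provides; subtracting Counters keeps
        # only positive counts, so an empty result means a match
        if len(word) == 7 and not (Counter(word) - available):
            found.append(word)
    return sorted(found)

print(','.join(find_words_fast("eabauea" "iveabdi")))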
