I am filtering a word list of approximately 2.1 million keywords for valid English words using the nltk module. The words are read from a text file, each one is checked for being a correct English word, and the good ones are written to another text file. The script works, but it is ridiculously slow: about 7 iterations per second. Is there a faster way to do this?
Here is my code:
import nltk
from nltk.corpus import words
from tqdm import tqdm

total_size = 2170503

with open('two_words.txt', 'r', encoding='utf-8') as file:
    for word in tqdm(file, total=total_size):
        word = word.strip()
        if all([w in words.words() for w in word.split()]):
            with open('good_two.txt', 'a', encoding='utf-8') as file:
                file.write(word)
                file.write('\n')
        else:
            pass
Is there a faster way of doing the same thing, e.g. by using WordNet? Any other suggestions are welcome.
from nltk.corpus import words
import time

# Test text
text = "she sell sea shell by the seashore"

# Original method: every `in` check scans the list returned by words.words()
start = time.time()
x = all([w in words.words() for w in text.split()])
print("Duration Original Method: ", time.time() - start)

# Time to convert the word list to a set (done once)
start = time.time()
set_words = set(words.words())
print("Time to generate set: ", time.time() - start)

# Test using the set (single iteration)
start = time.time()
x = all([w in set_words for w in text.split()])
print("Set using 1 iteration: ", time.time() - start)

# Test using the set (100,000 iterations)
start = time.time()
for k in range(100000):
    x = all([w in set_words for w in text.split()])
print("Set using 100, 000 iterations: ", time.time() - start)
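The results show that using a set is roughly 200,000 times faster per check: one check against the list takes 0.601 seconds, while 100,000 checks against the set take 0.304 seconds in total. This is because words.words() has 236,736 elements, so n ≈ 236,736, and using a set reduces the time per lookup from O(n) to O(1) on average.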
Duration Original Method: 0.601 seconds
Time to generate set: 0.131 seconds
Set using 1 iteration: 0.0 seconds
Set using 100, 000 iterations: 0.304 seconds
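Applied to the original file-filtering task, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the helper names (`load_nltk_vocabulary`, `filter_valid_phrases`, `filter_file`) are hypothetical, blank lines are skipped, and the output file is opened once in write mode rather than reopened in append mode for every match.

```python
from typing import Iterable, Iterator, Set


def load_nltk_vocabulary() -> Set[str]:
    """Build the lookup set once. Requires nltk and its 'words' corpus."""
    # Imported lazily so the rest of the sketch runs without nltk installed.
    from nltk.corpus import words
    return set(words.words())


def filter_valid_phrases(lines: Iterable[str], vocabulary: Set[str]) -> Iterator[str]:
    """Yield each stripped line whose whitespace-separated tokens are all in vocabulary."""
    for line in lines:
        phrase = line.strip()
        tokens = phrase.split()
        # Assumption: blank lines are skipped (the original loop would write them out).
        if tokens and all(token in vocabulary for token in tokens):
            yield phrase


def filter_file(src_path: str, dst_path: str, vocabulary: Set[str]) -> int:
    """Write the valid phrases from src_path to dst_path; return how many were written."""
    written = 0
    # Open the output once instead of reopening it for every matching line.
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for phrase in filter_valid_phrases(src, vocabulary):
            dst.write(phrase + "\n")
            written += 1
    return written
```

With the file names from the question, a call could be `filter_file('two_words.txt', 'good_two.txt', load_nltk_vocabulary())`. Note that write mode overwrites an existing good_two.txt, whereas the original code appended to it.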