
Is there a faster way to check a word list against nltk in Python?

I am checking a word list of approximately 2.1 million keywords against the nltk module to find valid English words. The words are read from a text file, each one is checked for being a correct English word, and the good ones are written to another text file. The script works, but it is ridiculously slow, at roughly 7 iterations per second. Is there a faster way to do this?

Here is my code:

import nltk
from nltk.corpus import words
from tqdm import tqdm

total_size = 2170503
with open('two_words.txt', 'r', encoding='utf-8') as infile:
    for word in tqdm(infile, total=total_size):
        word = word.strip()
        if all([w in words.words() for w in word.split()]):
            with open('good_two.txt', 'a', encoding='utf-8') as outfile:
                outfile.write(word)
                outfile.write('\n')

Is there a faster way of doing the same, e.g. by using WordNet, or any other suggestion?

You can make it much faster by converting words.words() to a set, as the following test shows.

from nltk.corpus import words
import time

# Test text
text = "she sell sea shell by the seashore"

# Original method: each membership test scans the full list from words.words()
start = time.time()
x = all([w in words.words() for w in text.split()])
print("Duration Original Method: ", time.time() - start)

# Time to convert the word list to a set (a one-off cost)
start = time.time()
set_words = set(words.words())
print("Time to generate set: ", time.time() - start)

# Test using the set (single iteration)
start = time.time()
x = all([w in set_words for w in text.split()])
print("Set using 1 iteration: ", time.time() - start)

# Test using the set (100,000 iterations)
start = time.time()
for k in range(100000):
    x = all([w in set_words for w in text.split()])
print("Set using 100,000 iterations: ", time.time() - start)

The results show that using a set is roughly 200,000× faster: the original method takes 0.601 s for a single check, while the set version averages 0.304 s / 100,000 ≈ 3 µs per check, and 0.601 / 0.000003 ≈ 200,000. This is because words.words() returns a list of 236,736 elements, so each membership test is an O(n) scan with n ≈ 236,736, whereas a set lookup is O(1) on average.

Duration Original Method:  0.601 seconds
Time to generate set:  0.131 seconds
Set using 1 iteration:  0.0 seconds
Set using 100,000 iterations:  0.304 seconds
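
Applied to the original script, here is a minimal sketch that builds the set once, outside the loop, and opens the output file once in write mode rather than re-opening it in append mode for every match. File names and the total count are taken from the question, and it assumes the words corpus has already been fetched with nltk.download('words'):

from nltk.corpus import words
from tqdm import tqdm

# Build the lookup set once, outside the loop (a one-off cost of ~0.13 s)
set_words = set(words.words())

total_size = 2170503
with open('two_words.txt', 'r', encoding='utf-8') as infile, \
        open('good_two.txt', 'w', encoding='utf-8') as outfile:
    for line in tqdm(infile, total=total_size):
        word = line.strip()
        # Average O(1) membership test per token
        if all(w in set_words for w in word.split()):
            outfile.write(word + '\n')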
  1. I would try parallelising the work, since the algorithm currently runs in a single thread. Be aware that several streams writing to the same file can be a problem, so let each worker write to its own file and merge those files once you have all the words you need (see the sketch after this list).
  2. Another problem is that interpreted Python is slow. If you need the solution to be faster, consider a language that is not executed by an interpreter (C/C++ is a good choice, for example), or execute just this piece of code in another language from Python and then continue with Python.
  3. Writing the data to a binary file could also be faster if you do not strictly require a .txt output file.
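
On suggestion 1: CPython threads share the GIL, so for CPU-bound filtering a process pool is usually more effective than threading. Below is a minimal sketch using multiprocessing rather than threads, with the question's file names; the chunk size and worker count are arbitrary, and each worker returns its results instead of writing its own file, which sidesteps the concurrent-write problem and the merge step:

from multiprocessing import Pool
from nltk.corpus import words

# Each worker process builds its own lookup set once, via the pool initializer
set_words = None

def init_worker():
    global set_words
    set_words = set(words.words())

def filter_chunk(chunk):
    # Keep only the lines whose tokens are all valid English words
    return [w for w in chunk if all(t in set_words for t in w.split())]

if __name__ == '__main__':
    with open('two_words.txt', 'r', encoding='utf-8') as f:
        lines = [line.strip() for line in f]

    # Split the word list into chunks, one task per chunk
    chunk_size = 100_000
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    with Pool(processes=4, initializer=init_worker) as pool:
        results = pool.map(filter_chunk, chunks)

    # Merge the per-chunk results into a single output file
    with open('good_two.txt', 'w', encoding='utf-8') as out:
        for chunk in results:
            if chunk:
                out.write('\n'.join(chunk) + '\n')

That said, once the set lookup is in place each check costs only a few microseconds, so the job is likely dominated by file I/O and a process pool may not buy much; it is worth measuring first.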
