
Is there an in-built method in nltk to find words/phrases that closely match the given word?

The speech recognition software that I'm using gives less than optimal results.

E.g., "session" is returned as "fashion" or "mission".

Right now I have a dictionary like:

matches = {
  'session': ['fashion', 'mission'],
  ...
}

and I am looping over all the words to find a match.

I do not mind false positives, as the application accepts only a limited set of keywords. However, it is tedious to manually enter new misheard words for each keyword. Also, the speech recognizer comes up with new words every time I speak.

I am also running into difficulties where a long word is returned as a group of smaller words, so the above approach won't work.

So, is there an in-built method in nltk to do this? Or even a better algorithm that I could write myself?

You may want to look into python-Levenshtein. It's a Python C extension module for calculating string distances and similarities.

Something like this silly inefficient code might work:

from Levenshtein import jaro_winkler  # from python-Levenshtein; the import name may differ by version

heard_word = "brain"
possible_words = ["watermelon", "brian"]

# Score every candidate against the heard word (higher means more similar)
word_scores = [jaro_winkler(heard_word, possible) for possible in possible_words]
# Pick the candidate with the highest score
guessed_word = possible_words[word_scores.index(max(word_scores))]

print('I heard {0} and guessed {1}'.format(heard_word, guessed_word))

Here's the documentation and an unmaintained repo.
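Since the question also asks about an algorithm you could write yourself, here is a minimal pure-Python sketch of the Levenshtein (edit) distance, used to pick the closest keyword. It is not the python-Levenshtein implementation and will be much slower than the C extension; the names `edit_distance` and `closest_keyword` and the keyword list are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_keyword(heard_word, keywords):
    """Return the keyword with the smallest edit distance to heard_word."""
    return min(keywords, key=lambda kw: edit_distance(heard_word, kw))


keywords = ["session", "watermelon", "brian"]  # hypothetical keyword list
for heard in ["fashion", "mission", "brain"]:
    print('I heard {0} and guessed {1}'.format(heard, closest_keyword(heard, keywords)))
```

Because the application accepts only a limited set of keywords, always returning the nearest keyword is probably acceptable; if unrelated words must be rejected, you could additionally require the distance to stay below some threshold.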

You can use fuzzywuzzy, a Python package for fuzzy matching of words and strings.

To install the package:

pip install fuzzywuzzy

Sample code related to your question:

from fuzzywuzzy import fuzz

MIN_MATCH_SCORE = 80  # fuzz.ratio returns a score from 0 to 100

heard_word = "brain"

possible_words = ["watermelon", "brian"]

# Keep every candidate whose similarity to the heard word meets the threshold
guessed_words = [word for word in possible_words if fuzz.ratio(heard_word, word) >= MIN_MATCH_SCORE]

print('I heard {0} and guessed {1}'.format(heard_word, guessed_words))

Here are the documentation and repo for fuzzywuzzy.
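For the case mentioned in the question where a long word comes back as a group of smaller words, one possible approach is to join adjacent heard words and score the joined string against each keyword. The sketch below uses `difflib.SequenceMatcher` from the standard library in place of fuzzywuzzy so that it runs without extra installs; `match_phrase`, `max_join`, `cutoff`, and the keyword list are hypothetical names and values chosen for illustration, and the cutoff would need tuning against real recognizer output.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between two strings as a float in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def match_phrase(heard_phrase, keywords, max_join=3, cutoff=0.6):
    """Map a heard phrase to keywords, letting up to max_join adjacent
    heard words be joined into a single candidate word."""
    tokens = heard_phrase.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        best_score, best_keyword, best_size = 0.0, None, 1
        # Try joining 1, 2, ... max_join tokens starting at position i
        for size in range(1, min(max_join, len(tokens) - i) + 1):
            candidate = "".join(tokens[i:i + size])
            for keyword in keywords:
                score = similarity(candidate, keyword)
                if score > best_score:
                    best_score, best_keyword, best_size = score, keyword, size
        if best_score >= cutoff:
            found.append(best_keyword)
            i += best_size  # skip the tokens that were consumed
        else:
            i += 1  # nothing close enough; drop this token
    return found


keywords = ["session", "watermelon"]  # hypothetical keyword list
print(match_phrase("water melon mission", keywords))
```

Scoring every join length and keeping the best one, rather than stopping at the first join that clears the cutoff, keeps a keyword from swallowing the words that follow it.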


 