I'm using the symspellpy module in Python for query correction. It's really useful and fast, but I'm having an issue with it.
Is there a way to force SymSpell to return more than one suggestion for a correction? I need this so my application can analyse the candidates and choose the best one.
I'm calling Symspell like this:
suggestions = sym_spell.lookup(query, VERBOSITY_ALL, max_edit_distance=3)
Example of what I'm trying to do:
query = "resende"
- What I want returned: ["resende", "rezende"]
- What the method returns: ["resende"]
Note that both "resende" and "rezende" are in my dictionary.
Merely a typo. Change
VERBOSITY_ALL
in your lookup call to
Verbosity.ALL
(imported with from symspellpy import Verbosity). The three Verbosity options are CLOSEST, TOP and ALL.
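To make the three modes concrete, here is a pure-Python sketch paraphrasing the documented SymSpell semantics (illustrative only; real lookups return SuggestItem objects, and the candidate tuples and counts below are made up for the question's example):

```python
# Candidates as (term, edit_distance, corpus_count) tuples.
candidates = [("resende", 0, 100), ("rezende", 1, 80), ("resend", 2, 5)]

def verbosity_all(cands, max_dist):
    """ALL: every candidate within max_edit_distance."""
    return [c for c in cands if c[1] <= max_dist]

def verbosity_closest(cands, max_dist):
    """CLOSEST: only the candidates at the smallest distance found."""
    within = verbosity_all(cands, max_dist)
    best = min(d for _, d, _ in within)
    return [c for c in within if c[1] == best]

def verbosity_top(cands, max_dist):
    """TOP: a single suggestion -- smallest distance, ties broken by count."""
    return min(verbosity_all(cands, max_dist), key=lambda c: (c[1], -c[2]))

print([t for t, _, _ in verbosity_all(candidates, 2)])      # ['resende', 'rezende', 'resend']
print([t for t, _, _ in verbosity_closest(candidates, 2)])  # ['resende']
print(verbosity_top(candidates, 2)[0])                      # resende
```

This is why ALL is what the question needs: CLOSEST and TOP both stop at the exact match "resende" (distance 0), while ALL also surfaces "rezende".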
A couple of other things in SymSpell, described in the symspellpy documentation:
Supported edit distance algorithm choices:
LEVENSHTEIN = 0       Levenshtein algorithm
DAMERAU_OSA = 1       Damerau optimal string alignment algorithm (default)
LEVENSHTEIN_FAST = 2  Fast Levenshtein algorithm
DAMERAU_OSA_FAST = 3  Fast Damerau optimal string alignment algorithm
Whichever algorithm is used, the lowest edit distance wins (fewest changes needed); when distances are tied (visible when using Verbosity.ALL), the suggestion with the higher count/frequency ranks first.
To change from the default, override it with one of them (note the leading underscore: this is an internal attribute):
from symspellpy.editdistance import DistanceAlgorithm
sym_spell._distance_algorithm = DistanceAlgorithm.LEVENSHTEIN
word = 'something'
matches = sym_spell.lookup(word, Verbosity.ALL, max_edit_distance=2)
for match in matches:  # each match has .term, .distance and .count
    print(f'{word} -> {match.term} {match.distance} {match.count}')
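Why the algorithm choice matters: Damerau-OSA counts an adjacent transposition (a very common typo) as one edit, while plain Levenshtein needs two. A minimal pure-Python sketch of both distances (illustrative only, not symspellpy's own implementation):

```python
def levenshtein(a, b):
    """Plain Levenshtein: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def damerau_osa(a, b):
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(levenshtein("resedne", "resende"))  # 2: two substitutions
print(damerau_osa("resedne", "resende"))  # 1: one adjacent transposition
```

So with max_edit_distance=1, a transposed query would still be matched under DAMERAU_OSA but missed under LEVENSHTEIN.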
SymSpell can currently (Apr 2022) only load its dictionary of known words from a file. However, a method like this can be added inside symspellpy.py to let it read from a collections.Counter (or any other mapping of word to count):

def load_Counter_dictionary(self, counts_each):
    for key, count in counts_each.items():
        self.create_dictionary_entry(key, count)

A mere quick hack that works for my purposes. You can then drop load_dictionary() in favour of something like:
sym_spell.load_Counter_dictionary(Counter(words_list))
The reason I resorted to this: a million-plus-record CSV file was already loaded into a pandas DataFrame, with a column of codes (think words), some occurring in large numbers (likely correct) along with outliers to be corrected, plus a column already holding their counts. Rather than saving the counts dict to a file (expensive) only for SymSpell to reload it, this is direct and efficient.
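Instead of editing symspellpy.py, the same helper can also be attached from outside (monkey-patched), since all it relies on is the real create_dictionary_entry(key, count) method. A runnable sketch using a minimal stand-in class so the snippet does not require symspellpy to be installed:

```python
from collections import Counter

class FakeSymSpell:
    """Minimal stand-in for symspellpy.SymSpell; the real class exposes
    create_dictionary_entry(key, count) with the same signature."""
    def __init__(self):
        self.words = {}

    def create_dictionary_entry(self, key, count):
        self.words[key] = self.words.get(key, 0) + count

# The helper from the answer, attached to the class from outside
# rather than edited into symspellpy.py.
def load_Counter_dictionary(self, counts_each):
    for key, count in counts_each.items():
        self.create_dictionary_entry(key, count)

FakeSymSpell.load_Counter_dictionary = load_Counter_dictionary

sym_spell = FakeSymSpell()
sym_spell.load_Counter_dictionary(Counter(["resende", "resende", "rezende"]))
print(sym_spell.words)  # {'resende': 2, 'rezende': 1}
```

With the real library, replace FakeSymSpell with SymSpell (or simply loop over the Counter yourself and call create_dictionary_entry directly, with no patching at all).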