
Character n-gram based spell checking

I am developing an n-gram spell checker based on the example below. The algorithmic approach is as follows:

Consider 2 strings “statistics” and “statistical”. If n is set to 2 (bi-grams are being extracted), then the similarity of the two strings is calculated as follows:

Initially, the two strings are split into bi-grams:

Statistics - st ta at ti is st ti ic cs (9 bigrams)

Statistical - st ta at ti is st ti ic ca al (10 bigrams)
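The splitting step can be sketched in Python roughly as follows. The function name `char_ngrams` and the lowercasing are assumptions made for illustration, not part of the original program.

```python
def char_ngrams(word, n=2):
    """Return every overlapping character n-gram of `word`, in order, duplicates kept."""
    word = word.lower()  # assumption: matching is case-insensitive
    # A word of length L yields L - n + 1 n-grams (none if the word is shorter than n).
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("statistics"))   # 9 bigrams
print(char_ngrams("statistical"))  # 10 bigrams
```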

Then find the unique bi-grams in each string:

Statistics - st ta at is ti ic cs (7 unique bigrams)

Statistical - st ta at ti is ic ca al (8 unique bigrams)

Next, find the unique bi-grams that are shared by both terms.

There are 6 such bi-grams: st ta at ic is ti.
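One way to get the unique and shared bi-grams is with Python sets. This is only a sketch; `unique_ngrams` is a hypothetical helper name, and lowercasing is an assumption.

```python
def unique_ngrams(word, n=2):
    """Return the set of distinct character n-grams in `word`."""
    word = word.lower()  # assumption: matching is case-insensitive
    return {word[i:i + n] for i in range(len(word) - n + 1)}


a = unique_ngrams("statistics")
b = unique_ngrams("statistical")
shared = a & b  # set intersection: n-grams present in both terms

print(len(a), len(b), sorted(shared))
# 7 8 ['at', 'ic', 'is', 'st', 'ta', 'ti']
```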

The similarity measure is calculated using the following similarity coefficient (the Sørensen-Dice coefficient):

Similarity coefficient = 2*C / (A+B)

A - number of unique n-grams in term 1.
B - number of unique n-grams in term 2.
C - number of unique n-grams appearing in both term 1 and term 2.

The above example would produce the result (2*6) / (7+8) = 0.80. The higher the similarity measure, the more relevant the word is as a correction.
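Putting the steps together, the coefficient could be computed roughly like this. The function names are hypothetical, and returning 0.0 when neither word has any n-grams is an assumption, since the formula is undefined in that case.

```python
def unique_ngrams(word, n=2):
    """Return the set of distinct character n-grams in `word`."""
    word = word.lower()  # assumption: matching is case-insensitive
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def similarity(term1, term2, n=2):
    """Similarity coefficient 2*C / (A+B) over unique character n-grams."""
    a = unique_ngrams(term1, n)
    b = unique_ngrams(term2, n)
    if not a and not b:
        return 0.0  # assumption: avoid dividing by zero for very short words
    return 2 * len(a & b) / (len(a) + len(b))


print(round(similarity("statistics", "statistical"), 2))  # (2*6) / (7+8)
```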

My sample output for the program looks like:

Enter a word: ttem
temp : 0.5
stem : 0.5
items : 0.4444444444444444
item : 0.5

How do I select the most probable candidate among them? Any suggestions would be appreciated.

Based on n-grams alone, there is no preferred correction among candidates with equal scores. Some spell checkers do offer multiple corrections. If you want to choose one, you might consider adding other selection rules, such as Levenshtein distance (the minimum number of single-character edits between the two words), or a significance score for each letter (e.g. make z worth a lot and e worth less, since z is less likely to be typed by mistake).
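As a rough sketch of the Levenshtein suggestion, the candidates can be sorted by n-gram score first and by edit distance second. The scores below are copied from the sample output in the question. The `rank` helper and the choice to use distance only as a tie-breaker are assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank(word, scores):
    """Order candidates by highest n-gram score, then by smallest edit distance."""
    return sorted(scores, key=lambda c: (-scores[c], levenshtein(word, c)))


# Scores taken from the sample output for the input "ttem".
scores = {"temp": 0.5, "stem": 0.5, "items": 0.4444444444444444, "item": 0.5}

for candidate in rank("ttem", scores):
    print(candidate, scores[candidate], levenshtein("ttem", candidate))
```

For this input, "stem" and "item" are both one edit away from "ttem", so they remain tied even after this rule. A further rule, such as the per-letter weights mentioned above, would be needed to pick a single winner.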
