
Spelling corrector for non-English characters

Having read Peter Norvig's How to Write a Spelling Corrector, I tried to make the code work for Persian. I rewrote the code like this:

import re, collections

def normalizer(word):
    word = word.replace('ي', 'ی')
    word = word.replace('ك', 'ک')
    word = word.replace('ٔ', '')
    return word

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(normalizer(open("text.txt", encoding="UTF-8").read()))

alphabet = 'ا آ ب پ ت ث ج چ ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی ء'

In Norvig's original code, NWORDS is the dictionary that records each word and its number of occurrences in the text. I tried print(NWORDS) to see if it works with Persian characters, but the result is not what I expected: it doesn't count words, it counts occurrences of individual letters.
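
To make the symptom concrete, here is a small reproduction using the same train() and normalizer() defined above, on a made-up sample string:

sample = 'سلام دنیا سلام'          # hypothetical sample text
NWORDS = train(normalizer(sample))
print(dict(NWORDS))
# the keys that come out are single characters such as 'س', 'ل', 'ا' and ' ' -- not words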

Does anyone have any idea where the code went wrong?

PS 'text.txt' is actually a long concatenation of Persian texts, like its equivalent in Norvig's code.

You are applying normalizer to the entire contents of the file as one string, so train() then iterates over that string character by character instead of over words.

I suspect you really want to be doing something like this:

with open('text.txt', encoding='UTF-8') as fin:
    NWORDS = train(normalizer(word) for ln in fin for word in ln.split())

I would also look into using collections.Counter: http://docs.python.org/2/library/collections.html#collections.Counter
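
A rough sketch of that approach, assuming the same text.txt and UTF-8 encoding, with the normalizer from the question repeated for completeness:

from collections import Counter

def normalizer(word):
    word = word.replace('ي', 'ی')
    word = word.replace('ك', 'ک')
    word = word.replace('ٔ', '')
    return word

# Counter counts each normalized word for you, no defaultdict needed
with open('text.txt', encoding='UTF-8') as fin:
    NWORDS = Counter(normalizer(word) for ln in fin for word in ln.split())

print(NWORDS.most_common(10))  # the ten most frequent words and their counts

One difference to keep in mind: a Counter returns 0 for words it has never seen, while the defaultdict in your code defaults to 1, so any lookup in the rest of Norvig's code that relies on that default would need a small adjustment.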
