Having read Peter Norvig's How to write a spelling corrector I tried to make the code work for Persian. I rewrote the code like this:
import re, collections

def normalizer(word):
    # map Arabic yeh/kaf to their Persian forms and drop the hamza diacritic
    word = word.replace('ي', 'ی')
    word = word.replace('ك', 'ک')
    word = word.replace('ٔ', '')
    return word

def train(features):
    # Norvig's model: count occurrences, with unseen items defaulting to 1
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(normalizer(open("text.txt", encoding="UTF-8").read()))

alphabet = 'ا آ ب پ ت ث ج چ ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی ء'
In Norvig's original code, NWORDS is the dictionary that records the words and their number of occurrences in the text. I tried print(NWORDS) to see whether it works with the Persian characters, but the result is not what I expected: it doesn't count words, it counts occurrences of individual letters.
Does anyone have any idea where the code went wrong?
PS 'text.txt' is actually a long concatenation of Persian texts, like its equivalent in Norvig's code.
You are applying normalizer (and then train) to the entire file contents as one long string. Iterating over a string yields individual characters, so train ends up counting letters rather than words.
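A quick way to see the difference (the sample string here is just for illustration):

    text = "این یک متن فارسی است"
    print(list(text)[:4])   # ['ا', 'ی', 'ن', ' '] -- iterating a string gives characters
    print(text.split())     # ['این', 'یک', 'متن', 'فارسی', 'است'] -- splitting gives words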
I suspect you really want to be doing something like this:
with open('text.txt', encoding='UTF-8') as fin:
    NWORDS = train(normalizer(word) for ln in fin for word in ln.split())
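If your text also contains punctuation or Latin characters, ln.split() will leave them attached to the words. In the spirit of Norvig's words() helper, you could extract runs of Persian letters with re.findall, building the character class from the alphabet string you already have. This is only a sketch, assuming normalizer, train, and alphabet from your question are in scope:

    import re

    def words(text):
        # character class built from the question's alphabet string, spaces removed
        letters = alphabet.replace(' ', '')
        return re.findall('[' + letters + ']+', text)

    with open('text.txt', encoding='UTF-8') as fin:
        # normalize first, so Arabic-form letters are mapped to their Persian
        # forms before the character class filters them out
        NWORDS = train(words(normalizer(fin.read())))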
I would also look into using Counter: http://docs.python.org/2/library/collections.html#collections.Counter
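For example (again a sketch, assuming the normalizer from the question is in scope):

    from collections import Counter

    with open('text.txt', encoding='UTF-8') as fin:
        # Counter replaces the hand-rolled train()
        NWORDS = Counter(normalizer(word) for ln in fin for word in ln.split())

    print(NWORDS.most_common(10))  # the ten most frequent normalized words

One difference to keep in mind: Norvig's defaultdict(lambda: 1) gives unseen words a count of 1, whereas Counter reports 0 for missing keys, so you would handle that when scoring candidates, e.g. with NWORDS.get(w, 1).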