using unicode in python for Farsi

Question

I'm writing a script to read from a corpus file and find suffixes. Since there are Persian words in the corpus it is UTF-8 encoded, however when I use Persian suffixes for searching I get no results, English results on the other hand back fine.

from __future__ import unicode_literals
import nltk
import sys


for line in open("corpus.txt"):
for word in line.split():
     if word.endswith('ب'):
        print (word)

Answer 1

In Python 3, you can just pass encoding=utf-8 to open :

with open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)

In Python 2, you'll need to do something like this:

import codecs
with codecs.open("corpus.txt", encoding="utf-8") as fp:
    for line in fp:
        for word in line.split():
            process(word)

using unicode in python for Farsi

Question

1 answers

solution1
4 ACCPTED 2015-05-07 15:08:09

using unicode in python for Farsi

Question

1 answers

solution1 4 ACCPTED 2015-05-07 15:08:09

solution1
4 ACCPTED 2015-05-07 15:08:09