How to segment sentences from a text file containing Hindi Unicode characters in Python

Question

I have a collection of Hindi-language files in Unicode format. I want to perform sentence segmentation on an entire file in Python, but file.read() seems to read only a few words.

Here is the code

## -*- coding: utf-8 -*-
from nltk.tokenize import sent_tokenize
import sys,textwrap
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')
# import codecs
# text=""
# with codecs.open("input/AMCRAJ04.txt") as f1:
#     text = f1.read().replace("\n",'')

# with codecs.open('out.txt','w') as f:
#     f.write(text)

When I print the result, I get:

['\xff\xfe&\tA\t8\t1\tM\t/\t>\t \x00&\t?', '8\t>\t \x00+\t>\t\x02\t$\tK\t!', 'G\t0\t \x00+\t>\t(\t@\t$\t+\t>\t\x02\t\x17\t>\t']

I can't see the Hindi letters either on screen or in the output file. I am running the code under Cygwin on Windows.

Is there an easy way to do sentence segmentation? Should the file be read entirely into memory?

Answer 1

Hmm, the first 2 bytes, `ff fe`, suggest that the input file is encoded as UTF-16 little endian and not UTF-8. A UTF-16 (or UCS-2) encoded text file often begins with a byte order mark (BOM), which is the Unicode character U+FEFF; in little-endian byte order it is stored as the bytes FF FE.

You should read your file using the `utf_16` or `utf_16_le` encoding.
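To make the difference between the two codecs concrete, here is a minimal sketch. The `sample` bytes are a prefix of the output shown in the question; `read_utf16_text` is a hypothetical helper name, and the sketch assumes the whole file really is UTF-16 with a BOM.

```python
import codecs
import io

# Prefix of the bytes shown in the question: a BOM plus seven UTF-16 code units.
sample = b"\xff\xfe&\tA\t8\t1\tM\t/\t>\t"

# FF FE at the start is the little-endian UTF-16 byte order mark.
assert sample.startswith(codecs.BOM_UTF16_LE)

# "utf_16" uses the BOM to pick the byte order and then drops it.
decoded = sample.decode("utf_16")

# "utf_16_le" does not interpret the BOM, so it survives as U+FEFF.
decoded_keeping_bom = sample.decode("utf_16_le")


def read_utf16_text(path):
    """Read a whole UTF-16 file into one Unicode string (hypothetical helper)."""
    with io.open(path, "r", encoding="utf_16") as f:
        return f.read()
```

With `utf_16` the decoded string starts directly with the Devanagari text, whereas with `utf_16_le` the first character is U+FEFF and would need to be stripped by hand.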

Answer 2

Don't use the reload(sys) / sys.setdefaultencoding trick! It does more harm than good (ref).

I had some success getting Hindi out of your strings with UTF-16, but your code doesn't generate a list, so I'm unsure how you got that data. Posting code that actually reproduces your output would help. Try this instead:

import codecs

# Decode the file as UTF-16; f1.read() then returns unicode text, not raw bytes.
with codecs.open("input/AMCRAJ04.txt", encoding='utf16') as f1:
    text = f1.read().replace("\n", '')
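Neither answer covers the segmentation step itself. Once the text is decoded, one possible approach is a small rule-based splitter, sketched below. It assumes that sentences end with a danda (U+0964), a double danda (U+0965), "?" or "!"; the name `split_hindi_sentences` is hypothetical, and this is a stand-in for, not a description of, what NLTK's `sent_tokenize` does.

```python
# -*- coding: utf-8 -*-
import re

# Assumed terminators: danda (U+0964), double danda (U+0965), "?" and "!".
_TERMINATORS = u"\u0964\u0965?!"

# A sentence is a run of non-terminators followed by any trailing terminators.
_SENTENCE = re.compile(u"[^{0}]+[{0}]*".format(_TERMINATORS))


def split_hindi_sentences(text):
    """Split decoded text into sentences (hypothetical, rule-based helper)."""
    # Drop a leftover BOM, then collapse newlines and whitespace runs to one space.
    text = re.sub(u"\\s+", u" ", text.lstrip(u"\ufeff"))
    candidates = (match.strip() for match in _SENTENCE.findall(text))
    # Discard fragments that were only whitespace.
    return [sentence for sentence in candidates if sentence]
```

Collapsing newlines to a single space, rather than deleting them as in the snippet above, keeps words on adjacent lines from being joined. The sketch deliberately ignores "." because abbreviations would need extra rules.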

2 answers

solution1
1 ACCEPTED 2016-02-25 16:56:26

solution2
0 2016-02-26 10:58:15
