
How to segment sentences from a text file containing Hindi Unicode characters in Python

I have a collection of Hindi-language files in Unicode format. I want to perform sentence segmentation on an entire file in Python, but file.read() seems to read only a few words.

Here is the code:

## -*- coding: utf-8 -*-
from nltk.tokenize import sent_tokenize
import sys,textwrap
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')
# import codecs
# text=""
# with codecs.open("input/AMCRAJ04.txt") as f1:
#     text = f1.read().replace("\n",'')

# with codecs.open('out.txt','w') as f:
#     f.write(text)

When I print the result, I get:

['\xff\xfe&\tA\t8\t1\tM\t/\t>\t \x00&\t?', '8\t>\t \x00+\t>\t\x02\t$\tK\t!', 'G\t0\t \x00+\t>\t(\t@\t$\t+\t>\t\x02\t\x17\t>\t']

I can't see the Hindi letters either on screen or in the output file. I am running the code under Cygwin on Windows.

Is there an easy way to do sentence segmentation? Should the file be read entirely into memory?

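Hmm, the first 2 bytes, `\xff\xfe`, suggest that the input file is encoded as UTF-16 little endian and not UTF-8. A UTF-16 (or UCS-2) encoded text file often begins with a Byte Order Mark (BOM), which is the Unicode character U+FEFF; in little-endian byte order it is stored as the bytes FF FE. The `\t` that follows almost every character in your output is the byte 0x09, which is the high byte of the Devanagari code points (U+0900 to U+097F).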
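To check this, here is a small sketch that decodes the first string from the printed list as UTF-16. I dropped its final byte: the trailing `?` (0x3f) is only the low half of a two-byte code unit, and `sent_tokenize` probably split there because it saw a `?` followed by whitespace (`\t`). The variable names are mine.

```python
# First element of the list printed in the question, minus its last byte
# (the trailing '?' is half of a 2-byte UTF-16 code unit, so it cannot be decoded alone).
raw = b"\xff\xfe&\tA\t8\t1\tM\t/\t>\t \x00&\t"

# The "utf-16" codec reads the BOM (FF FE), selects little endian and strips the BOM.
decoded = raw.decode("utf-16")

# Every code point except the space falls in the Devanagari block (U+0900..U+097F).
code_points = [hex(ord(ch)) for ch in decoded]
print(code_points)
```

So the data is intact; it was just decoded with the wrong encoding.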

You should read your file using utf_16 or utf_16_le encoding.
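The two codec names are not quite interchangeable. A minimal sketch of the difference, using a made-up sample string rather than your file: `utf_16` consumes the BOM, while `utf_16_le` assumes the byte order and leaves the BOM in the text as a leading U+FEFF that you would have to strip yourself.

```python
import codecs

# Hypothetical sample: the word "Hindi" in Devanagari, written with escapes.
text = u"\u0939\u093f\u0928\u094d\u0926\u0940"

# Bytes as a little-endian UTF-16 file with a BOM would contain them.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

bom_stripped = raw.decode("utf-16")   # BOM is read, used to pick the byte order, and removed
bom_kept = raw.decode("utf-16-le")    # BOM stays in the result as U+FEFF

# If you do use utf_16_le, remove the leading BOM character explicitly.
cleaned = bom_kept.lstrip(u"\ufeff")
```

For files that start with a BOM, `utf_16` is usually the more convenient choice.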

Don't use the reload trick! It does more harm than good (ref).

I had some success getting Hindi out of your strings with UTF-16, but your code doesn't generate a list, so I'm unsure how you got that data. Posting code that actually reproduces your output would help. Try this instead:

import codecs
# 'utf16' honours the BOM at the start of the file and strips it while decoding
with codecs.open("input/AMCRAJ04.txt", encoding='utf16') as f1:
    text = f1.read().replace("\n", '')
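For the segmentation itself, once the text is decoded correctly: as far as I know the default `sent_tokenize` model is trained on English and may not treat the Hindi danda (।) as a sentence boundary. Here is a rough sketch that splits on the danda instead. The function names and the choice of terminators (danda, double danda, `?`, `!`) are my assumptions, not something from NLTK; add `.` to the character class if your files use it.

```python
# -*- coding: utf-8 -*-
import io
import re

# One sentence = a run of non-terminator characters followed by any terminators.
# Terminators assumed here: danda (U+0964), double danda (U+0965), '?' and '!'.
_SENTENCE = re.compile(u"[^\u0964\u0965?!]+[\u0964\u0965?!]*")


def split_sentences(text):
    """Return the sentences in `text`, stripped of surrounding whitespace."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def read_sentences(path, encoding="utf-16"):
    """Read a whole file with the given encoding and split it into sentences."""
    with io.open(path, encoding=encoding) as f:
        # Join lines with a space so words on adjacent lines are not glued together.
        text = f.read().replace(u"\n", u" ")
    return split_sentences(text)


# Made-up sample text with three sentences.
sample = u"यह पहला वाक्य है। यह दूसरा वाक्य है? हाँ!"
sentences = split_sentences(sample)
```

You would call it as `read_sentences("input/AMCRAJ04.txt")`. Reading the whole file into memory like this should be fine for ordinary text files; only very large files would need to be processed in pieces.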
