I am reposting my previously asked question, this time with the code I tried. I am working on a Python NLTK tagging program. My input file is Konkani (an Indian language) text containing several lines. I guess I need to open the input file with the right encoding. Kindly help.
Input file (kkn.txt), containing several sentences:
ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा.
आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.
Code:
import nltk
file=open('kkn.txt')
t=file.read();
s=nltk.pos_tag(nltk.word_tokenize(t))
print(s)
This gives the following error in the output:
>>>
Traceback (most recent call last):
File "G:/NLTK/inputKonkaniSentence.py", line 4, in <module>
t=file.read();
File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
>>>
This is happening because the file you're trying to read is not in CP1252, your platform's default encoding. Which encoding it actually uses is something you'll have to determine; for Devanagari text, UTF-8 is by far the most common. You have to specify the encoding when you open the file. For example:
file = open(filename, encoding="utf8")
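To illustrate, here is a minimal round-trip sketch. The sample sentence is taken from the question, and a stand-in `kkn.txt` is written here purely for demonstration:

```python
# The Devanagari text is multi-byte UTF-8; write a sample file standing in
# for the original kkn.txt (created here only for illustration).
sample = "ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."
with open('kkn.txt', 'w', encoding='utf-8') as f:
    f.write(sample)

# Without encoding="utf-8", Windows falls back to cp1252 and read()
# raises the UnicodeDecodeError shown above; with it, read() returns a str.
with open('kkn.txt', encoding='utf-8') as f:
    text = f.read()

print(text == sample)  # True
```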
Running the code as recommended:
import nltk
import re
import time
file = open('kkn.txt', encoding="utf-8")
file.read();
print (file)
n=nltk.pos_tag(nltk.word_tokenize(file))
print(n)
file.close()
The output:

<_io.TextIOWrapper name='kkn.txt' mode='r' encoding='utf-8'>
Traceback (most recent call last):
  File "G:\NLTK\try.py", line 10, in <module>
    n=nltk.pos_tag(nltk.word_tokenize(file))
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer