
Unicode Tagging of an input file in Python NLTK

I am reposting my previously asked question, this time with the code I tried. I am working on a Python NLTK tagging program.

My input file is Konkani (an Indian language) text containing several lines. I guess I need to specify the encoding of the input file when reading it. Kindly help.

My code, for input as a file of several sentences, is below.

Input file (kkn.txt):

ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा.
आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.

Code:

import nltk

file=open('kkn.txt')
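# opened without an encoding argument, so Python uses the platform default (cp1252 on this Windows setup)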
t=file.read();
s=nltk.pos_tag(nltk.word_tokenize(t))

print(s)

This gives the following error in the output:

>>> 
Traceback (most recent call last):
  File "G:/NLTK/inputKonkaniSentence.py", line 4, in <module>
    t=file.read();
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
>>> 

This is happening because the file you're trying to read is not in the CP1252 encoding. Which encoding the file actually uses is something you'll have to figure out. You have to specify that encoding when you open the file. For example:

file = open(filename, encoding="utf8")
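Since the input is Devanagari text, the file is most likely UTF-8 encoded. As a minimal sketch (assuming the file really is UTF-8 and is named kkn.txt as in the question), read the whole file into a string before tokenizing:

with open('kkn.txt', encoding="utf-8") as f:
    text = f.read()   # text is a str, which is what the NLTK tokenizers expect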

On executing the code as recommended:

import nltk
import re
import time

file = open('kkn.txt', encoding="utf-8")
file.read();
print (file)

n=nltk.pos_tag(nltk.word_tokenize(file))
print(n)

file.close()

The output:

<_io.TextIOWrapper name='kkn.txt' mode='r' encoding='utf-8'>
Traceback (most recent call last):
  File "G:\NLTK\try.py", line 10, in <module>
    n=nltk.pos_tag(nltk.word_tokenize(file))
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
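This second error is not an encoding problem: the file object itself is passed to nltk.word_tokenize(), while the string returned by file.read() is discarded, and the Punkt tokenizer expects a string. A minimal corrected sketch (assuming the file is UTF-8 encoded; note that the default nltk.pos_tag() model is trained on English, so the tags it assigns to Konkani text will not be linguistically meaningful without a Konkani-trained tagger):

import nltk

# read the whole file into a string, decoding it as UTF-8
with open('kkn.txt', encoding="utf-8") as f:
    text = f.read()

tokens = nltk.word_tokenize(text)   # tokenize the string, not the file object
tagged = nltk.pos_tag(tokens)       # default tagger is trained on English text
print(tagged)

If the tokenizer or tagger models are missing, they can be fetched with nltk.download().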
