
Unicode Tagging of an input file in Python NLTK

I am re-posting a previously asked question, this time with the code I am using for a Python NLTK tagging program.

My input file is Konkani (an Indian language) text containing multiple lines. I think I need to handle the encoding of the input file. Please help.

My code is below. The input file contains a few sentences:

inputfile - ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा.
आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.

Code -

import nltk

file=open('kkn.txt')
t=file.read();
s=nltk.pos_tag(nltk.word_tokenize(t))

print(s)

It gives an error in the output -

>>> 
Traceback (most recent call last):
  File "G:/NLTK/inputKonkaniSentence.py", line 4, in <module>
    t=file.read();
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
>>> 

This happens because the file you are trying to read is not encoded in CP1252. You have to figure out which encoding was actually used and specify it when opening the file. For example:

file = open(filename, encoding="utf8")
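
Applied to the question's snippet, a minimal sketch (assuming kkn.txt really is UTF-8 encoded; substitute the actual encoding if it is something else):

import nltk

# Open with an explicit encoding instead of the platform default (cp1252 on Windows)
file = open('kkn.txt', encoding='utf-8')
t = file.read()    # decode the whole file into a string
file.close()

s = nltk.pos_tag(nltk.word_tokenize(t))  # tokenize and tag the string
print(s)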

On executing the code as suggested -

import nltk
import re
import time

file = open('kkn.txt', encoding="utf-8")
file.read();
print (file)

n=nltk.pos_tag(nltk.word_tokenize(file))
print(n)

file.close()

Output :-

<_io.TextIOWrapper name='kkn.txt' mode='r' encoding='utf-8'>
Traceback (most recent call last):
  File "G:\NLTK\try.py", line 10, in <module>
    n=nltk.pos_tag(nltk.word_tokenize(file))
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
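
A note on this second traceback: nltk.word_tokenize expects a string, but the snippet above passes the open file object itself, and the result of file.read() is discarded. A minimal sketch of a correction to those lines (keeping the UTF-8 assumption from above):

text = file.read()                            # keep the decoded text in a variable
n = nltk.pos_tag(nltk.word_tokenize(text))    # pass the string, not the TextIOWrapper
print(n)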
