
nltk NERTagger UnicodeDecodeError in python

I am writing a program in Python 2.7.6 that uses NLTK with the Stanford named entity tagger on Windows 7 Professional to tag a text and print the result, as follows:

    import re
    from nltk.tag.stanford import NERTagger

    WORD = re.compile(r'\w+')

    st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")

    text = "title Wienfilm 1896-1976 (1976)"
    words = WORD.findall(text)
    print words

    answer = st.tag(words)
    print answer

The last print statement in the program is supposed to print a list of five tuples:

     [(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]

However, when I run the program, it gives me the following error message:

['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
  File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
    answer = st.tag(words )
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
    stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in range(128)

Note that if I remove the number '-1976' from the text string, the program tags and prints the correct answer, but whenever '-1976' is in the text, I get the above error.

In this forum, somebody suggested that I change the default encoding in NLTK's stanford.py. When I changed the default encoding in stanford.py from ascii to UTF-16 and replaced the last print statement of the above code with the following loop:

    for i, word_pos in enumerate(answer):
         word, pos = word_pos
         print i ,  word.encode('utf-16'), pos.encode('utf-16') 

I got the following incorrect output:

             0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O 
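(As an aside, the leading ÿþ here is the two-byte UTF-16 byte order mark emitted by encode('utf-16'), rendered by a Windows console using a single-byte code page. A quick check in a Python 2.7 interpreter, not part of the original question:)

    >>> u'title'.encode('utf-16')
    '\xff\xfet\x00i\x00t\x00l\x00e\x00'
    >>> # '\xff\xfe' is the UTF-16 BOM (on a little-endian machine),
    >>> # which a cp1252-style console displays as the 'ÿþ' seen above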

Does anyone have any clues on how to deal with this issue? Thanks in advance.

This worked for me: when creating the NERTagger object, specify the encoding parameter as UTF-8:

st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
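For context, here is a minimal sketch of the asker's full script with only that keyword added (the model and jar paths are the ones from the question and may need adjusting for your installation):

    import re
    from nltk.tag.stanford import NERTagger

    WORD = re.compile(r'\w+')

    # encoding='utf-8' tells NLTK which codec to use when decoding the tagger's
    # output, instead of the ascii default that raised the UnicodeDecodeError
    st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz",
                   "stanford-ner.jar",
                   encoding='utf-8')

    text = "title Wienfilm 1896-1976 (1976)"
    words = WORD.findall(text)

    print st.tag(words)
    # expected output:
    # [(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]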

Open a terminal (cmd) and type:

chcp

It should return something like:

active code page: 857

Then type:

chcp 1254

After that, add the following at the top of your .py script:

# -*- coding: cp1254 -*-

This should solve your problem. If it does not, copy the following lines and paste them at the top of your script:

# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')

I had many decoding problems before, and these methods solved them.

ASCII can decode only 2^7 = 128 characters, which is why you are getting that error, as you can see from the message ordinal not in range(128).
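A quick illustration in a Python 2.7 interpreter: the byte 0xa0 from the traceback (a non-breaking space) lies above 127, so the ascii codec rejects it, while an 8-bit codec that covers it decodes it fine:

    >>> '\xa0'.decode('ascii')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
    >>> '\xa0'.decode('latin-1')    # latin-1 covers bytes 128-255
    u'\xa0'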

And please check this website; use the arrow keys to switch between pages :-) I believe it is going to solve your problem.

At the top of your app add:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
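As a quick sanity check (not part of the original answer), you can confirm the change took effect:

    print sys.getdefaultencoding()   # should now print 'utf-8' instead of 'ascii'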

I was dealing with the same problem, and I solved it by adding the encoding options in internals.py in NLTK.

You must open internals.py, found at: %YourPythonFolder%\Lib\site-packages\nltk\internals.py

Then go to the java method and add this line after # Construct the full command string (around line 147):

cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']

This section of code should then look like:

# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']

Hope it helps.
