
nltk NERTagger UnicodeDecodeError in python

I am writing a program in Python 2.7.6 that uses NLTK with the Stanford named entity tagger on Windows 7 Professional to tag a text and print the result, as follows:

    import re
    from nltk.tag.stanford import NERTagger

    WORD = re.compile(r'\w+')

    st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")

    text = "title Wienfilm 1896-1976 (1976)"
    words = WORD.findall(text)
    print words

    answer = st.tag(words)
    print answer

The last print statement in the program is supposed to print a list of five tuples:

     [(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]

However, when I run the program, it gives me the following error message:

['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
  File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
    answer = st.tag(words )
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
    stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in range(128)

Note that if I remove the number '-1976' from the text string, the program tags and prints the correct answer, but whenever '-1976' is in the text, I get the above error.

In this forum, somebody suggested that I change the default encoding in NLTK's stanford.py. When I changed the default encoding in stanford.py from ascii to UTF-16 and replaced the last print statement of the above code with the following loop:

    for i, word_pos in enumerate(answer):
         word, pos = word_pos
         print i ,  word.encode('utf-16'), pos.encode('utf-16') 

I got the following incorrect output:

             0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O 
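(As an aside, the leading ÿþ here is the two-byte UTF-16 byte order mark emitted by encode('utf-16'), rendered by a Windows console using a single-byte code page. A quick check in a Python 2.7 interpreter, not part of the original question:)

    >>> u'title'.encode('utf-16')
    '\xff\xfet\x00i\x00t\x00l\x00e\x00'
    >>> # '\xff\xfe' is the UTF-16 BOM (on a little-endian machine),
    >>> # which a cp1252-style console displays as the 'ÿþ' seen above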

Does anyone have any clues on how to deal with this issue? Thanks in advance.

This worked for me: when creating the NERTagger object, specify the encoding parameter as UTF-8:

st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
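For context, here is a minimal sketch of the asker's full script with only that keyword added (the model and jar paths are the ones from the question and may need adjusting for your installation):

    import re
    from nltk.tag.stanford import NERTagger

    WORD = re.compile(r'\w+')

    # encoding='utf-8' tells NLTK which codec to use when decoding the tagger's
    # output, instead of the ascii default that raised the UnicodeDecodeError
    st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz",
                   "stanford-ner.jar",
                   encoding='utf-8')

    text = "title Wienfilm 1896-1976 (1976)"
    words = WORD.findall(text)

    print st.tag(words)
    # expected output:
    # [(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]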

Open a terminal (cmd) and type:

chcp

It should return something like:

active code page: 857

Then type:

chcp 1254

After that, add the following at the top of your .py script:

# -*- coding: cp1254 -*-

This should solve your problem. If it does not, copy the following lines and paste them at the top of your script:

# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')

I had many decoding problems before, and these methods solved them.

ASCII can decode only 2^7 = 128 characters, which is why you are getting that error, as you can see from the message ordinal not in range(128).
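A quick illustration in a Python 2.7 interpreter: the byte 0xa0 from the traceback (a non-breaking space) lies above 127, so the ascii codec rejects it, while an 8-bit codec that covers it decodes it fine:

    >>> '\xa0'.decode('ascii')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
    >>> '\xa0'.decode('latin-1')    # latin-1 covers bytes 128-255
    u'\xa0'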

And please check this website; use the arrow keys to switch between pages :-) I believe it is going to solve your problem.

At the top of your app add:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
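As a quick sanity check (not part of the original answer), you can confirm the change took effect:

    print sys.getdefaultencoding()   # should now print 'utf-8' instead of 'ascii'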

I was dealing with the same problem, and I solved it by adding the encoding options in internals.py in NLTK.

You must open internals.py, found at: %YourPythonFolder%\Lib\site-packages\nltk\internals.py

Then go to the java method and add this line after # Construct the full command string (around line 147):

cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']

This section of code should then look like:

# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']

Hope it helps.
