
Python NLTK word_tokenize UnicodeDecodeError

I get the error below when running the following code, which tries to read a text file and tokenize its words using nltk. Any ideas? The text file can be found here

from nltk.tokenize import word_tokenize
short_pos = open("./positive.txt","r").read()
#short_pos = short_pos.decode('utf-8').lower()
short_pos_words = word_tokenize(short_pos)

Error:

Traceback (most recent call last):
  File "sentimentAnalysis.py", line 19, in <module>
    short_pos_words = word_tokenize(short_pos)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)

Thanks for your support.

Looks like this text is encoded in Latin-1. So this works for me:

import codecs
from nltk.tokenize import word_tokenize

with codecs.open("positive.txt", "r", "latin-1") as inputfile:
    text = inputfile.read()

short_pos_words = word_tokenize(text)
print len(short_pos_words)

You can test for different encodings by, for example, opening the file in a good editor like TextWrangler (or programmatically, as sketched below). You can

1) open the file in different encodings to see which one looks right, and

2) look at the character that caused the issue. In your case, that is the character at position 4645, which happens to be an accented character from a Spanish review. It is not part of ASCII, so that codec fails; the byte (0xed, "í" in Latin-1) is also not a valid sequence on its own in UTF-8.
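
For a programmatic check, the third-party chardet package (pip install chardet) can guess a file's encoding from its raw bytes. A minimal sketch, assuming the same positive.txt is in the working directory:

import chardet

# Detection must run on the raw bytes, before any decoding.
with open("positive.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)
print guess  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

Note that chardet reports Latin-1 under its ISO name, ISO-8859-1, and the confidence value tells you how much to trust the guess.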

Your file is encoded using "latin-1".

from nltk.tokenize import word_tokenize
import codecs

with codecs.open("positive.txt", "r", "latin-1") as inputfile:
    text = inputfile.read()

short_pos_words = word_tokenize(text)
print short_pos_words
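
For reference, on Python 3 the codecs module is not needed, since the built-in open() accepts an encoding argument; a minimal equivalent, assuming Python 3 and NLTK are installed:

from nltk.tokenize import word_tokenize

# Python 3: the built-in open() decodes the file directly.
with open("positive.txt", "r", encoding="latin-1") as inputfile:
    text = inputfile.read()

short_pos_words = word_tokenize(text)
print(short_pos_words)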
