Traceback (most recent call last):
  File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/abc", line 11, in <module>
    docs2 = [[w.lower() for w in doc]for doc in docs]
  File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/abc", line 11, in <listcomp>
    docs2 = [[w.lower() for w in doc]for doc in docs]
  File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/", line 11, in <listcomp>
    docs2 = [[w.lower() for w in doc]for doc in docs]
  File "C:\Python34\lib\site-packages\nltk\corpus\reader\util.py", line 291, in iterate_from
['PROJECT', 'FINAL', 'REPORT', 'Revision', 'History', 'Date', 'Version', 'Author', 'Validated', 'by', 'Purpose', '4', '-', 'Dec', '-', '13', '0', '.', '1', 'EA', 'Initial', 'Document', '1', '/', '8', '/', '2014', '0', '.', '2', 'EA', '&', 'AHE', 'Combined', 'the', 'copy', 'for', 'both', 'MOE', 'and', 'MOA', '.', '1', '/', '8', '/', '2014', '0', '.', '3']
    tokens = self.read_block(self._stream)
  File "C:\Python34\lib\site-packages\nltk\corpus\reader\plaintext.py", line 117, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "C:\Python34\lib\site-packages\nltk\data.py", line 1095, in readline
    new_chars = self._read(readsize)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 1322, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 1352, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python34\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 50: invalid continuation byte
I am trying to preprocess text using NLTK, but I keep running into this error. Any thoughts would be helpful.
Some lines of code would be useful. However, my guess is that your files are not encoded as UTF-8, so the corpus reader should be given a different encoding, probably latin-1 (where byte 0xe9 is the character "é"):
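import nltk

# Read every file under the directory, decoding it as latin-1 instead of the default utf8
corpus = nltk.corpus.reader.PlaintextCorpusReader(
    "/path/to/files", r'.*', encoding='latin-1')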
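To see why the traceback complains about byte 0xe9 specifically, here is a minimal sketch on a made-up byte string (the sample text is an assumption, not taken from the asker's files). In UTF-8, 0xe9 announces a three-byte sequence, so when a plain ASCII byte follows it the decoder reports an invalid continuation byte; in latin-1 the same byte is simply "é".

```python
# Hypothetical sample text, encoded as latin-1 (not from the original corpus).
raw = "caf\u00e9 menu".encode("latin-1")   # b'caf\xe9 menu'

# Decoding as UTF-8 fails at the 0xe9 byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding as latin-1 succeeds and recovers the accented character.
decoded = raw.decode("latin-1")

print(utf8_error)
print(decoded)
```

Note that latin-1 maps every possible byte to a character, so a successful decode does not prove the files really are latin-1; if some characters come out wrong, a related encoding such as cp1252 may be worth trying.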
See also the related question: "UnicodeDecodeError, invalid continuation byte".
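If only some of the files are affected, it may help to find them before choosing an encoding. The following is a rough sketch using only the standard library; the helper name find_non_utf8_files is hypothetical, and "/path/to/files" is the same placeholder path as in the snippet above.

```python
from pathlib import Path


def find_non_utf8_files(root):
    """Return (path, byte offset) pairs for files under root that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Strict decoding raises on the first invalid byte sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            bad.append((path, exc.start))
    return bad


if __name__ == "__main__":
    # Placeholder directory; replace with the real corpus location.
    for path, offset in find_non_utf8_files("/path/to/files"):
        print(path, "first invalid byte at offset", offset)
```

The reported offset is counted from the start of the file, so it will not necessarily match the position in the traceback, which is relative to the chunk NLTK was decoding at the time.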