How to get rid of UnicodeDecodeError when reading raw data from PlaintextCorpusReader

Question

I am creating a corpus from a set of text files in the following way:

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

Now I want to access the words in one of the files like this:

text_bow = newcorpus.words("file_name.txt")

But I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

Several files raise this error. How can I get rid of this UnicodeDecodeError?

Answer 1

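To get rid of the decoding error, do any one of the following.

Read the corpus files as bytes and do not decode them to unicode.
Find out which encoding the files use, and use it. (The corpus documentation should tell you.) I suspect it is Latin-1.
Use Latin-1 regardless of the actual encoding. This gets rid of the exception, even though the resulting strings will be wrong if the original was not Latin-1.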
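
As a rough illustration of the three options above, the following sketch decodes a made-up sample containing the byte 0x96 from the error message. The sample text and the choice of Windows-1252 as the "real" encoding are assumptions for demonstration only; your files may use something else.

```python
# Hypothetical sample: the byte 0x96 from the error message, inside a short string.
raw = b"pages 10\x9620"

# Option 1: keep the data as bytes and never decode it.
assert isinstance(raw, bytes)

# Decoding as UTF-8 fails, because 0x96 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Option 2: decode with the encoding the file was really written in.
# Windows-1252 is only a guess here; in that encoding 0x96 is an en dash.
as_cp1252 = raw.decode("cp1252")

# Option 3: Latin-1 maps every byte to a code point, so it never raises,
# but 0x96 becomes the invisible control character U+0096.
as_latin1 = raw.decode("latin-1")

print(utf8_ok)
print(repr(as_cp1252))
print(repr(as_latin1))
```

The third option always succeeds, which is exactly why it can silently produce wrong characters.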

Answer 2

First, find out which encoding your files use. Perhaps try https://stackoverflow.com/a/16203777/610569 or ask the data source.

Then use the encoding= parameter of PlaintextCorpusReader, e.g. for latin-1:

newcorpus = PlaintextCorpusReader(corpus_root, '.*', encoding='latin-1')
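
If the data source cannot tell you the encoding, the first step can be approximated with the standard library alone. The sketch below is not the method from the linked answer; it simply tries a short list of candidate encodings per file and reports the first one that decodes cleanly. The helper names `guess_encoding` and `guess_corpus_encodings` and the candidate list are hypothetical. A clean decode does not prove the guess is right, so treat the result as a hint to check by eye.

```python
import os
import tempfile

# Candidate encodings, strictest first. This list is an assumption;
# adjust it to whatever your data source plausibly uses.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def guess_encoding(path, candidates=CANDIDATES):
    """Return the first candidate that decodes the whole file without error."""
    with open(path, "rb") as handle:
        data = handle.read()
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


def guess_corpus_encodings(corpus_root):
    """Map each file name directly under corpus_root to a guessed encoding."""
    result = {}
    for file_name in sorted(os.listdir(corpus_root)):
        path = os.path.join(corpus_root, file_name)
        if os.path.isfile(path):
            result[file_name] = guess_encoding(path)
    return result


# Small demonstration on a throwaway directory with two made-up files.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "a.txt"), "wb") as f:
        f.write("plain caf\u00e9".encode("utf-8"))
    with open(os.path.join(demo_root, "b.txt"), "wb") as f:
        f.write(b"\x96 leading dash")
    encodings = guess_corpus_encodings(demo_root)
    print(encodings)
```

The NLTK CorpusReader documentation describes encoding= as also accepting a dictionary that maps file ids to encoding names. If that holds for your NLTK version, a mapping like the one above could be passed to PlaintextCorpusReader when the files do not all share one encoding.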

From the code at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py:

class PlaintextCorpusReader(CorpusReader):
    """
    Reader for corpora that consist of plaintext documents.  Paragraphs
    are assumed to be split using blank lines.  Sentences and words can
    be tokenized using the default tokenizers, or by custom tokenizers
    specificed as parameters to the constructor.
    This corpus reader can be customized (e.g., to skip preface
    sections of specific document formats) by creating a subclass and
    overriding the ``CorpusView`` class variable.
    """

    CorpusView = StreamBackedCorpusView
    """The corpus view class used by this reader.  Subclasses of
       ``PlaintextCorpusReader`` may specify alternative corpus view
       classes (e.g., to skip the preface sections of documents.)"""

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):
