Same Python source code on two different machines yields different behavior

Question

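Two machines, both running Ubuntu 14.04.1. The same source code is run on the same data. One works fine; the other throws a "codec can't decode byte 0xe2" error. Why is this? (And more importantly, how do I fix it?)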

The offending code seems to be:

def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized=''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'

    return Text(tokenized)

OK... I went into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The machine that works was happy with the following:

>>> fh = open('in/train/legal/legal1a_lm_7.txt')
>>> foo = fh.read()
>>> fh.close()
>>> sent_tokenize(foo)

On the machine with the problem, the same call raises a UnicodeDecodeError with the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

Breaking the input file down line by line (via split('\n')) and running each line through sent_tokenize leads us to the offending line:

If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco.

Which is actually:

>>> bar[5]
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.'
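The line-by-line search described above can be sketched roughly as follows. This is only an illustration: `non_ascii_lines` is a hypothetical helper name, and the sample bytes are a shortened stand-in for the real file contents.

```python
from __future__ import print_function


def non_ascii_lines(data):
    """Return (index, line) pairs for lines of a byte string that contain bytes >= 0x80."""
    hits = []
    for index, line in enumerate(data.split(b'\n')):
        # bytearray yields integers on both Python 2 and Python 3
        if any(byte > 127 for byte in bytearray(line)):
            hits.append((index, line))
    return hits


# Shortened stand-in for the file contents (assumption: the file is read as raw bytes)
sample = (b'plain ascii line\n'
          b'Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document\n'
          b'another plain line')

for index, line in non_ascii_lines(sample):
    print(index, repr(line))
```

Only the lines that contain curly quotes or other non-ASCII bytes are reported, which narrows down where the tokenizer is likely to choke.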

Update: both machines raise a UnicodeDecodeError for:

unicode(bar[5])

But only one machine raises the error for:

sent_tokenize(bar[5])
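The `unicode(bar[5])` failure on both machines is consistent with Python 2 decoding byte strings as ASCII when no encoding is given. A minimal sketch of the same effect, written with explicit `.decode` calls so it behaves the same on Python 2 and Python 3; the UTF-8 encoding is an assumption based on the `\xe2\x80\x9c` byte sequence:

```python
from __future__ import print_function

# Byte string taken from the offending line: curly quotes encoded as UTF-8
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d)'

try:
    raw.decode('ascii')          # roughly what an implicit conversion does in Python 2
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc            # 0xe2 is outside the ASCII range

text = raw.decode('utf-8')       # decoding with the right codec succeeds

print(ascii_error)
print(repr(text))
```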

Answer 1

Different NLTK versions!

The machine that doesn't complain is running NLTK 2.0.4. The one that raises the exception is running NLTK 3.0.0.
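One way to check for a mismatch like this is to print the installed version on each machine. A small sketch: `module_version` is a hypothetical helper, and it assumes the package exposes a `__version__` attribute, which NLTK does as far as I know.

```python
from __future__ import print_function

import importlib


def module_version(name):
    """Return a module's __version__ string, or None if it is missing or not installed."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, '__version__', None)


# Run this on both machines and compare the output
print('nltk version:', module_version('nltk'))
```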

NLTK 2.0.4 is perfectly happy with:

sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)')

NLTK 3.0.0 needs unicode (as @tdelaney pointed out in the comments above). So to get results, you need:

sent_tokenize(u'(\u201cCisco\u201d)')
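In practice the text comes from a file rather than a literal, so the bytes need to be decoded before they reach the tokenizer. A hedged sketch of two ways to do that: `to_text` and `read_text` are hypothetical helper names, and UTF-8 is an assumption about the input files (it matches the `\xe2\x80\x9c` bytes shown in the question).

```python
import io


def to_text(data, encoding='utf-8'):
    """Decode byte strings to unicode text; pass unicode text through unchanged."""
    if isinstance(data, bytes):   # on Python 2, bytes is an alias for str
        return data.decode(encoding)
    return data


def read_text(path, encoding='utf-8'):
    """Read a whole file as unicode text on both Python 2 and Python 3."""
    with io.open(path, encoding=encoding) as fh:
        return fh.read()


decoded = to_text(b'(\xe2\x80\x9cCisco\xe2\x80\x9d)')

# With NLTK installed, either form can then be passed to the tokenizer:
#   sent_tokenize(decoded)
#   sent_tokenize(read_text('in/train/legal/legal1a_lm_7.txt'))
```

Decoding once at the point where the file is read keeps the rest of the code working with unicode only, which should behave the same under both NLTK versions.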

0 Accepted 2014-12-03 18:57:58
