
Same Python source code on two different machines yields different behavior

Two machines, both running Ubuntu 14.04.1. Same source code run on the same data. One works fine, one throws a codec decode 0xe2 error. Why is this? (More importantly, how do I fix it?)

Offending code appears to be:

def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'

    return Text(tokenized)
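As a rough sketch of what this method produces — substituting trivial split-based stand-ins for NLTK's sent_tokenize and word_tokenize (the real ones are far smarter), and returning a plain string rather than the Text wrapper:

```python
def sent_tokenize(text):
    # Trivial stand-in for nltk.tokenize.sent_tokenize: split on '. '
    return [s for s in text.split('. ') if s]

def word_tokenize(sentence):
    # Trivial stand-in for nltk.tokenize.word_tokenize: split on whitespace
    return sentence.split()

def tokenize(text):
    # One sentence per line, tokens separated by single spaces
    tokenized = ''
    for sentence in sent_tokenize(text):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'
    return tokenized

print(tokenize('Hello   world. Goodbye world'))
# -> 'Hello world\nGoodbye world\n'
```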

OK... I went into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The one that works was happy with the following:

>>> fh = open('in/train/legal/legal1a_lm_7.txt')
>>> foo = fh.read()
>>> fh.close()
>>> sent_tokenize(foo)

The UnicodeDecodeError on the machine with issues gives the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
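The root cause is that byte 0xe2 gets run through an implicit ASCII decode somewhere inside Punkt's string handling. The same failure can be reproduced directly by decoding the offending bytes with the ascii codec (shown here as an explicit decode, which works on Python 3 as well; on Python 2 the decode happens implicitly when a byte string meets unicode processing):

```python
# The offending bytes: UTF-8 curly quotes around "Cisco"
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d)'

try:
    raw.decode('ascii')  # roughly what the implicit conversion attempts
except UnicodeDecodeError as err:
    # 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
    print(err)
```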

Breaking the input file down line by line (via split('\n')), and running each line through sent_tokenize, leads us to the offending line:

If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco.

Which is actually:

>>> bar[5]
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.'
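Those \xe2\x80\x9c and \xe2\x80\x9d sequences are just the UTF-8 encodings of the curly quotation marks U+201C and U+201D, as decoding them confirms:

```python
# Decode the byte sequence the file actually contains
s = b'\xe2\x80\x9cCisco\xe2\x80\x9d'.decode('utf-8')

print(s)  # “Cisco”
print(hex(ord(s[0])), hex(ord(s[-1])))  # 0x201c 0x201d
```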

Update: both machines show a UnicodeDecodeError for:

unicode(bar[5])

But only one machine shows an error for:

sent_tokenize(bar[5])

Different NLTK versions! 不同的NLTK版本!

The version that doesn't barf is using NLTK 2.0.4; the version throwing an exception is 3.0.0.

NLTK 2.0.4 was perfectly happy with:

sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)')

NLTK 3.0.0 needs unicode (as pointed out by @tdelaney in the comments above). So to get results, you need:

sent_tokenize(u'(\u201cCisco\u201d)')
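The practical fix, then, is to decode the input to unicode before handing it to NLTK 3 — for example by opening the file with an explicit encoding rather than plain open(). A minimal sketch (read_text is a hypothetical helper, and UTF-8 is assumed to be the file's actual encoding):

```python
import io
import os
import tempfile

def read_text(path, encoding='utf-8'):
    # io.open with an explicit encoding returns unicode text on both
    # Python 2 and Python 3, so sent_tokenize never sees raw bytes.
    with io.open(path, encoding=encoding) as fh:
        return fh.read()

# Demo with a scratch file containing the problematic UTF-8 bytes
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'(\xe2\x80\x9cCisco\xe2\x80\x9d)')

text = read_text(tmp.name)
os.unlink(tmp.name)
print(text)  # (“Cisco”)
```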

