Same python source code on two different machines yields different behavior
Two machines, both running Ubuntu 14.04.1. Same source code run on the same data. One works fine, one throws codec decode 0xe2 error. Why is this? (More importantly, how do I fix it?)
Offending code appears to be:
def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'
    return Text(tokenized)
OK... I went into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The one that works was happy with the following:
>>> fh = open('in/train/legal/legal1a_lm_7.txt')
>>> foo = fh.read()
>>> fh.close()
>>> sent_tokenize(foo)
The UnicodeDecodeError on the machine with issues gives the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
for el in it:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
Breaking the input file down line by line (via split('\n')), and running each one through sent_tokenize, leads us to the offending line:
If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco.
Which is actually:
>>> bar[5]
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.'
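Those \xe2\x80\x9c and \xe2\x80\x9d byte sequences are simply the UTF-8 encodings of the curly quotation marks U+201C and U+201D; decoded with the right codec, the line is ordinary text:

```python
# \xe2\x80\x9c and \xe2\x80\x9d are the UTF-8 encodings of the
# curly quotes U+201C and U+201D
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d)'
text = raw.decode('utf-8')
assert text == u'(\u201cCisco\u201d)'
```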
Update: both machines show UnicodeDecodeError for:
unicode(bar[5])
But only one machine shows an error for:
sent_tokenize(bar[5])
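The unicode(bar[5]) failure is expected on any Python 2: with no codec argument, unicode() falls back to the ASCII codec, and the byte 0xe2 is outside ASCII. A small sketch reproducing that failure (using an explicit .decode('ascii'), which is what unicode() does under the hood):

```python
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d)'

# With no codec given, Python 2's unicode() tries ASCII;
# raw.decode('ascii') reproduces that failure
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

assert not ascii_ok
# Naming the codec explicitly succeeds
assert raw.decode('utf-8') == u'(\u201cCisco\u201d)'
```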
Different NLTK versions!
The version that doesn't barf is using NLTK 2.0.4; the version throwing an exception is 3.0.0.
NLTK 2.0.4 was perfectly happy with
sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)')
NLTK 3.0.0 needs unicode (as pointed out by @tdelaney in the comments above). So to get results, you need:
sent_tokenize(u'(\u201cCisco\u201d)')
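More generally, the fix is to hand NLTK 3 unicode rather than raw bytes by decoding the file at read time. A minimal sketch, assuming the corpus files are UTF-8 (a throwaway temp file stands in for the real input here):

```python
import io
import os
import tempfile

# Stand-in for the real input file: UTF-8 bytes containing curly quotes
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d) executed between you and Cisco.'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as fh:
    fh.write(raw)

# io.open decodes on read and works identically on Python 2 and 3,
# so sent_tokenize receives unicode instead of bytes
with io.open(path, encoding='utf-8') as fh:
    text = fh.read()

assert isinstance(text, type(u''))   # unicode on Py2, str on Py3
assert u'\u201cCisco\u201d' in text
```

With text read this way, sent_tokenize(text) should behave the same on both NLTK versions.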