
Why is nltk word_tokenize not working even after running nltk.download and all the packages are installed correctly?

I am using Python 3.7 (64-bit) and NLTK version 3.4.5.

When I try to convert text6 from nltk.book to tokens using word_tokenize, I get an error.

import nltk
from nltk.tokenize import word_tokenize
from nltk.book import *            # loads text1..text9 as nltk.text.Text objects
tokens = word_tokenize(text6)      # raises the TypeError shown below

The code is run in IDLE 3.7.

Below is the error when I execute the last statement.

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    tokens=word_tokenize(text6)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\__init__.py", line 144, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1277, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1331, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1331, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1321, in span_tokenize
    for sl in slices:
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1362, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 318, in _pair_iter
    prev = next(it)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1335, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object

Please help. Thanks in advance.

While doing some troubleshooting, I created a sample nltk.text.Text object and tried to tokenize it with nltk.word_tokenize. I still get the same error. Please see the screenshot below.

[screenshot: the same TypeError raised for a sample Text object]

But when calling nltk.word_tokenize() on a string, it works (a workaround sketch follows the session below).

>>> tt="Python is a programming language"
>>> tokens2 = nltk.word_tokenize(tt)  # not throwing an error
>>> type(tt)
<class 'str'>
>>> type(text6)
<class 'nltk.text.Text'>
>>> 
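Since type(text6) shows it is an nltk.text.Text, which already wraps a list of tokens, a minimal workaround sketch (assuming the goal is simply a token list) is:

>>> tokens3 = list(text6)                           # Text is already a sequence of tokens
>>> tokens4 = nltk.word_tokenize(" ".join(text6))   # or rebuild a string first (original spacing is lost)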

Check the nltk data folder, and where NLTK expects it to be located.
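A quick way to inspect this (a sketch using NLTK's data lookup; the exact paths are machine-specific):

import nltk
print(nltk.data.path)               # directories NLTK searches, in order
nltk.data.find('tokenizers/punkt')  # raises LookupError if punkt is missing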

Try using:

nltk.download('punkt')
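If the punkt models were missing, an end-to-end check might look like this (a sketch; the sample sentence is arbitrary):

import nltk
nltk.download('punkt')              # fetch the Punkt sentence tokenizer models
from nltk.tokenize import word_tokenize
print(word_tokenize("Python is a programming language"))
# ['Python', 'is', 'a', 'programming', 'language']

Note that a missing punkt model raises a LookupError rather than a TypeError; the TypeError above comes from passing an nltk.text.Text object instead of a string, so the workaround sketched earlier may still be needed.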
