Removing stop words from tokenized text using NLTK: TypeError

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import PunktSentenceTokenizer
from nltk.stem import WordNetLemmatizer
import re
import time

txt = input()

snt_tkn = sent_tokenize(txt)

wrd_tkn = [word_tokenize(s) for s in snt_tkn]

stp_wrd = set(stopwords.words("english"))

flt_snt = [w for w in wrd_tkn if not w in stp_wrd]

print(flt_snt)

Returns the following:

Traceback (most recent call last):
  File "compiler.py", line 19, in <module>
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
  File "compiler.py", line 19, in <listcomp>
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
TypeError: unhashable type: 'list'

I'd like to know, if possible, how to return the tokenized text with stop words removed without editing wrd_tkn.

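The error tells you that a list is unhashable. Lists are not hashable because they are mutable, and the test `w in stp_wrd` has to hash `w` to look it up in the set. Here each `w` is a whole sentence (a list of tokens), not a single word. You could make it hashable by converting the list to an immutable type such as a tuple. Converting to a set will not help, because sets are mutable and unhashable too. Note that this only silences the error: a tuple of tokens is never equal to a single stop word, so nothing would be filtered. The conversion is done with the constructor:

immutable_list = tuple(some_list)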
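A quick way to see which built-in containers are hashable is to call hash() on them. The following is a minimal sketch; `is_hashable`, `tokens` and `stop_words` are illustrative names and sample data, not part of the question's code.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# Hypothetical sample data standing in for one tokenized sentence
# and a small stop-word set.
tokens = ["this", "is", "a", "sentence"]
stop_words = {"this", "is", "a"}

list_ok = is_hashable(tokens)                  # lists are mutable
set_ok = is_hashable(set(tokens))              # sets are mutable too
tuple_ok = is_hashable(tuple(tokens))          # tuples of strings are hashable
frozenset_ok = is_hashable(frozenset(tokens))  # frozensets are hashable

# A tuple can be looked up without raising, but a whole sentence
# never equals a single stop word, so the lookup is False.
sentence_in_stop_words = tuple(tokens) in stop_words

print(list_ok, set_ok, tuple_ok, frozenset_ok, sentence_in_stop_words)
```

So making the sentence hashable avoids the TypeError but does not remove any stop words; the membership test has to run on each word instead.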

For future reference, the resolution is the following:

Change

flt_snt = [w for w in wrd_tkn if not w in stp_wrd]

to

flt_snt = [[w for w in s if w not in stp_wrd] for s in wrd_tkn]
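The nested comprehension can be tried without NLTK data by substituting small hand-written inputs. In this sketch the contents of `wrd_tkn` and `stp_wrd` are hypothetical stand-ins for the output of word_tokenize and stopwords.words("english"), and the lower-casing variant `flt_snt_ci` is an optional addition, not part of the original resolution. It assumes the stop-word list is lower case.

```python
# Hypothetical stand-ins so the sketch runs without NLTK data.
wrd_tkn = [["This", "is", "a", "test", "."], ["It", "works", "."]]
stp_wrd = {"this", "is", "a", "it"}

# One inner list per sentence; wrd_tkn itself is not modified.
flt_snt = [[w for w in s if w not in stp_wrd] for s in wrd_tkn]

# Optional variant: compare in lower case so capitalized tokens
# such as "This" are filtered as well.
flt_snt_ci = [[w for w in s if w.lower() not in stp_wrd] for s in wrd_tkn]

print(flt_snt)     # [['This', 'test', '.'], ['It', 'works', '.']]
print(flt_snt_ci)  # [['test', '.'], ['works', '.']]
```

The inner loop iterates over the words of one sentence, so each `w` is a string and the set lookup works. The outer loop builds a new list, which leaves wrd_tkn unchanged as the question asks.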
