Removing stop words from tokenized text using NLTK: TypeError
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import PunktSentenceTokenizer
from nltk.stem import WordNetLemmatizer
import re
import time
txt = input()
snt_tkn = sent_tokenize(txt)
wrd_tkn = [word_tokenize(s) for s in snt_tkn]
stp_wrd = set(stopwords.words("english"))
flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
print(flt_snt)
returns the following:
Traceback (most recent call last):
File "compiler.py", line 19, in
flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
File "compiler.py", line 19, in
flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
TypeError: unhashable type: 'list'
I'd like to know, if possible, how to return the tokenized text with stop words removed, without editing wrd_tkn.
The error tells you that a list is unhashable. Lists are unhashable because they are mutable. You could convert the list to an immutable, hashable type instead; note that a plain set is itself mutable and unhashable, so the hashable variant is frozenset. It can be done with the constructor function:
immutable_list = frozenset(some_list)
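To make the distinction concrete, here is a minimal sketch (with stand-in data) showing that a list cannot be hashed while a frozenset of the same words can:

```python
words = ["this", "is", "a", "test"]

# Lists are mutable, so hashing one raises TypeError.
try:
    hash(words)
except TypeError as e:
    print(e)  # unhashable type: 'list'

# A frozenset is immutable and therefore hashable.
immutable_words = frozenset(words)
print(isinstance(hash(immutable_words), int))  # True
```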
For future reference, the resolution is the following: change
flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
to
flt_snt = [[w for w in s if not w in stp_wrd] for s in wrd_tkn]
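The nested comprehension is needed because wrd_tkn is a list of sentences, each itself a list of words, so the original code was testing whole lists for membership in stp_wrd. A minimal, self-contained sketch with stand-in data (in the real script, wrd_tkn comes from word_tokenize and stp_wrd from stopwords.words("english")):

```python
# Stand-in for [word_tokenize(s) for s in sent_tokenize(txt)]:
wrd_tkn = [["this", "is", "a", "sentence"],
           ["here", "is", "another", "one"]]

# Tiny stand-in for set(stopwords.words("english")):
stp_wrd = {"is", "a", "here"}

# Filter each inner word list, preserving the sentence structure.
flt_snt = [[w for w in s if not w in stp_wrd] for s in wrd_tkn]
print(flt_snt)  # [['this', 'sentence'], ['another', 'one']]
```

This leaves wrd_tkn untouched and builds a new filtered list, which is what the question asked for.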