Removing stop words from tokenized text using NLTK: TypeError
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import PunktSentenceTokenizer
from nltk.stem import WordNetLemmatizer
import re
import time
txt = input()
snt_tkn = sent_tokenize(txt)
wrd_tkn = [word_tokenize(s) for s in snt_tkn]
stp_wrd = set(stopwords.words("english"))
flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
print(flt_snt)
This returns the following:
Traceback (most recent call last):
  File "compiler.py", line 19, in <module>
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
  File "compiler.py", line 19, in <listcomp>
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
TypeError: unhashable type: 'list'
If possible, I would like to know how to return the tokenized text with the stop words removed, without editing wrd_tkn.
The error tells you that a list is unhashable. Lists are not hashable because they are mutable, so they cannot be used where hashing is required, such as in a membership test against a set. If you need a hashable, immutable version of a list, convert it with the frozenset constructor:

immutable_list = frozenset(some_list)
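A minimal sketch of the hashability issue, in plain Python without NLTK: testing a list for membership in a set raises the same TypeError seen above, while an immutable frozenset is hashable.

```python
words = {"the", "a", "is"}

# A list cannot be hashed, so a set membership test on it fails:
try:
    ["the"] in words
except TypeError as e:
    print(e)  # unhashable type: 'list'

# A frozenset of the same elements is immutable and hashable,
# so it can itself be stored in (or looked up against) a set:
frozen = frozenset(["the", "a"])
print(frozen in {frozen})  # True
```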
For future reference, the resolution is as follows. Change

flt_snt = [w for w in wrd_tkn if not w in stp_wrd]

to
flt_snt = [[w for w in s if w not in stp_wrd] for s in wrd_tkn]
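The point is that wrd_tkn is a list of lists (one inner list of tokens per sentence), so the filter has to run per token inside each sentence. A self-contained sketch with hand-tokenized stand-in data, so it runs without NLTK downloads:

```python
# Stand-in for wrd_tkn: one inner list of word tokens per sentence.
wrd_tkn = [["this", "is", "a", "test"], ["the", "cat", "sat"]]
# Stand-in for stopwords.words("english"):
stp_wrd = {"this", "is", "a", "the"}

# Filter each sentence's tokens, preserving sentence boundaries:
flt_snt = [[w for w in s if w not in stp_wrd] for s in wrd_tkn]
print(flt_snt)  # [['test'], ['cat', 'sat']]
```

The outer comprehension walks the sentences, the inner one walks the words, so only strings (which are hashable) are ever tested against the set.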