
Removing stop words from tokenized text using NLTK: TypeError

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import PunktSentenceTokenizer
from nltk.stem import WordNetLemmatizer
import re
import time

txt = input()

snt_tkn = sent_tokenize(txt)

wrd_tkn = [word_tokenize(s) for s in snt_tkn]

stp_wrd = set(stopwords.words("english"))

flt_snt = [w for w in wrd_tkn if not w in stp_wrd]

print(flt_snt)

This returns the following traceback:

Traceback (most recent call last):
  File "compiler.py", line 19, in <module>
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
  File "compiler.py", line 19, in <listcomp>
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
TypeError: unhashable type: 'list'

I'd like to know, if possible, how to return the tokenized text with stop words removed without editing wrd_tkn.

The error is telling you that a list is unhashable. Lists are unhashable because they are mutable, and set membership tests (w in stp_wrd) need to hash their argument. Here each w is an entire tokenized sentence (a list of words), not a single word, because wrd_tkn is a list of lists. If you ever do need a hashable version of a list, you can convert it to an immutable type such as a tuple:

immutable_words = tuple(some_list)

In this case, though, the fix is to filter the words inside each sentence rather than hashing the sentence lists themselves.
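A minimal illustration of the underlying error, using assumed sample values rather than NLTK output: hashing a list raises the same TypeError seen above, while individual word strings hash fine.

```python
# Assumed sample data, standing in for the NLTK results.
stp_wrd = {"the", "a", "is"}
sentence = ["the", "cat", "sat"]  # a whole tokenized sentence (a list)

try:
    # Membership tests against a set hash their argument;
    # a list cannot be hashed, so this raises TypeError.
    sentence in stp_wrd
except TypeError as e:
    print(e)  # unhashable type: 'list'

# Individual words are strings, which are hashable, so this works:
print("the" in stp_wrd)  # True
```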

For future reference, the resolution is the following:

change

flt_snt = [w for w in wrd_tkn if not w in stp_wrd]

to

flt_snt = [[w for w in s if not w in stp_wrd] for s in wrd_tkn]
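A self-contained sketch of the corrected nested comprehension, using assumed pre-tokenized sentences so it runs without downloading NLTK corpora. The outer loop walks the sentences; the inner comprehension filters the words, leaving wrd_tkn itself unchanged.

```python
# Assumed stand-ins for word_tokenize output and the NLTK stop word set.
wrd_tkn = [["This", "is", "a", "test"], ["Stop", "words", "are", "removed"]]
stp_wrd = {"is", "a", "are"}

# Filter each inner sentence list; wrd_tkn is not modified.
flt_snt = [[w for w in s if w not in stp_wrd] for s in wrd_tkn]
print(flt_snt)  # [['This', 'test'], ['Stop', 'words', 'removed']]
```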
