
TypeError: unhashable type: 'list' when using a Python set of strings

I know there have been several very similar questions here on SO, but none of the answers really addresses mine.

I'm trying to remove a series of stop words and punctuation from a list of words to perform basic natural language processing.

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

text = "Hello there. I am currently typing Python. "
custom_stopwords = set(stopwords.words('english') + list(punctuation))

# tokenizes the text into a list of sentences
sentences = sent_tokenize(text)

# tokenizes each sentence into a list of words
words = [word_tokenize(sentence) for sentence in sentences]
filtered_words = [word for word in words if word not in custom_stopwords]
print(filtered_words)

This throws TypeError: unhashable type: 'list' on the filtered_words line. Why is this error being thrown? I'm not providing a list at all; I'm providing a set.

Note: I've read the post on SO about this exact error, but I still have the same question. The accepted answer provides this explanation:

Sets require their items to be hashable. Out of types predefined by Python only the immutable ones, such as strings, numbers, and tuples, are hashable. Mutable types, such as lists and dicts, are not hashable because a change of their contents would change the hash and break the lookup code.
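The quoted rule can be checked directly with the built-in hash(). The following is only a minimal sketch; is_hashable is a hypothetical helper written for this illustration and is not part of Python or NLTK.

```python
def is_hashable(value):
    """Return True if hash(value) succeeds (hypothetical helper for this sketch)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


# Immutable built-ins are hashable, and equal values produce equal hashes.
same_string_hash = hash("python") == hash("python")
same_tuple_hash = hash((1, 2)) == hash((1, 2))

print(is_hashable("word"))        # True
print(is_hashable(("a", "b")))    # True
print(is_hashable(["a", "b"]))    # False: lists are mutable
print(is_hashable({"a": 1}))      # False: dicts are mutable
print(is_hashable((["a"], "b")))  # False: a tuple is hashable only if its items are
```

The last line is a refinement of the quoted explanation: a tuple is hashable only when everything inside it is hashable too.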

I'm providing a set of strings here, so why is Python still complaining?

EDIT: After reading further in this SO post, which recommends using tuples, I changed my collection object:

custom_stopwords = tuple(stopwords.words('english'))

I also realized I have to flatten my list, since calling word_tokenize(sentence) for each sentence creates a list of lists, which will not filter out punctuation correctly (a list object will never be in custom_stopwords, which contains only strings).
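A minimal sketch of that flattening step follows. So that it runs without NLTK or its data files, the stop word list and the token lists are hard-coded, hypothetical stand-ins for stopwords.words('english') and for what word_tokenize is assumed to return.

```python
from string import punctuation

# Hypothetical stand-ins: a tiny stop word list and pre-tokenized sentences.
custom_stopwords = set(["i", "am", "there"] + list(punctuation))
words = [
    ["Hello", "there", "."],
    ["I", "am", "currently", "typing", "Python", "."],
]

# Flatten the list of lists so each item tested against the set is a string.
flat_words = [token for sentence_tokens in words for token in sentence_tokens]

# The comparison is case-sensitive, so "I" does not match the stop word "i".
filtered_words = [token for token in flat_words if token not in custom_stopwords]
print(filtered_words)  # ['Hello', 'I', 'currently', 'typing', 'Python']
```

With the list flattened, the original set works unchanged; converting custom_stopwords to a tuple is not needed for the filter to run.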

However, this still raises the question: why does Python consider tuples hashable but not sets of strings? And why does the TypeError say list?

words is a list of lists, since word_tokenize() returns a list of words for each sentence.

When you run [word for word in words if word not in custom_stopwords], each word is actually a list. To evaluate word not in custom_stopwords against a set, Python has to hash word, and that fails because lists are mutable containers and are not hashable in Python.
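The sketch below isolates that membership test. The set and the token list are small hypothetical stand-ins, not NLTK output. It also suggests why switching to a tuple made the error disappear: a tuple is searched by equality, one element at a time, so nothing is hashed, and a list simply never equals any of the strings.

```python
custom_stopwords = {"i", "am", "there", "."}  # hypothetical stand-in set
token_list = ["Hello", "there", "."]          # one inner list from words

# A string is hashable, so the set lookup works.
string_found = "there" in custom_stopwords
print(string_found)  # True

# A list is not hashable, so the same test against a set raises TypeError.
error_message = None
try:
    token_list in custom_stopwords
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # the message names the unhashable type: 'list'

# A tuple compares by equality instead of hashing: no error, but never a match.
list_found_in_tuple = token_list in tuple(custom_stopwords)
print(list_found_in_tuple)  # False
```

So the TypeError says list because the object being hashed is the left-hand operand of the in test, not the set itself.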

These posts might help to understand what "hashable" means and why mutable containers are not:
