
How to cluster the values of a dict in Python?

Basically, I have a dict in Python with string keys and lists of ints as values.

dict = {"Option1Results" : [4, 1, 5, 2, 4],
        "Option2Results" : [11, 44, 2, 1, 5],
        ....
        }

I would like to implement hierarchical clustering on this dict based on the intersection of the values. For example, if Option1Results and Option4Results share about 70% of the same integers, they should be clustered together. Is there a way to do this other than looping through the dictionary and comparing the values one by one?

I think you could use two tools here: cosine similarity and k-means.

cosine similarity:

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
https://en.wikipedia.org/wiki/Cosine_similarity

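import random
from sklearn.metrics import pairwise
data = {'Option{}Results'.format(i): [random.randint(1, 100) for _ in range(5)] for i in range(100)}
values = list(data.values())  # dict views cannot be indexed in Python 3
pairwise.cosine_similarity([values[0]], [values[1]])  # each input must be 2D: one row per vector
# e.g. array([[ 0.85988428]]); the exact value depends on the random data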
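To make the definition concrete, here is a minimal sketch that computes the same quantity by hand for the two lists shown in the question. The helper name cosine_similarity is only illustrative and is not the scikit-learn function. Note that this compares the lists position by position as vectors, which is not the same thing as measuring how many integers they share.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# The two lists from the question
option1 = [4, 1, 5, 2, 4]
option2 = [11, 44, 2, 1, 5]

similarity = cosine_similarity(option1, option2)
print(similarity)
```

A vector compared with itself gives 1.0 (up to floating-point rounding), and vectors with no overlapping non-zero positions give 0.0.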

kmeans:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. https://en.wikipedia.org/wiki/K-means_clustering

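from sklearn.cluster import KMeans
# Each 5-element list is one sample; the dict view is converted to a list first
kmeans = KMeans(n_clusters=5, random_state=0).fit(list(data.values()))
kmeans.predict([data['Option70Results']])  # predict expects a 2D input: one row per sample
# e.g. array([2]); the label depends on the random data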

To find the intersection of all the values of the given dict as a set:

intersection = set.intersection(*map(set, dict.values()))
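The line above gives the integers common to every value at once. For the pairwise "share about 70%" comparison described in the question, one possible sketch is the Jaccard index (size of the intersection divided by size of the union) for each pair of keys. The helper jaccard, the variable names, and the Option3Results list are assumptions made for illustration; only the first two lists come from the question. This still visits each pair once, but itertools.combinations keeps the loop short.

```python
from itertools import combinations

def jaccard(a, b):
    """Fraction of distinct values that two collections have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Sample data in the shape used in the question (Option3Results is made up)
results = {
    "Option1Results": [4, 1, 5, 2, 4],
    "Option2Results": [11, 44, 2, 1, 5],
    "Option3Results": [4, 1, 5, 2, 9],
}

# Overlap score for every unordered pair of keys
overlap = {
    (k1, k2): jaccard(results[k1], results[k2])
    for k1, k2 in combinations(results, 2)
}

for pair, score in overlap.items():
    print(pair, round(score, 2))
```

Duplicates inside a list are ignored because the values are converted to sets, which matches the set-based intersection used above.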

Hierarchical clustering can be achieved using scipy's linkage and fcluster. Hierarchical clustering using scipy is explained in this answer.
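As a rough sketch of how linkage and fcluster could be combined with an intersection-based distance: each value is turned into a boolean membership row, scipy's pdist computes the Jaccard distance (1 minus the shared fraction) between rows, and the tree is cut at a distance of 0.3, which corresponds to roughly 70% overlap. The Option3Results and Option4Results lists, the average linkage method, and the 0.3 threshold are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Sample data; Option3Results and Option4Results are made up for this sketch
results = {
    "Option1Results": [4, 1, 5, 2, 4],
    "Option2Results": [11, 44, 2, 1, 5],
    "Option3Results": [11, 44, 2, 1, 5, 7],
    "Option4Results": [4, 1, 5, 2, 9],
}

keys = list(results)
sets = [set(results[k]) for k in keys]
vocab = sorted(set().union(*sets))

# One boolean row per key: True where the key's list contains that integer
membership = np.array([[v in s for v in vocab] for s in sets], dtype=bool)

# Condensed matrix of Jaccard distances between every pair of rows
distances = pdist(membership, metric="jaccard")

# Build the hierarchy, then cut it where clusters are at most 0.3 apart
tree = linkage(distances, method="average")
labels = fcluster(tree, t=0.3, criterion="distance")

clusters = {}
for key, label in zip(keys, labels):
    clusters.setdefault(int(label), []).append(key)
print(clusters)
```

With this sample data, Option1Results and Option4Results end up in one cluster and Option2Results and Option3Results in another. Raising t merges looser groups, and other linkage methods such as "single" or "complete" change how distances between groups are measured.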


 