
How to cluster sparse data using Sklearn Kmeans

How do you cluster sparse data using Sklearn's Kmeans implementation?

Attempting to adapt their example for my own use case, I tried:

from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

kmeans_data = []
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    freqs = dict((k, v/cnt_sum) for k, v in raw_data.items())
    v = DictVectorizer(sparse=True)
    X = v.fit_transform(freqs)
    kmeans_data.append(X)

kmeans = KMeans(n_clusters=2, random_state=0).fit(kmeans_data)

but this throws the exception:

  File "/myproject/.env/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 854, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
  File "/myproject/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

Presumably I'm not constructing my sparse input matrix X correctly, as it's a list of sparse matrices instead of a sparse matrix containing lists. How do I construct a proper input matrix?

You are building a sparse matrix incrementally. I am not sure whether DictVectorizer can be used in an incremental manner. It would be simpler to just add the elements to the matrix one by one. See the last example in the scipy.sparse.csr_matrix documentation.

Incremental construction

Consider the following double loop:

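from scipy.sparse import csr_matrix

data = []        # normalized frequencies (the non-zero values)
rows = []        # row index of each value
cols = []        # column index of each value
vocabulary = {}  # maps each word to its column index
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        f = v / cnt_sum
        # Assign the next free column to a word not seen before
        i = vocabulary.setdefault(k, len(vocabulary))
        cols.append(i)
        rows.append(index - 1)  # the indices in mydata start at 1
        data.append(f)

kmeans_data = csr_matrix((data, (rows, cols)))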

Then kmeans_data is a sparse matrix with one row per sample, suitable for use as input to K-means clustering.
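As a usage sketch, a matrix built this way can be passed straight to KMeans.fit. The snippet below repeats the construction so that it runs on its own. Using enumerate for the row index, passing an explicit shape, and setting n_init=10 are my own assumptions, not part of the answer above.

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

data, rows, cols = [], [], []
vocabulary = {}  # maps each word to its column index
# Assumption: use the position in the list as the row index,
# so the ids stored in mydata need not be contiguous.
for row, (_, raw_data) in enumerate(mydata):
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        cols.append(vocabulary.setdefault(k, len(vocabulary)))
        rows.append(row)
        data.append(v / cnt_sum)

# Assumption: an explicit shape keeps trailing all-zero rows or columns
kmeans_data = csr_matrix(
    (data, (rows, cols)), shape=(len(mydata), len(vocabulary))
)

# One sparse matrix with one row per sample is what KMeans expects
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(kmeans_data)
labels = kmeans.labels_  # one cluster label per row of kmeans_data
```

Each entry of labels corresponds to the record at the same position in mydata.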

Direct construction

With DictVectorizer you could construct the data matrix from the list of tuples and then use sparse linear algebra routines to normalize the rows.

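import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction import DictVectorizer

# 1. Construct the sparse matrix of occurrence counts
D = [d[1] for d in mydata]
v = DictVectorizer(sparse=True)
kmeans_data = v.fit_transform(D)
# 2. Normalize by computing the sum of each row and dividing
sums = np.asarray(kmeans_data.sum(axis=1)).ravel()
N = len(sums)
# Diagonal matrix of reciprocal row sums; left-multiplying by it scales each row
divisor = csr_matrix((np.reciprocal(sums), (np.arange(N), np.arange(N))), shape=(N, N))
kmeans_data = divisor @ kmeans_data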
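As a possible alternative to the hand-built diagonal matrix, sklearn.preprocessing.normalize with norm='l1' divides each row by the sum of its absolute values, which for non-negative counts is the same as dividing by the row sum. This is a swapped-in technique, not the method from the answer above, and n_init=10 is again my own choice. A minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

# Fit a single vectorizer on all records so they share one vocabulary
vectorizer = DictVectorizer(sparse=True)
counts = vectorizer.fit_transform([d for _, d in mydata])

# L1 row normalization turns counts into relative frequencies
# and keeps the matrix sparse
kmeans_data = normalize(counts, norm='l1', axis=1)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(kmeans_data)
labels = kmeans.labels_
```

Fitting one vectorizer on the whole list is what keeps the columns aligned across samples; fitting a new one per record, as in the question, gives each row its own vocabulary.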
