
How to cluster sparse data using Sklearn Kmeans

How do you cluster sparse data using Sklearn's Kmeans implementation?

Attempting to adapt their example for my own use case, I tried:

from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

kmeans_data = []
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    freqs = dict((k, v/cnt_sum) for k, v in raw_data.items())
    v = DictVectorizer(sparse=True)
    X = v.fit_transform(freqs)
    kmeans_data.append(X)

kmeans = KMeans(n_clusters=2, random_state=0).fit(kmeans_data)

but this throws the exception:

  File "/myproject/.env/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 854, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
  File "/myproject/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

Presumably I'm not constructing my sparse input matrix X correctly, as it's a list of sparse matrices instead of a sparse matrix containing lists. How do I construct a proper input matrix?

You are building the sparse matrix incrementally, and I am not sure DictVectorizer can be used in an incremental manner. It is simpler to add the elements to the matrix one by one; see the last example in the scipy.sparse.csr_matrix documentation.

Incremental construction

Consider the following double loop:

from scipy.sparse import csr_matrix

data = []
rows = []
cols = []
vocabulary = {}
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        f = v / cnt_sum  # relative frequency of this word in the document
        i = vocabulary.setdefault(k, len(vocabulary))  # column index for this word
        cols.append(i)
        rows.append(index - 1)
        data.append(f)

kmeans_data = csr_matrix((data, (rows, cols)))

Then kmeans_data is a sparse matrix suitable for use as input to KMeans.
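Putting the whole thing together, here is a self-contained sketch that builds the matrix with the double loop and feeds it straight to KMeans (n_init is set explicitly, which is an addition on my part to keep the call unambiguous across sklearn versions; the toy data is the question's example):

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Toy data from the question: (id, {word: count}) tuples
mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

data, rows, cols = [], [], []
vocabulary = {}
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        i = vocabulary.setdefault(k, len(vocabulary))  # one column per word
        rows.append(index - 1)
        cols.append(i)
        data.append(v / cnt_sum)  # relative frequency

kmeans_data = csr_matrix((data, (rows, cols)))

# KMeans accepts CSR input directly
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(kmeans_data)
```

After fitting, kmeans.labels_ holds one cluster id per document.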

Direct construction

With DictVectorizer you can construct the whole data matrix from the list of tuples in one step and then use sparse linear algebra routines to normalize the rows.

from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import csr_matrix
import numpy as np

# 1. Construct the sparse matrix of occurrence counts
D = [d[1] for d in mydata]
v = DictVectorizer(sparse=True)
kmeans_data = v.fit_transform(D)
# 2. Normalize by computing the sum of each row and dividing
sums = np.sum(kmeans_data, axis=1).A[:, 0]
N = len(sums)
divisor = csr_matrix((np.reciprocal(sums), (range(N), range(N))))
kmeans_data = divisor * kmeans_data
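The row normalization can also be done without building a diagonal divisor matrix by hand. This is an alternative I am adding, not part of the original answer: sklearn.preprocessing.normalize with norm='l1' divides each row of a sparse matrix by its row sum, which is exactly the frequency normalization above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize

# Toy data from the question: (id, {word: count}) tuples
mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]

D = [d[1] for d in mydata]
X = DictVectorizer(sparse=True).fit_transform(D)
# L1-normalize each row so counts become relative frequencies
kmeans_data = normalize(X, norm='l1', axis=1)
```

The result stays sparse, so it can be passed to KMeans unchanged.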
