I am trying to determine the Euclidean distance of my documents from their centroids. The dimensions of the two arrays in question (points and centers) satisfy the XA and XB dimensional requirements for scipy.spatial.distance.cdist, but I don't know why I'm getting the ValueError below.
My code:
import pandas as pd, numpy as np
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
corpus = pd.Series(["bye bye brutal good bye apple banana orange", "bye bye hello apple banana", "corn wheat apple banana goodbye cookie brutal", "fruit cake banana apple bye sweet sweet"])
# Build the TF-IDF matrix for the documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# Cluster the documents and read off the fitted centroids
model = KMeans(n_clusters = 2)
model.fit(X)
centers = model.cluster_centers_
# This is the call that raises the ValueError
cdist(X, centers)
This is the error I get:
ValueError: setting an array element with a sequence.
From the scipy.spatial.distance.cdist documentation:
Parameters: XA: ndarray
An Ma by n array of Ma original observations in an n-dimensional space
XB: ndarray
An Mb by n array of Mb original observations in an n-dimensional space
...
My X and centers numpy arrays certainly satisfy these dimensional conditions for cdist, right? What am I missing?
You only need one small change:
cdist(X.toarray(), centers)
X is a scipy.sparse.csr.csr_matrix, not a NumPy array, so cdist does not accept it directly as input. The toarray() method converts it to a dense NumPy array that cdist can use.
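For reference, here is a minimal end-to-end sketch of the fix, not a definitive implementation. It makes a few assumptions that are not in the question: the corpus is a plain list rather than a pd.Series, n_init and random_state are set only to make the run reproducible, and the names distances, own_distance and alt are just illustrative. It also shows KMeans.transform, which should give the same document-to-centroid distances and accepts the sparse matrix as is.

```python
import numpy as np
from scipy.sparse import issparse
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Same documents as in the question (plain list used here for brevity)
corpus = [
    "bye bye brutal good bye apple banana orange",
    "bye bye hello apple banana",
    "corn wheat apple banana goodbye cookie brutal",
    "fruit cake banana apple bye sweet sweet",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # SciPy sparse matrix, not an ndarray
print(type(X), issparse(X))

# n_init and random_state are assumptions, chosen only for reproducibility
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = model.cluster_centers_  # dense ndarray, shape (2, n_features)

# Densify X before passing it to cdist
distances = cdist(X.toarray(), centers)  # shape (n_documents, n_clusters)

# Distance from each document to the centroid of its assigned cluster
own_distance = distances[np.arange(len(corpus)), model.labels_]

# Alternative that works on the sparse matrix directly
alt = model.transform(X)

print(distances.shape)
print(own_distance)
```

Note that toarray() materializes the full dense matrix, which is fine for four documents but can be memory-hungry for a large vocabulary; in that case model.transform(X) avoids the conversion.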