
Find distance between centroid and points in a single feature dataframe - KMeans

I'm working on an anomaly detection task using KMeans.
The Pandas dataframe that I'm using has a single feature and looks like the following:

df = array([[12534.],
           [12014.],
           [12158.],
           [11935.],
           ...,
           [ 5120.],
           [ 4828.],
           [ 4443.]])

I'm able to fit the model and predict cluster labels with the following instructions:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)

In order to identify anomalies, I would like to calculate the distance between each point and its centroid, but with a single-feature dataframe I'm not sure that this is the correct approach.

I found examples that use Euclidean distance to calculate the distance. One of them is the following:

def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})

This code doesn't work for me because my dataframe has a single feature, so each centroid has only one coordinate and cannot be unpacked into (cx, cy). In my case the centroids look like this:

array([[11899.90692187],
       [ 5406.54143126]])

In this case, what is the correct approach to find the distance between the centroids and the points? Is it possible?

Thank you, and sorry for the trivial question; I'm still learning.

You can make use of scipy.spatial.distance_matrix:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# set up a set of 2-D points
np.random.seed(2)
df = np.random.uniform(0, 1, (100, 2))

# make it a dataframe
df = pd.DataFrame(df)

# clustering with 3 clusters
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(df)
preds = km.predict(df)

# get centroids
centroids = km.cluster_centers_

# visualize
plt.scatter(df[0], df[1], c=preds)
plt.scatter(centroids[:,0], centroids[:,1], c=range(centroids.shape[0]), s=1000)

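gives

[Figure: scatter plot of the points colored by predicted cluster, with the three centroids drawn as large markers]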

Now the distance matrix:

from scipy.spatial import distance_matrix

dist_mat = pd.DataFrame(distance_matrix(df.values, centroids))

You can confirm that this is correct by checking that the nearest centroid of every point matches its predicted label:

dist_mat.idxmin(axis=1) == preds
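As a further cross-check, a fitted KMeans model can produce the same sample-to-centroid distances through its transform method. The following is only a minimal sketch on freshly generated toy data (the seed, n_init and random_state values are assumptions for reproducibility, not part of the example above):

```python
import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import KMeans

# toy 2-D data; an assumption, not the same points as in the example above
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (100, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# distances computed explicitly from the centroids
dists_scipy = distance_matrix(X, km.cluster_centers_)

# distances reported by the model itself: one column per cluster center
dists_sklearn = km.transform(X)

# both matrices have shape (100, 3) and should agree up to rounding error
same = np.allclose(dists_scipy, dists_sklearn, atol=1e-6)
print(dists_sklearn.shape, same)
```

If the two matrices agree, either route can be used; transform simply saves the extra SciPy call.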

And finally, the mean distance to centroids:

dist_mat.groupby(preds).mean()

gives:

          0         1         2
0  0.243367  0.525194  0.571674
1  0.525350  0.228947  0.575169
2  0.560297  0.573860  0.197556

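where the columns denote the centroid index and each row gives the mean distance from the points assigned to that cluster to each centroid. The diagonal entries are therefore the mean distances of the points to their own centroid.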
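For anomaly detection you probably want the distance of each point to its own centroid rather than the cluster means. Here is a rough sketch using a small hand-made stand-in for dist_mat and preds (the numbers and the 75th-percentile cutoff are purely illustrative assumptions; pick a threshold that suits your data):

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins: 4 points, 2 centroids
dist_mat = pd.DataFrame([[0.1, 0.9],
                         [0.2, 0.8],
                         [0.7, 0.3],
                         [1.5, 0.9]])
preds = np.array([0, 0, 1, 1])

# pick, for every row, the column of the cluster that row was assigned to
own_dist = dist_mat.to_numpy()[np.arange(len(preds)), preds]

# illustrative cutoff (an assumption): flag points beyond the 75th percentile
threshold = np.quantile(own_dist, 0.75)
flagged = own_dist > threshold

print(own_dist)   # distance of each point to its own centroid
print(flagged)    # True marks a candidate anomaly
```

With the real dist_mat and preds from above, only the first two assignments need to be dropped.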

You can use scipy.spatial.distance.cdist to create a distance matrix:

from scipy.spatial.distance import cdist
dm = cdist(df, centroids)

This should give you a 2-D array in which each row represents an observation in your original dataset and each column represents a centroid. The entry in the x-th row and y-th column is the distance between your x-th observation and your y-th cluster centroid. cdist uses Euclidean distance by default, but you can pass other metrics (not that it matters much for a dataset with only one feature, where the Euclidean distance is just the absolute difference |x - c|).
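To make the single-feature case concrete, here is a small sketch that uses a few of the values and the two centroids quoted in the question (the four-row sample is an assumption; the full data was not shown). It checks that cdist agrees with the plain absolute difference and that switching to the "cityblock" metric changes nothing in one dimension:

```python
import numpy as np
from scipy.spatial.distance import cdist

# a few single-feature observations taken from the question, shape (4, 1)
X = np.array([[12534.], [12014.], [5120.], [4443.]])

# centroids as printed in the question, shape (2, 1)
centroids = np.array([[11899.90692187], [5406.54143126]])

dm = cdist(X, centroids)                              # Euclidean, shape (4, 2)
dm_cityblock = cdist(X, centroids, metric="cityblock")

# in 1-D the distance is simply |x - c|; broadcasting (4, 1) against (1, 2)
manual = np.abs(X - centroids.T)

nearest = dm.argmin(axis=1)   # index of the closest centroid per observation
own_dist = dm.min(axis=1)     # distance to that closest centroid

print(np.allclose(dm, manual), np.allclose(dm, dm_cityblock))
print(nearest)
```

So the approach from the 2-D examples carries over unchanged; only the centroid unpacking into (cx, cy) has to go.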
