Confused with Clustering

I am getting confused by clustering in the data science process. We know that grouping similar points in a 2D space is based on this formula:

distance = sqrt( (x2-x1)^2 + (y2-y1)^2 )

But when passing inputs to sklearn, we only feed in the x-axis values. What happened to the y-axis values?

For example, we have the following data set:

index    x     y
------------------
  0      5     8
  1      6     9
  2      7     10

and we pass x to KMeans:

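import pandas as pd
from sklearn.cluster import KMeans
# the example data set from the table above
df = pd.DataFrame({"x": [5, 6, 7], "y": [8, 9, 10]})
kmeans = KMeans(2)
kmeans.fit(df[["x"]])  # only the x column; fit expects 2D input, hence the double brackets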

How can it calculate the distance without the y values?

KMeans clustering can be done in any number of dimensions. As you said, the distance can be calculated using the Euclidean distance, which is defined for any number of dimensions. You passed a single column, so in this case there is just one dimension, and the formula simplifies to:

distance = sqrt((x2-x1)^2)

which is really just the absolute value of (x2-x1).
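
To see the difference in practice, here is a minimal sketch that clusters the same rows once on x alone and once on both x and y. The data values are hypothetical (they are not the ones from the question) and were picked so that the two runs group the rows differently. `n_init` and `random_state` are set only to make the run repeatable. It assumes a recent scikit-learn, where `fit` expects a 2D input of shape (n_samples, n_features), which is why the columns are selected with double brackets.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data (not from the question): x and y suggest different groupings.
df = pd.DataFrame({
    "x": [1.0, 2.0, 1.1, 2.1],
    "y": [0.0, 0.1, 9.0, 9.1],
})

# One feature: distances reduce to |x2 - x1|, so y plays no role at all.
# Double brackets keep the input 2D with shape (n_samples, 1).
labels_1d = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[["x"]]).labels_

# Two features: distances are sqrt((x2-x1)^2 + (y2-y1)^2).
labels_2d = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[["x", "y"]]).labels_

print(labels_1d)  # rows 0 and 2 share a cluster (similar x)
print(labels_2d)  # rows 0 and 1 share a cluster (the large gap in y dominates)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is which rows end up together. So if you want y to be taken into account, pass both columns, for example `df[["x", "y"]]`.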
