
How reliable is the Elbow curve in finding K in K-Means?

So I was trying to use the Elbow curve to find the optimal value of 'K' (the number of clusters) in K-Means clustering.

The clustering was done on the average Word2Vec vectors of a text column in my dataset (1467 rows). But looking at my text data, I can clearly find more than 3 groups that the data could be grouped into.

I read that the reasoning is to pick a small value of k while keeping the Sum of Squared Errors (SSE) low. Can somebody tell me how reliable the Elbow curve is? Also, is there something I'm missing?

I'm attaching the Elbow curve for reference. I also tried plotting it up to 70 clusters, as an exploratory check.

[image]

[image]

The "Elbow" is not even well defined. So how can it be reliable?

You can "normalize" the values by the expected drop-off from splitting the data into k clusters, and the curve becomes a bit more readable. Unfortunately, I forgot the exact name of that. The Calinski and Harabasz (1974) variance ratio criterion? If I recall the name correctly, that is essentially a rescaled version that makes much more sense.
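To make the rescaling idea concrete, here is a rough sketch and not a definitive recipe. The variance ratio criterion is usually written as CH(k) = [B_k / (k - 1)] / [W_k / (n - k)], where W_k is the within-cluster SSE (the quantity on the Elbow curve), B_k is the between-cluster sum of squares, and n is the number of points. scikit-learn exposes it as `calinski_harabasz_score`. The data below is a synthetic stand-in (four well-separated blobs), not the asker's Word2Vec vectors, and the helper name `scan_k` is made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Assumption: synthetic 2-D data with four well-separated groups,
# standing in for the averaged Word2Vec vectors from the question.
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)


def scan_k(X, k_values, seed=0):
    """Fit K-Means for each k and return (sse, ch) dicts keyed by k.

    sse[k] is the within-cluster sum of squares (the Elbow curve value);
    ch[k] is the Calinski-Harabasz variance ratio for the same fit.
    """
    sse, ch = {}, {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        sse[k] = km.inertia_
        ch[k] = calinski_harabasz_score(X, km.labels_)
    return sse, ch


# The criterion is only defined for k >= 2.
sse, ch = scan_k(X, range(2, 9))
best_k = max(ch, key=ch.get)

for k in sorted(sse):
    print(f"k={k}  SSE={sse[k]:.1f}  CH={ch[k]:.1f}")
print("k with the highest Calinski-Harabasz score:", best_k)
```

The SSE column tends to keep shrinking as k grows, which is why eyeballing an "elbow" is ambiguous; the CH column instead has a maximum you can read off. On real text embeddings the peak may be flat or misleading, so treat it as one hint among several and not as proof of the "true" K.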
