How to get the optimal number of clusters using hierarchical cluster analysis automatically in Python?
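I want to use hierarchical cluster analysis to obtain the optimal number of clusters (K) automatically, and then apply this K to K-means clustering in Python.
After studying many articles, I know that some methods tell us to draw a plot to determine K, but is there any way to output the actual number automatically in Python?
Hierarchical clustering determines the optimal number of clusters from a dendrogram. Plot the dendrogram with code similar to the following: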
# General imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Special imports
from scipy.cluster.hierarchy import dendrogram, linkage
# Load data, fill in appropriately: X must be an (n_samples, n_features) array
X = []
# Leaf labels, one per observation (assumed here to be running numbers)
labelList = list(range(1, len(X) + 1))
# How to cluster the data; 'single' uses the minimal distance between clusters
linked = linkage(X, 'single')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=labelList,
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
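In the dendrogram, find the largest vertical distance between nodes and draw a horizontal line through the middle of it. The number of vertical lines this horizontal line intersects is the optimal number of clusters (when the affinity is computed with the method set in linkage).
See the example here: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/
How to read the dendrogram automatically and extract that number is something I would also like to know.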
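One possible way to automate the "largest vertical gap" rule described above is to read the merge heights straight from the linkage matrix instead of from the plot. The sketch below is not part of the original answer: the helper name `largest_gap_k` and the toy data are made up for illustration, and it assumes a linkage method whose merge heights never decrease (single, complete, average, ward). It finds the biggest jump between consecutive merge heights, places the cut in the middle of that jump, and counts the clusters that remain.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def largest_gap_k(Z):
    """Return (k, threshold) for the largest jump between consecutive merge heights.

    Z is a SciPy linkage matrix; column 2 holds the height of each merge.
    """
    heights = Z[:, 2]
    gaps = np.diff(heights)              # gaps[i] = heights[i + 1] - heights[i]
    i = int(np.argmax(gaps))             # the cut goes between merge i and merge i + 1
    threshold = (heights[i] + heights[i + 1]) / 2
    k = len(heights) - i                 # n observations minus the (i + 1) merges below the cut
    return k, threshold


# Toy data (an assumption for illustration): three unit squares, far apart from each other
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X = np.vstack([square, square + [10, 0], square + [0, 10]])

Z = linkage(X, 'single')
k, threshold = largest_gap_k(Z)

# Cut the tree at the chosen height to get one flat cluster label per observation
labels = fcluster(Z, t=threshold, criterion='distance')

print(f"Estimated number of clusters = {k}")
print(f"Cut height = {threshold}")
```

Treat the result as a heuristic: the largest gap often sits at the very top of the tree and then suggests K = 2, so it is worth checking the number against the dendrogram itself.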
Added in edit: there is a way to do this using the scikit-learn package. See the following example:
#==========================================================================
# Hierarchical Clustering - Automatic determination of number of clusters
#==========================================================================
# General imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from os import path
# Special imports
from scipy.cluster.hierarchy import dendrogram, linkage
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
# %matplotlib inline
print("============================================================")
print(" Hierarchical Clustering demo - num of clusters ")
print("============================================================")
print(" ")
folder = path.dirname(path.realpath(__file__)) # set current folder
# Load data
customer_data = pd.read_csv( path.join(folder, "hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv"))
# print(customer_data.shape)
print("In this data there should be 5 clusters...")
# Retain only the last two columns
data = customer_data.iloc[:, 3:5].values
# # Plot dendrogram using SciPy
# plt.figure(figsize=(10, 7))
# plt.title("Customer Dendrograms")
# dend = shc.dendrogram(shc.linkage(data, method='ward'))
# plt.show()
# Initialize the hierarchical clustering method. For the algorithm to determine the number of clusters itself,
# set n_clusters=None and compute_full_tree=True.
# The best distance threshold value for this dataset is distance_threshold = 200.
# Ward linkage always uses Euclidean distance, so no affinity/metric argument is needed
# (the `affinity` parameter was renamed to `metric` in newer scikit-learn releases).
cluster = AgglomerativeClustering(n_clusters=None, linkage='ward', compute_full_tree=True, distance_threshold=200)
# Cluster the data
cluster.fit_predict(data)
print(f"Number of clusters = {1+np.amax(cluster.labels_)}")
# Display the clustering, assigning cluster label to every datapoint
print("Classifying the points into clusters:")
print(cluster.labels_)
# Display the clustering graphically in a plot
plt.scatter(data[:,0],data[:,1], c=cluster.labels_, cmap='rainbow')
plt.title(f"SK Learn estimated number of clusters = {1+np.amax(cluster.labels_)}")
plt.show()
print(" ")
The data is taken from here: https://stackabuse.s3.amazonaws.com/files/hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv
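To close the loop with the question, the number of clusters found by the hierarchical step can be handed on to K-means. A minimal sketch, assuming a small synthetic dataset and a hand-picked `distance_threshold` of 5 (both invented for this example; the shopping data would need its own threshold):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Toy data (an assumption for illustration): three unit squares, far apart from each other
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
data = np.vstack([square, square + [10, 0], square + [0, 10]])

# Step 1: let hierarchical clustering decide how many clusters there are
hier = AgglomerativeClustering(n_clusters=None, linkage='ward',
                               compute_full_tree=True, distance_threshold=5)
hier.fit(data)
k = hier.n_clusters_  # number of clusters found below the distance threshold

# Step 2: reuse that K for K-means
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(data)

print(f"K from hierarchical clustering = {k}")
print(kmeans_labels)
```

The K-means labels will generally be numbered differently from the hierarchical ones, so compare the groupings rather than the label values.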