How to calculate the density of each centroid in Python?

Question

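I have k-means clustered data and the k-means cluster centroids. I want to compute the density of each cluster centroid and remove the cluster whose centroid has the highest density. I did my research and found this formula.

N(c) is the set of neighboring cluster centroids of cluster c, and it should contain 5 centroids. I tried to implement the algorithm but could not get it to work. Can you help me implement it?

Here is my code so far: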

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

df = make_blobs(n_samples=5000, n_features=15, centers=15, cluster_std=1, random_state=10)
X, y = df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
TrainData = X_train, y_train
n_clusters_sampling = 10

kmeans2 = KMeans(n_clusters=n_clusters_sampling, random_state=10)
kmeans2.fit(X_train)
centroids = kmeans2.cluster_centers_

Answer 1

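Your problem is essentially a k-nearest-neighbor search on the "new dataset" formed by the centroids. For each centroid, you want its 5 closest centroids and the associated distances. Fortunately, sklearn provides the NearestNeighbors class for exactly this: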

...
centroids = kmeans2.cluster_centers_

import numpy as np
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=6) # 6 is not a typo. Explanation below.
nn.fit(centroids)
neigh_dist, neigh_ind = nn.kneighbors(centroids, return_distance=True)
densities = [5/np.sum(neigh_dist[i,:]) for i in range(centroids.shape[0])]
print(densities)

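Note that we fit the nn object on the same data points (the centroids) that we then query. That is why n_neighbors is 6: for each centroid, the centroid itself is its own nearest neighbor, at distance zero.
The .kneighbors() method, when return_distance is set to True, (also) returns an array of distances of shape (n, n_neighbors), where n is the number of query points, i.e. the centroids. Cell (i, j) of that array is the distance from centroid i to its neighbor j. To compute the densities according to the formula you posted, we therefore divide 5 by the sum of each row; the zero self-distance does not affect the sum, so this is the reciprocal of the mean distance to the 5 neighbors.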
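As a small self-contained illustration of the computation above, here is a minimal sketch that wraps it in a function and runs it on toy data. It assumes, as the code above implies, that the density of a centroid is the number of neighbors divided by the summed distance to them. The function name centroid_densities, the parameter k, and the toy centroids are hypothetical and not part of the original answer.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def centroid_densities(centroids, k=5):
    """Density of each centroid: k divided by the summed distance to its k nearest centroids."""
    # Ask for k + 1 neighbors because each centroid is returned as its own
    # nearest neighbor at distance zero.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(centroids)
    neigh_dist, _ = nn.kneighbors(centroids)
    # Drop the first column (the zero self-distance) and sum the rest per row.
    return k / neigh_dist[:, 1:].sum(axis=1)


# Toy data (hypothetical): three centroids packed together and one far away.
toy_centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
toy_densities = centroid_densities(toy_centroids, k=2)
print(toy_densities)
```

The isolated centroid gets the lowest density, and the centroid at the origin, whose two neighbors are both at distance 1, gets the highest.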

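Edit: the next part of the answer addresses the OP's comment about removing the cluster with the highest density.

Removing a cluster, say c, essentially means reassigning the cluster labels of its data points to the next-closest centroid. So we now have a new 1-nearest-neighbor problem, and we can again use the NearestNeighbors object we created.

We run a 2-nearest-neighbor search on the "centroid dataset" for the points originally assigned to c.
The first neighbor will of course be c, so we keep only the second-nearest neighbor.
Then we simply update the original assignment table for those data points with the new indices.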

# run k-means and get an array of initial cluster assignments.
assignments = kmeans2.predict(X_train)

# find the index of cluster to be removed
c = np.argmax(densities)

# for each point originally assigned to c, find its closest centroid.
# Again we are using the trick of searching for one more neighbor since we know
# the closest centroid of those points will be c.
nearest_centroids = nn.kneighbors(X_train[assignments==c,:], n_neighbors=2, return_distance=False)

# get the new closest centroid (that is, the second column of the array) and make it 1D
nearest_centroids = nearest_centroids[:,1].flatten()

# simply update the initial assignment table for the specific datapoints
assignments[assignments==c] = nearest_centroids

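The assignments array now contains no value equal to c. Note that this may leave a "hole" when plotting or otherwise post-processing the results, since there will be a cluster with no points assigned to it. To avoid that, simply subtract 1 from every index greater than c: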

assignments = np.array([i-1 if i>c else i for i in assignments])

If you also want to remove the centroid:

centroids = np.delete(centroids, c, axis=0) # remove row from numpy array by index
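The steps in this answer can be combined into a single helper. The following is only a sketch under stated assumptions: the function name drop_densest_cluster, its return values, and the small make_blobs demo are hypothetical. It also deviates slightly from the code above by picking, for each point, whichever of its two nearest centroids is not c, which guards against the rare case where a tie or rounding puts c in the second column.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors


def drop_densest_cluster(X, kmeans, k=5):
    """Reassign the points of the densest centroid, then drop that centroid.

    Returns (assignments, centroids, c), where c is the removed cluster index.
    """
    centroids = kmeans.cluster_centers_
    nn = NearestNeighbors(n_neighbors=k + 1).fit(centroids)
    neigh_dist, _ = nn.kneighbors(centroids)
    densities = k / neigh_dist.sum(axis=1)  # the self-distance adds zero
    c = int(np.argmax(densities))

    assignments = kmeans.predict(X)
    mask = assignments == c
    if mask.any():
        # Two nearest centroids per point; keep the one that is not c.
        two = nn.kneighbors(X[mask], n_neighbors=2, return_distance=False)
        assignments[mask] = np.where(two[:, 0] == c, two[:, 1], two[:, 0])

    # Close the "hole" left by c and remove its row from the centroids.
    assignments = np.where(assignments > c, assignments - 1, assignments)
    centroids = np.delete(centroids, c, axis=0)
    return assignments, centroids, c


# Small demo on hypothetical data: 8 clusters in 2 dimensions.
X_demo, _ = make_blobs(n_samples=300, n_features=2, centers=8, random_state=0)
km_demo = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_demo)
new_assign, new_centroids, removed = drop_densest_cluster(X_demo, km_demo, k=5)
print(removed, new_centroids.shape, np.unique(new_assign))
```

After the call, the labels run from 0 to 6 with no gap, and row i of new_centroids is the centroid for label i.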

Answer 2

You can do it with the Manhattan distance. The formula is d(p, q) = d(q, p) = Sum(|q_i - p_i|)

def ManhattanDistance(x, y):
    # Sum of absolute coordinate differences between x and y
    S = 0
    for i in range(len(x)):
        S += abs(float(x[i]) - float(y[i]))

    return S

And here you can find the centroid at the shortest distance:

def Classify(centroids, item):
    minimum = float("inf")  # smallest distance seen so far
    index = -1
    for i in range(len(centroids)):

        # Find distance from item to centroid i
        dis = ManhattanDistance(item, centroids[i])

        if dis < minimum:
            minimum = dis
            index = i

    return index

And here you can find the matching cluster for each item. I put all the clusters in a dictionary:

def FindClusters(centroids):
    clusters = {}  # maps cluster index -> list of items

    # l is the list of data items, defined elsewhere in the full program
    for item in l:

        # Classify item into a cluster
        index = Classify(centroids, item)

        # Add item to its cluster, creating the list on first use
        clusters.setdefault(index, []).append(item)

    return clusters

This is not my whole code, just a part of it. I hope it helps.
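For larger inputs, the same Manhattan-distance assignment can be written with NumPy broadcasting instead of explicit loops. This is a sketch, not the answerer's code: the function names manhattan_assign and group_by_cluster and the toy data are hypothetical, and the items are passed in explicitly rather than read from a global list.

```python
import numpy as np


def manhattan_assign(items, centroids):
    """Index of the closest centroid for each item, using Manhattan distance."""
    items = np.asarray(items, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Broadcasting yields an (n_items, n_centroids) matrix of L1 distances.
    dists = np.abs(items[:, None, :] - centroids[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)


def group_by_cluster(items, centroids):
    """Dictionary mapping each cluster index to the list of items assigned to it."""
    labels = manhattan_assign(items, centroids)
    clusters = {}
    for item, label in zip(items, labels):
        clusters.setdefault(int(label), []).append(item)
    return clusters


# Toy data (hypothetical): two items near each of two centroids.
toy_items = [[0, 0], [1, 1], [9, 9], [10, 8]]
toy_centers = [[0, 1], [10, 10]]
toy_clusters = group_by_cluster(toy_items, toy_centers)
print(toy_clusters)
```

The distance matrix holds n_items × n_centroids × n_features intermediate values, so for very large datasets it may be preferable to process the items in chunks.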

Answer 1: 1 vote, accepted, 2020-12-27 02:26:12

Answer 2: 0 votes, 2020-12-26 23:04:07
