How can I extract the distances between points within a dendrogram in Python?

Question

I am performing hierarchical clustering in Python and have obtained the dendrogram plot. Is there a way to extract the distances between the closest points? For example, in my plot: the distance between 7 and 8 (the closest pair), then the distance between 0 and 1, and so on. To produce the plot I used:

from scipy.cluster.hierarchy import linkage, dendrogram

linkage_matrix = linkage(dfP, method="single")

cluster_dict = dendrogram(linkage_matrix)

Answer 1

When you do

Z = hierarchy.linkage(X, method='single')

the matrix Z contains everything you need. Each row describes one merge: the first cluster, the second cluster, the distance between them, and the number of elements in the newly formed cluster.

For example

import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
import seaborn as sns

X = np.array([662., 877., 255., 412., 996., 295., 468., 268.,
                   400., 754., 564., 138., 219., 869., 669.])

Z = hierarchy.linkage(X, method='single')
plt.figure()
dn = hierarchy.dendrogram(Z)

and we have Z

array([[  2.,   5., 138.,   2.],
       [  3.,   4., 219.,   2.],
       [  0.,   7., 255.,   3.],
       [  1.,   8., 268.,   4.],
       [  6.,   9., 295.,   6.]])

Since we have only 6 elements, indices 0 to 5 are single elements; indices from 6 onward are the clusters created by each row of Z:

6 is the first cluster (2,5) of 2 elements
7 is the second cluster (3,4) of 2 elements
8 is the third cluster (0,7), i.e. (0,(3,4)) of 3 elements
9 is the fourth cluster (1,8), i.e. (1,(0,(3,4))) of 4 elements

Finally, the last row merges (6,9), i.e. ((2,5),(1,(0,(3,4)))), into a cluster of 6 elements.
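The distances asked about in the question can be read straight from the rows of Z. The following is a minimal sketch, assuming the same X as in the example; the name `merges` is only illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy

# Condensed distances for 6 elements (same X as in the example)
X = np.array([662., 877., 255., 412., 996., 295., 468., 268.,
              400., 754., 564., 138., 219., 869., 669.])
Z = hierarchy.linkage(X, method='single')

# Each row of Z is one merge: (cluster a, cluster b, distance, new cluster size)
merges = [(int(a), int(b), float(dist), int(size)) for a, b, dist, size in Z]

# Rows are ordered by increasing merge distance, so the closest pair comes first
for a, b, dist, size in merges:
    print(f"{a} + {b}: distance {dist:.0f}, {size} elements")
```

Indices below 6 refer to original elements and indices from 6 up refer to earlier merges, as described above.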

clusters = {
    0: '0',
    1: '1',
    2: '2',
    3: '3',
    4: '4',
    5: '5',
    6: '2,5',
    7: '3,4',
    8: '0,3,4',
    9: '1,0,3,4',
}

Now we can build a DataFrame to display the merge distances as a heatmap:

# init the DataFrame (float dtype so the heatmap receives numeric data)
df = pd.DataFrame(
    columns=Z[:,0].astype(int),
    index=Z[:,1].astype(int),
    dtype=float
)

df.columns = df.columns.map(clusters)
df.index = df.index.map(clusters)

# populate the diagonal
for i, d in enumerate(Z[:,2]):
    df.iloc[i, i] = d

# fill NaN
df.fillna(0, inplace=True)
# mask everything but diagonal
mask = np.ones(df.shape, dtype=bool)
np.fill_diagonal(mask, 0)

# plot the heatmap
sns.heatmap(df, 
            annot=True, fmt='.0f', cmap="YlGnBu", 
            mask=mask)
plt.show()
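The heatmap shows only the distance at each merge. If what is wanted is the distance between any two original elements as read off the dendrogram (the height at which they first end up in the same cluster), SciPy's `hierarchy.cophenet` computes it. A minimal sketch, assuming the same X as in the example; the name `coph` is only illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Condensed distances for 6 elements (same X as in the example)
X = np.array([662., 877., 255., 412., 996., 295., 468., 268.,
              400., 754., 564., 138., 219., 869., 669.])
Z = hierarchy.linkage(X, method='single')

# cophenet(Z) returns the cophenetic distances in condensed form;
# squareform expands them into a symmetric 6x6 matrix
coph = squareform(hierarchy.cophenet(Z))

# Height at which elements 2 and 5 are joined in the dendrogram
print(coph[2, 5])
# Elements 0 and 4 are joined only when 0 merges with cluster (3,4)
print(coph[0, 4])
```

With single linkage, the cophenetic distance between two elements is the distance of the merge that first puts them in the same cluster, so it can be larger or smaller than their original distance in X.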

update

I defined X as an array of distances (a condensed distance matrix). These are the values of the strictly (nilpotent) lower triangular part of the matrix of distances between elements, read column by column.

We can verify

# number of elements
n = (np.sqrt(8 * X.size + 1) + 1) / 2
n
6.0

We have n = 6 elements, and here is the strictly (nilpotent) lower triangular matrix of distances:

# init the DataFrame (float dtype so the heatmap receives numeric data)
df = pd.DataFrame(columns=range(int(n)), index=range(int(n)), dtype=float)
# populate the lower triangle, column by column
idx = 0
for c in range(int(n)-1):
    for r in range(c+1, int(n)):
        df.iloc[r, c] = X[idx]
        idx += 1
# fill NaNs and mask the upper triangle (including the diagonal)
df.fillna(0, inplace=True)
mask = np.zeros(df.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# plot the matrix
sns.heatmap(df, annot=True, fmt='.0f', cmap="YlGnBu", mask=mask)
plt.show()
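The same square matrix can also be obtained without an explicit loop. As a minimal sketch, assuming X is a valid condensed distance array, `scipy.spatial.distance.squareform` expands it into the full symmetric matrix, from which the number of elements follows directly; the name `D` is only illustrative.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Condensed distances for 6 elements (same X as in the example)
X = np.array([662., 877., 255., 412., 996., 295., 468., 268.,
              400., 754., 564., 138., 219., 869., 669.])

# Expand to a symmetric matrix with zeros on the diagonal
D = squareform(X)

# Number of elements is the side length of the square matrix
n = D.shape[0]
print(n)

# Distance between elements 2 and 5, the closest pair in this data
print(D[2, 5])
```

squareform raises an error if the length of X does not correspond to a whole number of elements, which doubles as a check on the formula for n.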

update 2

Here is how to populate the cluster-name dictionary automatically for the diagonal matrix of cluster distances.

First we have to calculate the number of elements, as we saw earlier (this formula is needed only if X is an array of distances):

# number of elements
n = (np.sqrt(8 * X.size + 1) + 1) / 2

Then we can loop through the Z matrix to populate the dictionary:

# clusters of single elements
clusters = {i: str(i) for i in range(int(n))}
# loop through Z matrix
for i, z in enumerate(Z.astype(int)):
    # cluster number
    cluster_num = int(n+i)
    # names of the two clusters being merged (already strings)
    cluster_names = [clusters[z[0]], clusters[z[1]]]
    # update the dictionary
    clusters[cluster_num] = ','.join(cluster_names)

and we have

clusters

{0: '0',
 1: '1',
 2: '2',
 3: '3',
 4: '4',
 5: '5',
 6: '2,5',
 7: '3,4',
 8: '0,3,4',
 9: '1,0,3,4',
 10: '2,5,1,0,3,4'}
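When X holds observations rather than distances (as with dfP in the question), the formula for n does not apply. A linkage matrix for n elements always has n - 1 rows, so n can be taken from Z itself. The following is a minimal sketch under that assumption, reusing the example X only to have something runnable.

```python
import numpy as np
from scipy.cluster import hierarchy

# Condensed distances for 6 elements (same X as in the example)
X = np.array([662., 877., 255., 412., 996., 295., 468., 268.,
              400., 754., 564., 138., 219., 869., 669.])
Z = hierarchy.linkage(X, method='single')

# A linkage matrix for n original elements has n - 1 rows
n = Z.shape[0] + 1

# Start with the single elements, then add one entry per merge
clusters = {i: str(i) for i in range(n)}
for i, (a, b) in enumerate(Z[:, :2].astype(int)):
    clusters[n + i] = f"{clusters[int(a)]},{clusters[int(b)]}"

print(clusters[n + len(Z) - 1])  # members of the final, all-inclusive cluster
```

Because n comes from Z, the same loop works whether linkage was called on a condensed distance array or on a matrix of observations.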
