[英]How to traverse a tree from sklearn AgglomerativeClustering?
[英]Plot dendrogram using sklearn.AgglomerativeClustering
我正在嘗試使用AgglomerativeClustering
提供的children_
屬性構建樹狀圖,但到目前為止我不走運。 我不能使用scipy.cluster
因為在提供合並聚類scipy
缺乏一些選項,對我很重要(如指定集群的量的選項)。 如果有任何建議,我將不勝感激。
import sklearn.cluster
clstr = cluster.AgglomerativeClustering(n_clusters=2)
clusterer.children_
前段時間我遇到了完全相同的問題。 我設法繪制該死的樹狀圖的方法是使用軟件包ete3 。 這個包能夠靈活地繪制具有各種選項的樹。 唯一的困難是要轉換sklearn
的children_
輸出到Newick樹格式,可以閱讀和理解ete3
。 此外,我需要手動計算樹突的跨度,因為該信息未隨children_
提供。 這是我使用的代碼片段。 它計算 Newick 樹,然后顯示ete3
樹數據結構。 有關如何繪圖的更多詳細信息,請查看此處
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import ete3
def build_Newick_tree(children,n_leaves,X,leaf_labels,spanner):
"""
build_Newick_tree(children,n_leaves,X,leaf_labels,spanner)
Get a string representation (Newick tree) from the sklearn
AgglomerativeClustering.fit output.
Input:
children: AgglomerativeClustering.children_
n_leaves: AgglomerativeClustering.n_leaves_
X: parameters supplied to AgglomerativeClustering.fit
leaf_labels: The label of each parameter array in X
spanner: Callable that computes the dendrite's span
Output:
ntree: A str with the Newick tree representation
"""
return go_down_tree(children,n_leaves,X,leaf_labels,len(children)+n_leaves-1,spanner)[0]+';'
def go_down_tree(children,n_leaves,X,leaf_labels,nodename,spanner):
"""
go_down_tree(children,n_leaves,X,leaf_labels,nodename,spanner)
Iterative function that traverses the subtree that descends from
nodename and returns the Newick representation of the subtree.
Input:
children: AgglomerativeClustering.children_
n_leaves: AgglomerativeClustering.n_leaves_
X: parameters supplied to AgglomerativeClustering.fit
leaf_labels: The label of each parameter array in X
nodename: An int that is the intermediate node name whos
children are located in children[nodename-n_leaves].
spanner: Callable that computes the dendrite's span
Output:
ntree: A str with the Newick tree representation
"""
nodeindex = nodename-n_leaves
if nodename<n_leaves:
return leaf_labels[nodeindex],np.array([X[nodeindex]])
else:
node_children = children[nodeindex]
branch0,branch0samples = go_down_tree(children,n_leaves,X,leaf_labels,node_children[0])
branch1,branch1samples = go_down_tree(children,n_leaves,X,leaf_labels,node_children[1])
node = np.vstack((branch0samples,branch1samples))
branch0span = spanner(branch0samples)
branch1span = spanner(branch1samples)
nodespan = spanner(node)
branch0distance = nodespan-branch0span
branch1distance = nodespan-branch1span
nodename = '({branch0}:{branch0distance},{branch1}:{branch1distance})'.format(branch0=branch0,branch0distance=branch0distance,branch1=branch1,branch1distance=branch1distance)
return nodename,node
def get_cluster_spanner(aggClusterer):
"""
spanner = get_cluster_spanner(aggClusterer)
Input:
aggClusterer: sklearn.cluster.AgglomerativeClustering instance
Get a callable that computes a given cluster's span. To compute
a cluster's span, call spanner(cluster)
The cluster must be a 2D numpy array, where the axis=0 holds
separate cluster members and the axis=1 holds the different
variables.
"""
if aggClusterer.linkage=='ward':
if aggClusterer.affinity=='euclidean':
spanner = lambda x:np.sum((x-aggClusterer.pooling_func(x,axis=0))**2)
elif aggClusterer.linkage=='complete':
if aggClusterer.affinity=='euclidean':
spanner = lambda x:np.max(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2))
elif aggClusterer.affinity=='l1' or aggClusterer.affinity=='manhattan':
spanner = lambda x:np.max(np.sum(np.abs(x[:,None,:]-x[None,:,:]),axis=2))
elif aggClusterer.affinity=='l2':
spanner = lambda x:np.max(np.sqrt(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2)))
elif aggClusterer.affinity=='cosine':
spanner = lambda x:np.max(np.sum((x[:,None,:]*x[None,:,:]))/(np.sqrt(np.sum(x[:,None,:]*x[:,None,:],axis=2,keepdims=True))*np.sqrt(np.sum(x[None,:,:]*x[None,:,:],axis=2,keepdims=True))))
else:
raise AttributeError('Unknown affinity attribute value {0}.'.format(aggClusterer.affinity))
elif aggClusterer.linkage=='average':
if aggClusterer.affinity=='euclidean':
spanner = lambda x:np.mean(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2))
elif aggClusterer.affinity=='l1' or aggClusterer.affinity=='manhattan':
spanner = lambda x:np.mean(np.sum(np.abs(x[:,None,:]-x[None,:,:]),axis=2))
elif aggClusterer.affinity=='l2':
spanner = lambda x:np.mean(np.sqrt(np.sum((x[:,None,:]-x[None,:,:])**2,axis=2)))
elif aggClusterer.affinity=='cosine':
spanner = lambda x:np.mean(np.sum((x[:,None,:]*x[None,:,:]))/(np.sqrt(np.sum(x[:,None,:]*x[:,None,:],axis=2,keepdims=True))*np.sqrt(np.sum(x[None,:,:]*x[None,:,:],axis=2,keepdims=True))))
else:
raise AttributeError('Unknown affinity attribute value {0}.'.format(aggClusterer.affinity))
else:
raise AttributeError('Unknown linkage attribute value {0}.'.format(aggClusterer.linkage))
return spanner
clusterer = AgglomerativeClustering(n_clusters=2,compute_full_tree=True) # You can set compute_full_tree to 'auto', but I left it this way to get the entire tree plotted
clusterer.fit(X) # X for whatever you want to fit
spanner = get_cluster_spanner(clusterer)
newick_tree = build_Newick_tree(clusterer.children_,clusterer.n_leaves_,X,leaf_labels,spanner) # leaf_labels is a list of labels for each entry in X
tree = ete3.Tree(newick_tree)
tree.show()
對於那些願意離開 Python 並使用強大的 D3 庫的人來說,使用d3.cluster()
(或者,我猜是d3.tree()
)API 來實現一個不錯的、可定制的結果並不是特別困難。
有關演示,請參閱jsfiddle 。
幸運的是children_
數組很容易作為 JS 數組運行,唯一的中間步驟是使用d3.stratify()
將其轉換為分層表示。 具體來說,我們需要每個節點都有一個id
和一個parentId
:
var N = 272; // Your n_samples/corpus size.
var root = d3.stratify()
.id((d,i) => i + N)
.parentId((d, i) => {
var parIndex = data.findIndex(e => e.includes(i + N));
if (parIndex < 0) {
return; // The root should have an undefined parentId.
}
return parIndex + N;
})(data); // Your children_
由於findIndex
行,您最終在這里至少有 O(n^2) 行為,但在您的 n_samples 變得巨大之前這可能無關緊要,在這種情況下,您可以預先計算更有效的索引。
除此之外,它幾乎是d3.cluster()
的d3.cluster()
。 請參閱 mbostock 的規范塊或我的 JSFiddle。
注意對於我的用例,僅顯示非葉節點就足夠了; 可視化樣本/葉子有點棘手,因為這些可能並不都明確地在children_
數組中。
來自官方文檔:
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
def plot_dendrogram(model, **kwargs):
# Create linkage matrix and then plot the dendrogram
# create the counts of samples under each node
counts = np.zeros(model.children_.shape[0])
n_samples = len(model.labels_)
for i, merge in enumerate(model.children_):
current_count = 0
for child_idx in merge:
if child_idx < n_samples:
current_count += 1 # leaf node
else:
current_count += counts[child_idx - n_samples]
counts[i] = current_count
linkage_matrix = np.column_stack([model.children_, model.distances_,
counts]).astype(float)
# Plot the corresponding dendrogram
dendrogram(linkage_matrix, **kwargs)
iris = load_iris()
X = iris.data
# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model = model.fit(X)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode='level', p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
請注意,當前(從 scikit-learn v0.23 開始)僅在使用distance_threshold
參數調用 AgglomerativeClustering 時才有效,但從 v0.24 開始,您將能夠通過將compute_distances
設置為 true 來強制計算距離( 請參閱 nightly build文檔)。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.