Using a similarity function for clustering in scikit-learn

Question

I use a function to calculate the similarity between a pair of documents and want to perform clustering using this similarity measure.
Code so far

Sim=np.zeros((n, n)) # create a numpy array
i=0  
j=0       
for i in range(0,n):      
   for j in range(i,n):  
    if i==j:  
        Sim[i][j]=1
     else:    
         Sim[i][j]=simfunction(list_doc[i],list_doc[j]) # calculate similarity between documents i and j using simfunction
Sim=Sim+ Sim.T - np.diag(Sim.diagonal()) # complete the symmetric matrix

AggClusterDistObj=AgglomerativeClustering(n_clusters=num_cluster,linkage='average',affinity="precomputed") 
Res_Labels=AggClusterDistObj.fit_predict(Sim)

My concern is that I used a similarity function here, and as far as I can tell from the documentation the input should be a dissimilarity matrix. How can I convert it to a dissimilarity matrix? Also, what would be a more efficient way to do this?

Answer 1

Please format your code correctly, as indentation matters in Python.
If possible, keep the code complete (you left out an import numpy as np).
Since range starts from zero by default, you can omit the first argument and write range(n).
Indexing in numpy works like [i, j, k, ...].
So instead of Sim[i][j] you actually want to write Sim[i, j], because otherwise you do two operations: first taking the entire row slice and then indexing the column.

Here's another way to copy the elements of the upper triangle to the lower one:
```
import numpy as np

Sim = np.identity(n)  # ones on the diagonal (100 percent similarity)
for i in range(n):
    for j in range(i + 1, n):  # +1 skips the diagonal
        Sim[i, j] = simfunction(list_doc[i], list_doc[j])

# Expand the matrix: copy the upper triangle into the lower one.
# Reading through the transpose pairs each (i, j) below the diagonal
# with its mirror (j, i) above it.
tril = np.tril_indices_from(Sim, -1)  # lower-triangle indices, without diagonal
Sim[tril] = Sim.T[tril]
```
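To check that the matrix really comes out symmetric, it can help to run the loop on a toy input. The sketch below is only an illustration: the jaccard function and the docs list are hypothetical stand-ins for simfunction and list_doc, which the question does not show. It writes each pair to both triangles as it goes, so no separate copy step is needed.

```python
import numpy as np


def jaccard(a, b):
    """Hypothetical stand-in for simfunction: Jaccard similarity of word sets."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def similarity_matrix(docs, simfunction):
    """Build a symmetric similarity matrix, calling simfunction once per pair."""
    n = len(docs)
    sim = np.identity(n)  # a document is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both triangles at once so the result is symmetric.
            sim[i, j] = sim[j, i] = simfunction(docs[i], docs[j])
    return sim


# Made-up documents, for illustration only.
docs = ["red apple pie", "green apple pie", "fast red car", "fast blue car"]
Sim = similarity_matrix(docs, jaccard)
print(np.allclose(Sim, Sim.T))
```

The similarity function is still called n * (n - 1) / 2 times. If that call dominates the run time, the remaining speedup has to come from the function itself rather than from how the matrix is filled.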
Assuming your similarities really lie within the range (0, 1), you can convert your similarity matrix into a distance matrix simply with
dm = 1 - Sim

This operation will be vectorized by numpy.
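Putting the pieces together, a minimal end-to-end sketch might look like the following. The similarity values are made up for illustration. One assumption to be aware of: newer scikit-learn releases name the precomputed-distance parameter metric, while older ones (as in the question) call it affinity, so the sketch picks whichever name the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy similarity matrix (made-up values): documents 0 and 1 are alike,
# and so are documents 2 and 3.
Sim = np.array([
    [1.0, 0.5, 0.2, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.2, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

# Similarity in [0, 1] -> distance in [0, 1]; the diagonal becomes 0.
dm = 1 - Sim

# Use `metric` if this scikit-learn version has it, otherwise `affinity`.
params = inspect.signature(AgglomerativeClustering.__init__).parameters
key = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=2, linkage="average", **{key: "precomputed"}
)
labels = model.fit_predict(dm)
print(labels)
```

With a precomputed matrix the linkage cannot be 'ward', which needs the raw feature vectors; 'average', as used in the question, works with precomputed distances.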

Answer 1 (5, accepted, 2014-10-02 07:02:14)
