
What should be given as input to the linkage function - the tfidf matrix or the similarity between different elements of the tfidf matrix?

I have the following python notebook which aims to cluster different groups of abstracts based on the similarity between their text. I have two approaches here: the first is to pass the tfidf numpy array of the documents directly to the linkage function, and the second is to compute the similarity between the tfidf arrays of the different documents and then use that similarity matrix for clustering. I am unable to understand which one is correct.

Approach 1:

I used cosine_similarity to compute the similarity matrix (cosine) of the tfidf matrix. I then converted the redundant square matrix (cosine) into the condensed distance matrix (distance_matrix) using the squareform function. distance_matrix was fed into the linkage function, and I plotted the result as a dendrogram.

Approach 2:

I passed the tfidf numpy array, in dense form, directly into the linkage function and plotted the dendrogram.

My question is: which one is correct? From the data, as far as I can understand, approach 2 seems to be correct, but to me approach 1 makes sense. It would be great if someone could explain what is right in this scenario. Thanks in advance.

Let me know if anything remains unclear in the question.

import pandas, numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

###Data Cleaning

stop_words = stopwords.words('english')
tokenizer = RegexpTokenizer(r'\w+')
df = pandas.read_csv('WIPO_CSV.csv')


import sys
reload(sys)                        ##Python 2 only: reload() and
sys.setdefaultencoding('utf8')     ##setdefaultencoding() no longer exist in Python 3


documents_no_stopwords=[]

def preprocessing(word):
    tokens = tokenizer.tokenize(word)

    processed_words = []
    for w in tokens:
        if w in stop_words:
            continue
        else:
            processed_words.append(w)

    ##this step builds the list of text documents with stopwords removed
    documents_no_stopwords.append(' '.join(processed_words))

for text in df['TEXT'].tolist():
    preprocessing(text)

###Converting into tfidf form
##latin1 is used because the utf8 decoder was having trouble with the text

vectoriser = TfidfVectorizer(encoding='latin1')

##the tfidf matrix returned here is row-normalised (and sparse, not a plain numpy array)

tfidf_documents = vectoriser.fit_transform(documents_no_stopwords)


##Cosine similarity, since the input to linkage should be a distance vector

from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform

cosine = cosine_similarity(tfidf_documents)   ##note: these are similarities, not distances
distance_matrix = squareform(cosine, force='tovector', checks=False)

from scipy.cluster.hierarchy import dendrogram, linkage

##Linkage based on tfidf of each document
z_num = linkage(tfidf_documents.todense(), 'ward')

z_num  #tfidf

array([[11.        , 12.        ,  0.        ,  2.        ],
       [18.        , 19.        ,  0.        ,  2.        ],
       [20.        , 31.        ,  0.        ,  3.        ],
       [21.        , 32.        ,  0.        ,  4.        ],
       [22.        , 33.        ,  0.        ,  5.        ],
       [17.        , 34.        ,  0.38208619,  6.        ],
       [15.        , 28.        ,  1.19375843,  2.        ],
       [ 6.        ,  9.        ,  1.24241258,  2.        ],
       [ 7.        ,  8.        ,  1.27069483,  2.        ],
       [13.        , 37.        ,  1.28868301,  3.        ],
       [ 4.        , 24.        ,  1.30850122,  2.        ],
       [36.        , 39.        ,  1.32090275,  5.        ],
       [10.        , 16.        ,  1.32602346,  2.        ],
       [27.        , 38.        ,  1.32934025,  3.        ],
       [23.        , 25.        ,  1.32987072,  2.        ],
       [ 3.        , 29.        ,  1.35143582,  2.        ],
       [ 5.        , 14.        ,  1.35401753,  2.        ],
       [26.        , 42.        ,  1.35994878,  3.        ],
       [ 2.        , 45.        ,  1.40055438,  3.        ],
       [ 0.        , 40.        ,  1.40811825,  3.        ],
       [ 1.        , 46.        ,  1.41383622,  3.        ],
       [44.        , 50.        ,  1.4379821 ,  5.        ],
       [41.        , 43.        ,  1.44575227,  8.        ],
       [48.        , 51.        ,  1.45876241,  8.        ],
       [49.        , 53.        ,  1.47130328, 11.        ],
       [47.        , 52.        ,  1.49944936, 11.        ],
       [54.        , 55.        ,  1.69814818, 22.        ],
       [30.        , 56.        ,  1.91299937, 24.        ],
       [35.        , 57.        ,  3.1967033 , 30.        ]])

from matplotlib import pyplot as plt

plt.figure(figsize=(25, 10))
dn = dendrogram(z_num)
plt.show()

Linkage based on similarity

z_sim = linkage(distance_matrix, 'ward')

z_sim  #cosine similarity

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 2.00000000e+00],
       [2.00000000e+00, 3.00000000e+01, 0.00000000e+00, 3.00000000e+00],
       [1.70000000e+01, 3.10000000e+01, 0.00000000e+00, 4.00000000e+00],
       [3.00000000e+00, 4.00000000e+00, 0.00000000e+00, 2.00000000e+00],
       [1.00000000e+01, 3.30000000e+01, 0.00000000e+00, 3.00000000e+00],
       [5.00000000e+00, 7.00000000e+00, 0.00000000e+00, 2.00000000e+00],
       [6.00000000e+00, 1.80000000e+01, 0.00000000e+00, 2.00000000e+00],
       [1.10000000e+01, 1.90000000e+01, 0.00000000e+00, 2.00000000e+00],
       [1.20000000e+01, 2.00000000e+01, 0.00000000e+00, 2.00000000e+00],
       [8.00000000e+00, 2.40000000e+01, 0.00000000e+00, 2.00000000e+00],
       [1.60000000e+01, 2.10000000e+01, 0.00000000e+00, 2.00000000e+00],
       [2.20000000e+01, 2.70000000e+01, 0.00000000e+00, 2.00000000e+00],
       [9.00000000e+00, 2.90000000e+01, 0.00000000e+00, 2.00000000e+00],
       [2.60000000e+01, 4.20000000e+01, 0.00000000e+00, 3.00000000e+00],
       [1.40000000e+01, 3.40000000e+01, 3.97089886e-03, 4.00000000e+00],
       [2.30000000e+01, 4.40000000e+01, 1.81733052e-02, 5.00000000e+00],
       [3.20000000e+01, 3.50000000e+01, 2.14592323e-02, 6.00000000e+00],
       [2.50000000e+01, 4.00000000e+01, 2.84944415e-02, 3.00000000e+00],
       [1.30000000e+01, 4.70000000e+01, 5.02045376e-02, 4.00000000e+00],
       [4.10000000e+01, 4.30000000e+01, 5.10902795e-02, 5.00000000e+00],
       [3.70000000e+01, 4.50000000e+01, 5.40176402e-02, 7.00000000e+00],
       [3.80000000e+01, 3.90000000e+01, 6.15118462e-02, 4.00000000e+00],
       [1.50000000e+01, 4.60000000e+01, 7.54874869e-02, 7.00000000e+00],
       [2.80000000e+01, 5.00000000e+01, 9.55487454e-02, 8.00000000e+00],
       [5.20000000e+01, 5.30000000e+01, 3.86911095e-01, 1.50000000e+01],
       [4.90000000e+01, 5.40000000e+01, 4.16693529e-01, 2.00000000e+01],
       [4.80000000e+01, 5.50000000e+01, 4.58764920e-01, 2.40000000e+01],
       [3.60000000e+01, 5.60000000e+01, 5.23422380e-01, 2.60000000e+01],
       [5.10000000e+01, 5.70000000e+01, 5.49419077e-01, 3.00000000e+01]])

from matplotlib import pyplot as plt

plt.figure(figsize=(25, 10))
dn = dendrogram(z_sim)
plt.show()

The accuracy of the clustering is compared with this photo: https://drive.google.com/file/d/1EgkPqwh7AKhGqOe1zf9KNjSMxPQ9Xfd9/view?usp=sharing

The dendrograms that I got are available at the following notebook link: https://drive.google.com/file/d/1TB7aFK4lPDo43GY74FPOqVOx1AxWV-A_/view?usp=sharing (open this html using an internet browser).

Scipy only supports distances for HAC, not similarities.

Once the similarities are properly converted into distances, the results should be the same. So there is no "right" or "wrong". A minimal sketch of the conversion follows.
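
For illustration, a hedged sketch of that conversion, reusing tfidf_documents from the question; note it uses 'average' linkage rather than the question's 'ward', since ward assumes Euclidean distances:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

similarity = cosine_similarity(tfidf_documents)  ##similarities in [0, 1] for non-negative tfidf
distance = 1.0 - similarity                      ##convert similarity into a distance
np.fill_diagonal(distance, 0.0)                  ##zero the diagonal against floating-point noise
condensed = squareform(distance, checks=False)   ##the condensed (linearized) form linkage expects
z = linkage(condensed, 'average')                ##'average' is valid for arbitrary distances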

At some point you need the distance matrix in linearized form. It is probably most efficient to use a) a method that can process sparse data (avoiding any todense call), and b) one that directly produces the linearized form, rather than generating the entire square matrix and then dropping half of it.
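
As a sketch of point b), scipy's pdist produces the condensed cosine-distance vector directly, without keeping the full square matrix; note it requires dense input, so it does not also satisfy point a). For sparse input, sklearn.metrics.pairwise_distances with metric='cosine' avoids the todense call but returns the square matrix:

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

##condensed cosine distances straight from the (densified) tfidf matrix
condensed = pdist(tfidf_documents.toarray(), metric='cosine')
z = linkage(condensed, 'average')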
