Comparing pairs of documents in a corpus using TF-IDF (Python)

Question

I managed to compute the TF-IDF matrix using the following code:

from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                             min_df=0.2, stop_words='english',
                             use_idf=True, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(paragraphs) #fit the vectorizer to paragraphs

However, I would now like to compare the similarity of different paragraphs. My end result should look like this:

Pair# | Paragraph1 | Paragraph2 | Similarity score
1     | xyz        | xyz        | 30.2%
2     | xyz        | xyz        | 22.3%
3     | xyz        | xyz        | 4.3%

How can I use the TF-IDF matrix to compare the different paragraph pairs?

Answer 1

Assuming that each paragraph in your paragraphs list is a string, each row of tfidf_matrix is a numeric vector representing that string. A common metric for measuring the similarity between vectors (and tf-idf weight vectors in particular) is cosine similarity. One useful implementation is scikit-learn's cosine_similarity function, which accepts matrices as input.
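To make the metric concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths. The snippet below is only a pure-Python sketch of that definition; the helper name cosine and the toy vectors are made up for illustration and are not real tf-idf output. Note that TfidfVectorizer L2-normalizes each row by default (norm='l2'), so for its output the cosine similarity reduces to a plain dot product.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector is
        # treated as having similarity 0 with everything.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy weight vectors (hypothetical, not produced by TfidfVectorizer).
a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]

# dot = 1 and both lengths are sqrt(2), so the score is about 0.5.
score = cosine(a, b)
print(f"{score:.3f}")
```

In practice you would not loop like this over a large corpus; the library function works on the whole matrix at once.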

So presumably you could do:

from sklearn.metrics.pairwise import cosine_similarity
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

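Cell (i, j) of cosine_sim_matrix is the similarity score between paragraphs i and j. The matrix is symmetric, and the diagonal only compares each paragraph with itself, so the distinct pairs are the cells with i < j.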
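To get from that matrix to the pair table in the question, one option is to walk over every unordered pair of row indices and sort by score. The following is a rough sketch rather than a definitive implementation: pair_table is a hypothetical helper, and toy_sim and toy_paragraphs are hand-written stand-ins so the snippet runs without scikit-learn.

```python
from itertools import combinations

def pair_table(sim_matrix, texts):
    """Build (pair#, text_i, text_j, percent) rows, highest score first.

    sim_matrix is any square matrix indexable as sim_matrix[i][j],
    for example a nested list or a NumPy array.
    """
    # Visit each unordered pair once (i < j); this skips the diagonal
    # and the mirrored lower triangle.
    scored = [
        (float(sim_matrix[i][j]), i, j)
        for i, j in combinations(range(len(texts)), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [
        (number, texts[i], texts[j], round(100 * score, 1))
        for number, (score, i, j) in enumerate(scored, start=1)
    ]

# Hand-written stand-in for a 3x3 cosine similarity matrix.
toy_sim = [
    [1.0, 0.302, 0.043],
    [0.302, 1.0, 0.223],
    [0.043, 0.223, 1.0],
]
toy_paragraphs = ["paragraph A", "paragraph B", "paragraph C"]

rows = pair_table(toy_sim, toy_paragraphs)
print("Pair# | Paragraph1 | Paragraph2 | Similarity score")
for number, first, second, percent in rows:
    print(f"{number} | {first} | {second} | {percent}%")
```

With the real data you would call pair_table(cosine_sim_matrix, paragraphs) instead. The number of pairs grows quadratically with the number of paragraphs, so for a large corpus you may want to keep only the top few rows.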

Answer 1
0 2018-05-29 00:52:13
