Compare document pairs within corpus using TF-IDF - Python
I managed to compute the TF-IDF matrix using the following code:
from sklearn.feature_extraction.text import TfidfVectorizer

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, ngram_range=(1, 3))

# fit the vectorizer to paragraphs
tfidf_matrix = tfidf_vectorizer.fit_transform(paragraphs)
However, I would now like to compare the similarity of different paragraphs. My end result should look like this:
Pair# | Paragraph1 | Paragraph2 | Similarity score
1     | xyz        | xyz        | 30.2%
2     | xyz        | xyz        | 22.3%
3     | xyz        | xyz        | 4.3%
How can I use the TF-IDF matrix to compare the different paragraph pairs?
Assuming that each paragraph in your paragraphs parameter is a string, each row in your tfidf_matrix would be a numeric vector representing that string.
A common metric for measuring the similarity between vectors (and specifically tf-idf weight vectors) is cosine similarity. One useful implementation is the scikit-learn cosine_similarity function, which accepts matrices as inputs.
So presumably you could do:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
Every cell i, j will be the similarity score between paragraphs i and j. The matrix is symmetric, so each distinct pair appears twice, and the diagonal holds each paragraph's similarity with itself.
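To get from that matrix to the pair table the question asks for, one option is to list each unordered pair once and sort by score. The following is a minimal sketch, not a definitive implementation: rank_paragraph_pairs is a hypothetical helper name, and the small hand-written matrix is a made-up stand-in for cosine_sim_matrix so the example runs without scikit-learn.

```python
from itertools import combinations


def rank_paragraph_pairs(sim_matrix, paragraphs):
    """Return (i, j, score) for every unordered pair i < j, most similar first.

    sim_matrix is any square 2D array-like indexed as sim_matrix[i][j],
    for example the dense array returned by cosine_similarity.
    """
    pairs = [
        (i, j, float(sim_matrix[i][j]))
        for i, j in combinations(range(len(paragraphs)), 2)
    ]
    # Highest score first; the diagonal and mirrored cells are never visited.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical stand-ins so the sketch runs on its own; in practice pass
# the original `paragraphs` list and `cosine_sim_matrix`.
paragraphs = ["first paragraph", "second paragraph", "third paragraph"]
cosine_sim_matrix = [
    [1.000, 0.302, 0.043],
    [0.302, 1.000, 0.223],
    [0.043, 0.223, 1.000],
]

ranked = rank_paragraph_pairs(cosine_sim_matrix, paragraphs)

print("Pair# | Paragraph1 | Paragraph2 | Similarity score")
for pair_no, (i, j, score) in enumerate(ranked, start=1):
    print(f"{pair_no} | {paragraphs[i]} | {paragraphs[j]} | {score:.1%}")
```

With real data, calling rank_paragraph_pairs(cosine_sim_matrix, paragraphs) should work unchanged, since cosine_similarity returns a dense array by default. Note that the number of pairs grows quadratically with the number of paragraphs, so for a large corpus you may want to keep only the top few rows of the ranking.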