
Compare document pairs within corpus using TF-IDF - Python

I managed to compute the TF-IDF matrix using the following code:

from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                             min_df=0.2, stop_words='english',
                             use_idf=True, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(paragraphs) #fit the vectorizer to paragraphs

However, I would now like to compare the similarity of different paragraphs. My end result should look like this:

Pair# | Paragraph1 | Paragraph2 | Similarity score
1     | xyz        | xyz        | 30.2%
2     | xyz        | xyz        | 22.3%
3     | xyz        | xyz        | 4.3%

How can I use the TF-IDF matrix to compare the different paragraph pairs?

Assuming each element of paragraphs is a string, each row of tfidf_matrix is a numeric vector representing the corresponding paragraph. A common metric for measuring the similarity between vectors (and tf-idf weight vectors in particular) is cosine similarity. One useful implementation is scikit-learn's cosine_similarity function, which accepts matrices as inputs.
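For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal plain-Python sketch of that definition; the function name cosine and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, not part of scikit-learn:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention chosen for this sketch: a zero vector is similar to nothing.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Orthogonal vectors score 0; vectors pointing the same way score 1.
print(cosine([1, 0], [0, 1]))
print(cosine([1, 2], [2, 4]))
```

Since tf-idf weights are non-negative, the score for two paragraphs falls between 0 and 1. In practice you would not loop over rows like this, because the library function works on the whole matrix at once.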

So presumably you could do:

from sklearn.metrics.pairwise import cosine_similarity
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

Cell (i, j) of cosine_sim_matrix is the similarity score between paragraphs i and j.
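To get from that matrix to the pair table in the question, one option is to walk over every unordered pair i < j (the matrix is symmetric, and the diagonal only compares a paragraph with itself) and sort by score. This is a rough sketch rather than the only way to do it; rank_pairs, format_pairs, the width parameter, and the toy data are all hypothetical names and values introduced here:

```python
from itertools import combinations

def rank_pairs(sim_matrix, paragraphs):
    """Return (i, j, score) for every pair i < j, highest score first."""
    n = len(paragraphs)
    pairs = [(i, j, float(sim_matrix[i][j]))
             for i, j in combinations(range(n), 2)]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

def format_pairs(pairs, paragraphs, width=20):
    """Render the ranked pairs as lines in the layout from the question."""
    lines = ["Pair# | Paragraph1 | Paragraph2 | Similarity score"]
    for rank, (i, j, score) in enumerate(pairs, start=1):
        # Truncate each paragraph to `width` characters to keep rows short.
        lines.append(f"{rank} | {paragraphs[i][:width]} | "
                     f"{paragraphs[j][:width]} | {score:.1%}")
    return lines

# Toy stand-in for cosine_sim_matrix; any 2-D array indexable as m[i][j] works.
toy_paragraphs = ["first text", "second text", "third text"]
toy_sim = [[1.0, 0.302, 0.043],
           [0.302, 1.0, 0.223],
           [0.043, 0.223, 1.0]]

pairs = rank_pairs(toy_sim, toy_paragraphs)
for line in format_pairs(pairs, toy_paragraphs):
    print(line)
```

With the real data you would call rank_pairs(cosine_sim_matrix, paragraphs), since cosine_similarity returns a dense array by default. Note that the number of pairs grows quadratically with the number of paragraphs, so for a large corpus you may want to keep only the top few pairs.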
