
Use sklearn to find string similarity between two texts given a large group of documents

Given a large set of documents (book titles, for example), how can I compare two book titles that are not in the original set of documents, without recomputing the entire TF-IDF matrix?

For example,

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["The blue eagle has landed",
         "I will fly the eagle to the moon",
         "This is not how You should fly",
         "Fly me to the moon and let me sing among the stars",
         "How can I fly like an eagle",
         "Fixing cars and repairing stuff",
         "And a bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles) 

To check the similarity between the first and the second book titles, one would do

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

and so on. Here the TF-IDF weights are calculated with respect to all the entries in the matrix, so each token's weight depends on how many documents in the whole corpus contain it.
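
As a rough illustration of that dependence on the corpus, the learned IDF weights can be inspected after fitting. This is only a sketch: the `idf` dictionary is a name introduced here for illustration, while `idf_` and `vocabulary_` are attributes of a fitted TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

book_titles = ["The blue eagle has landed",
               "I will fly the eagle to the moon",
               "This is not how You should fly",
               "Fly me to the moon and let me sing among the stars",
               "How can I fly like an eagle",
               "Fixing cars and repairing stuff",
               "And a bottle of rum"]

vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
vectorizer.fit(book_titles)

# Map each token to its learned IDF weight (idf is a hypothetical helper name)
idf = {token: vectorizer.idf_[col] for token, col in vectorizer.vocabulary_.items()}

# "eagle" occurs in three titles and "rum" in only one,
# so "rum" should receive the larger IDF weight
print(idf["eagle"], idf["rum"])
```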

Now suppose two titles that are not in the original set of book titles, title1 and title2, should be compared. They can be added to the book_titles collection and compared afterwards, so that the word "rum", for example, is weighted taking its occurrences in the previous corpus into account:

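title1 = "The book of rum"
title2 = "Fly safely with a bottle of rum"
book_titles.extend([title1, title2])  # list.append() accepts only one item
tfidf_matrix = vectorizer.fit_transform(book_titles)  # refits on the whole corpus
index = tfidf_matrix.shape[0]  # shape is an attribute, not a method
# the two new titles are the last two rows of the matrix
cosine_similarity(tfidf_matrix[index-2:index-1], tfidf_matrix[index-1:index])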

which is really impractical and very slow if the set of documents grows very large or has to be stored out of memory. What can be done in this case? If I compare only title1 and title2 with each other, the previous corpus will not be used.

Why do you append them to the list and recompute everything? The vectorizer is already fitted on the corpus, so just transform the new titles with it:

new_vectors = vectorizer.transform([title1, title2])
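
For completeness, here is a minimal end-to-end sketch of that approach, assuming the same corpus and vectorizer settings as in the question. The variable name `similarity` is introduced here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

book_titles = ["The blue eagle has landed",
               "I will fly the eagle to the moon",
               "This is not how You should fly",
               "Fly me to the moon and let me sing among the stars",
               "How can I fly like an eagle",
               "Fixing cars and repairing stuff",
               "And a bottle of rum"]

# Fit once: this learns the vocabulary and the IDF weights from the corpus
vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
vectorizer.fit(book_titles)

title1 = "The book of rum"
title2 = "Fly safely with a bottle of rum"

# transform() reuses the fitted vocabulary and IDF weights; nothing is refitted
new_vectors = vectorizer.transform([title1, title2])

# Compare the two new titles with each other
similarity = cosine_similarity(new_vectors[0:1], new_vectors[1:2])[0, 0]
print(similarity)
```

Note that transform() ignores words that were not seen during fitting ("book" and "safely" here), and the IDF weights stay frozen at the values learned from the original corpus. If the new titles should also influence the weights, the vectorizer has to be refitted.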

