
Calculating TF-IDF Score of a Single String

I do string matching using TF-IDF and cosine similarity, and it works well for finding similar strings within a list of strings.

Now I want to match a new string against the previously calculated matrix. I calculate the TF-IDF scores using the code below.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)

How can I calculate the TF-IDF score of a new string against the previous matrix? I could append the new string to the series and recalculate the matrix as shown below, but that is inefficient: I only need the last row of the matrix and don't need the rows for the old series to be recalculated.

list_string = list_string.append(new_string)
single_matrix = vectorizer.fit_transform(list_string)
single_matrix = single_matrix[len(list_string) - 1:]

After reading about how TF-IDF is calculated, I am thinking of saving the IDF value of each term and manually calculating the TF-IDF of the new string without using the matrix, but I don't know how to do that. How can I do this? Or is there a better way?

Refitting the TF-IDF vectorizer just to score a single entry is not the way to go. Simply apply the .transform() method of the already fitted vectorizer to your new string (not to the whole matrix). Note that .transform() expects an iterable of documents rather than a bare string, so wrap the new string in a list:

single_entry = vectorizer.transform([new_string])
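For illustration, here is a minimal end-to-end sketch of this approach. The `ngrams` function is a hypothetical stand-in for the analyzer used in the question (which is not shown), and the sample strings are made up; the cosine similarity step assumes the goal is to find the closest previously fitted string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ngrams(text, n=3):
    """Hypothetical analyzer: split a string into character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Made-up sample data standing in for the question's list_string.
list_string = ["apple inc", "apple incorporated", "banana corp"]
new_string = "apple inc."

# Fit once on the existing strings; this learns the vocabulary and IDF weights.
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)

# Transform only the new string (wrapped in a list) with the stored weights.
# N-grams that were not seen during fitting are ignored.
single_entry = vectorizer.transform([new_string])

# Compare the new row against every row of the existing matrix.
scores = cosine_similarity(single_entry, tf_idf_matrix).ravel()
best_match = list_string[scores.argmax()]
```

The result of `.transform()` is a sparse matrix with a single row and the same columns as `tf_idf_matrix`, so the two can be compared directly without touching the original rows.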

See the scikit-learn docs for TfidfVectorizer.transform.
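As for the idea of saving the IDF values: a fitted vectorizer already stores them in its `idf_` attribute, and `.transform()` reuses them. The sketch below, assuming the default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2 normalization) and made-up sample documents, reproduces the calculation by hand and checks that it matches.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; default settings are assumed throughout.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

new_doc = "the cat cat barked"

# Count the terms of the new document that exist in the fitted vocabulary.
counts = np.zeros(len(vectorizer.vocabulary_))
for token in vectorizer.build_analyzer()(new_doc):
    idx = vectorizer.vocabulary_.get(token)
    if idx is not None:  # unseen terms such as "barked" are skipped
        counts[idx] += 1

# Multiply term counts by the stored IDF weights, then L2-normalize.
manual = counts * vectorizer.idf_
manual /= np.linalg.norm(manual)

# The vectorizer's own result for the same document.
auto = vectorizer.transform([new_doc]).toarray().ravel()
matches = np.allclose(manual, auto)
```

In practice there is little reason to do this by hand, since `.transform()` performs the same steps and returns a sparse result.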
