
Calculating TF-IDF Score of a Single String

I do string matching using TF-IDF and cosine similarity, and it works well for finding the similarity between strings in a list of strings.

Now I want to match a new string against the previously calculated matrix. I calculate the TF-IDF scores using the code below.

from sklearn.feature_extraction.text import TfidfVectorizer

# ngrams is a custom analyzer function whose definition is not shown here.
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)

How can I calculate the TF-IDF score of a new string against the previous matrix? I could append the new string to the series and recalculate the matrix as below, but that is inefficient: I only want the last row of the matrix and don't need the rows for the old series to be recalculated.

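# Inefficient: refits the vectorizer on every string just to get one row.
list_string = list(list_string) + [new_string]
single_matrix = vectorizer.fit_transform(list_string)
single_matrix = single_matrix[len(list_string) - 1:]  # keep only the last row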

After reading about TF-IDF calculation for a while, I am thinking of saving the IDF value of each term and manually calculating the TF-IDF of the new string without using the matrix, but I don't know how to do that. How can I do this? Or is there a better way?

Refitting the TF-IDF vectorizer just to score a single entry is not the way to go; simply apply the .transform() method of the already fitted vectorizer to your new string (not to the whole matrix):

# transform() expects an iterable of documents, so wrap the single string in a list
single_entry = vectorizer.transform([new_string])
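To make the round trip concrete, here is a minimal sketch. It assumes a small made-up corpus and a hypothetical character-trigram `ngrams` analyzer standing in for the asker's, since that function's definition is not shown in the question. The fitted vectorizer transforms the new string, and the resulting row is compared to the stored matrix with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ngrams(text, n=3):
    """Hypothetical analyzer: split a string into lowercase character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Made-up example data, not from the question.
list_string = ["Acme Corporation", "Acme Corp", "Globex Industries"]
new_string = "ACME Corp."

# Fit once on the existing strings.
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)

# Reuse the fitted vocabulary and IDF weights; nothing is refitted here.
single_entry = vectorizer.transform([new_string])

# One row of similarities: the new string vs. every string in the old matrix.
similarities = cosine_similarity(single_entry, tf_idf_matrix)
best_match = list_string[similarities.argmax()]
```

Note that n-grams the vectorizer never saw during fitting (here the trailing `rp.`) are silently ignored by `transform()`, so a new string made up entirely of unseen n-grams would come back as an all-zero row.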

See the scikit-learn docs for TfidfVectorizer.transform.
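The idea from the question of saving the IDF values and computing the score by hand is, as far as I can tell, essentially what `.transform()` already does internally. Below is a rough sketch of that manual calculation. It assumes the scikit-learn defaults (`sublinear_tf=False`, `norm="l2"`), a made-up corpus, and the default word-level analyzer rather than the asker's `ngrams` function:

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus; the default word-level analyzer keeps the sketch short.
corpus = ["red apple", "green apple", "red car"]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

new_string = "red red apple truck"

# Raw term counts for the new string; unseen terms ("truck") are dropped.
analyze = vectorizer.build_analyzer()
counts = Counter(t for t in analyze(new_string) if t in vectorizer.vocabulary_)

# tf * idf per known term, using the IDF weights stored at fit time.
manual = np.zeros(len(vectorizer.vocabulary_))
for term, tf in counts.items():
    col = vectorizer.vocabulary_[term]
    manual[col] = tf * vectorizer.idf_[col]

# The default norm="l2" scales each row to unit length.
manual /= np.linalg.norm(manual)

# Compare with what transform() computes for the same string.
from_transform = vectorizer.transform([new_string]).toarray()[0]
matches = np.allclose(manual, from_transform)
```

Since the fitted vectorizer already stores `idf_` and `vocabulary_`, doing this by hand is mostly useful for checking your understanding; in practice calling `.transform()` is simpler and returns a sparse row directly.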


 