Calculating the TF-IDF of a query string over a trained set of documents
I have code that computes the TF-IDF matrix for 150 documents.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

all_lines = []
all_lines_corrected = []

# Drop the first whitespace-separated token of each line and keep the rest.
with open("Extracted Functional Goals - Stemmed.txt") as f:
    for line in f:
        temp = line.split(None, 1)
        all_lines.append(temp[1])

# Strip the last two characters from every line except the final one.
for a in range(len(all_lines) - 1):
    all_lines_corrected.append(all_lines[a][:-2])
all_lines_corrected.append(all_lines[len(all_lines) - 1])

stop_words = stopwords.words('english')
tf = TfidfVectorizer(analyzer='word', stop_words=stop_words)
tfidf_matrix = tf.fit_transform(all_lines_corrected).todense()

query_string = input("Enter string : ")
How do I get the TF-IDF of the query string? (Can we assume it looks like one of the 150 documents the vectorizer was trained on?)
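You can use values = tf.transform([query_string]) to get the tf-idf values of the query string. The result is a sparse matrix with 1 row and N columns, where the columns hold the tf-idf values for the N unique words the vectorizer saw in the training documents.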
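To make the "1 row, N columns" shape concrete, here is a minimal sketch that maps each non-zero column back to its word through the fitted vectorizer's vocabulary_ attribute. The three-sentence corpus and the query are made up for illustration, and the names index_to_term and weights are my own, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

tf = TfidfVectorizer(analyzer='word')
tf.fit(docs)

# One query string in, one sparse row out.
values = tf.transform(["the cat meowed"])
print(values.shape)  # (1, number of unique words seen while fitting)

# vocabulary_ maps term -> column index; invert it to read the columns.
index_to_term = {int(idx): term for term, idx in tf.vocabulary_.items()}

# Collect the non-zero entries of the single row as {term: weight}.
row = values.tocoo()
weights = {index_to_term[int(j)]: float(w) for j, w in zip(row.col, row.data)}

for term in sorted(weights):
    print(term, round(weights[term], 4))
```

Words in the query that never appeared in the training documents ("meowed" here) have no column, so they are dropped silently instead of raising an error.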
A short example, similar to your code:
from sklearn.feature_extraction.text import TfidfVectorizer

all_lines = ["This is an example doc", "Another short example document .", "Just a third example"]

# Learn the vocabulary and idf weights from the training documents.
tf = TfidfVectorizer(analyzer='word')
tfidf_matrix = tf.fit_transform(all_lines)

query_string = "This is a short example string"

# transform() reuses the fitted vocabulary and idf; it does not refit.
print("Query String:")
print(tf.transform([query_string]))

print("Example doc:")
print(tf.transform(["This is a short example doc"]))
Output:
Query String:
(0, 9) 0.546454011634
(0, 7) 0.546454011634
(0, 5) 0.546454011634
(0, 4) 0.32274454218
Example doc:
(0, 9) 0.479527938029
(0, 7) 0.479527938029
(0, 5) 0.479527938029
(0, 4) 0.283216924987
(0, 2) 0.479527938029
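If you want to see where these numbers come from, the sketch below recomputes them in plain Python. It assumes TfidfVectorizer's default settings (lowercasing, tokens of two or more word characters, raw term counts, smoothed idf ln((1 + n) / (1 + df)) + 1, and L2 normalization); the helpers tokenize, fit_idf and tfidf are my own names, not scikit-learn functions.

```python
import math
import re
from collections import Counter

# Tokens of two or more word characters, as in the default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_idf(docs):
    """Smoothed idf per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(text, idf):
    """tf * idf for terms known from fitting, then L2-normalized."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else {}

docs = ["This is an example doc", "Another short example document .", "Just a third example"]
idf = fit_idf(docs)

query = tfidf("This is a short example string", idf)
for term in sorted(query):
    print(term, round(query[term], 6))
```

"example" occurs in all three documents, so its idf is the minimum of 1, which is why column 4 carries the smallest weight. "string" and the one-letter "a" are not in the fitted vocabulary and contribute nothing.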