TF-IDF Matrix In Python
My code to calculate TF-IDF for a corpus goes like this:
# TfidfVectorizer is the class used below; it combines CountVectorizer and TfidfTransformer.
from sklearn.feature_extraction.text import TfidfVectorizer
train_set = "i have a ball", "he is good", "she played well"
vectorizer = TfidfVectorizer(min_df=1)
train_array = vectorizer.fit_transform(train_set).toarray()
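# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())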
print(train_array)
The output I receive is:
['ball', 'good', 'have', 'he', 'is', 'played', 'she', 'well']
[[0.70710678, 0., 0.70710678, 0., 0., 0., 0., 0.],
[0., 0.57735027, 0., 0.57735027, 0.57735027, 0., 0., 0.],
[0., 0., 0., 0., 0., 0.57735027, 0.57735027, 0.57735027]]
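As a sanity check on the numbers above, the matrix can be reproduced by hand. The sketch below assumes scikit-learn's default settings as I understand them (smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, raw term counts, L2 row normalization, and tokens of two or more word characters, which is why "i" and "a" do not appear in the vocabulary). The helper names tokenize and tfidf_row are made up for illustration.

```python
import math
import re

train_set = ("i have a ball", "he is good", "she played well")

def tokenize(text):
    # Assumption: mimic the default tokenizer, i.e. lowercase words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in train_set]
vocab = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = {t: sum(t in doc for doc in docs) for t in vocab}
# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    # Raw count times IDF, then scale the row to unit Euclidean length.
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

manual = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in manual:
    print([round(w, 8) for w in row])
```

Every term here occurs in exactly one document, so all IDF values are equal and each row reduces to 1/sqrt(k) for its k distinct terms: 1/sqrt(2) for the first sentence and 1/sqrt(3) for the other two.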
The question is: how can I calculate the TF-IDF of the sentence "she is good"? The corpus is the train_set in the above code.
You simply apply your fitted TF-IDF vectorizer to the new data with the .transform method:
In [16]: test = ["she is good"]
In [17]: test_array = vectorizer.transform(test)
In [18]: test_array.toarray()  # dense copy; the older .A shorthand is gone from recent SciPy sparse matrices
Out[18]: array([[0., 0.57735027, 0., 0., 0.57735027, 0., 0.57735027, 0.]])
In [19]: vectorizer.get_feature_names_out().tolist()
Out[19]: ['ball', 'good', 'have', 'he', 'is', 'played', 'she', 'well']