Get the document name in scikit-learn tf-idf matrix
I have created a tf-idf matrix, and now I want to retrieve the top 2 words for each document. I want to pass a document id and get back the top 2 words for that document.
Right now, I have this sample data:
from sklearn.feature_extraction.text import TfidfVectorizer
d = {'doc1':"this is the first document",'doc2':"it is a sunny day"} ### corpus
test_v = TfidfVectorizer(min_df=1) ### applied the model
t = test_v.fit_transform(d.values())
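feature_names = test_v.get_feature_names() ### list of words/terms (scikit-learn >= 1.0 offers get_feature_names_out() instead; get_feature_names() was removed in 1.2)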
>>> feature_names
['day', 'document', 'first', 'is', 'it', 'sunny', 'the', 'this']
>>> t.toarray()
array([[ 0. , 0.47107781, 0.47107781, 0.33517574, 0. ,
0. , 0.47107781, 0.47107781],
[ 0.53404633, 0. , 0. , 0.37997836, 0.53404633,
0.53404633, 0. , 0. ]])
I can access the matrix by giving the row number, e.g.:
>>> t[0,1]
0.47107781233161794
Is there a way to access this matrix by document id? In my case, 'doc1' and 'doc2'.
Thanks
By doing
t = test_v.fit_transform(d.values())
you lose any link to the document ids. Before Python 3.7 a dict did not guarantee any ordering, so you cannot safely assume which value ends up in which row. This means that before passing the values to the fit_transform function you need to record which value corresponds to which id.
For example, what you can do is:
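counter = 0
values = []   # documents, in the order they are passed to fit_transform
key = {}      # maps document id -> row number in the tf-idf matrix
for k, v in d.items():
    values.append(v)
    key[k] = counter
    counter += 1
t = test_v.fit_transform(values)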
From there you can build a function to access this matrix by document id:
def get_doc_row(docid):
    rowid = key[docid]   # look up the row recorded for this document id
    row = t[rowid, :]    # 1 x n_features sparse row
    return row