Getting document names in a scikit-learn tf-idf matrix

Question

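I have created a tf-idf matrix, and now I want to retrieve the top 2 words for each document. I want to pass in a document ID and get back its top 2 words.

Right now, I have this sample data: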

from sklearn.feature_extraction.text import TfidfVectorizer

d = {'doc1':"this is the first document",'doc2':"it is a sunny day"} ### corpus

test_v = TfidfVectorizer(min_df=1)    ### applied the model
t = test_v.fit_transform(d.values())
feature_names = test_v.get_feature_names() ### list of words/terms

>>> feature_names
['day', 'document', 'first', 'is', 'it', 'sunny', 'the', 'this']

>>> t.toarray()
array([[ 0.        ,  0.47107781,  0.47107781,  0.33517574,  0.        ,
     0.        ,  0.47107781,  0.47107781],
   [ 0.53404633,  0.        ,  0.        ,  0.37997836,  0.53404633,
     0.53404633,  0.        ,  0.        ]])

I can access the matrix by giving a row number, e.g.:

 >>> t[0,1]
   0.47107781233161794

Is there a way to access this matrix by document ID instead? In my case, 'doc1' and 'doc2'.

Thanks

Answer 1

By doing

t = test_v.fit_transform(d.values())

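you lose any link to the document IDs. A dict is not ordered (insertion order is only guaranteed from Python 3.7 onward), so you don't know which value was passed in at which position. This means that before passing the values to fit_transform, you need to record which value corresponds to which ID.

For example, what you can do is: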

counter = 0
values = []
key = {}


for k,v in d.items():
    values.append(v)
    key[k] = counter
    counter+=1

t = test_v.fit_transform(values)

From there, you can build a function to access this matrix by document ID:

def get_doc_row(docid):
    rowid = key[docid]
    row = t[rowid,:]
    return row
