How to convert TfidfVectorizer() output into a DataFrame

Question

I found an answer about a model and its specific outputs (How to get top n terms with highest tf-idf score - Big sparse matrix). It works great. I would like to know how to put the printed results into a DataFrame:

'''
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus'
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

top_n = 3

print('tf_idf scores: \n', sorted(list(zip(vectorizer.get_feature_names_out(), 
                                             X.sum(0).getA1())), 
                                 key=lambda x: x[1], reverse=True)[:top_n])
# tf_idf scores : 
# [('document', 1.4736296010332683), ('check', 0.6227660078332259), ('like', 0.6227660078332259)]

print('idf values: \n', sorted(list(zip(feature_array,vectorizer.idf_,)),
       key = lambda x: x[1], reverse=True)[:top_n])

# idf values: 
#  [('aim', 1.6931471805599454), ('capture', 1.6931471805599454), ('check', 1.6931471805599454)]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names_out()
print('Frequency: \n', sorted(list(zip(vectorizer.get_feature_names_out(), 
                                         X.sum(0).getA1())),
                            key=lambda x: x[1], reverse=True)[:top_n])
'''

Thanks in advance!

Answer 1

The following gives you a DataFrame containing the tf_idf, idf, and frequency of each word, sorted by the tf_idf statistic in descending order.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus'
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

count_vectorizer = CountVectorizer(stop_words='english')
count_X = count_vectorizer.fit_transform(corpus)

# Look up each word's total count by name, so the two vocabularies need not share the same order
count_by_word = dict(zip(count_vectorizer.get_feature_names_out(), count_X.sum(0).getA1()))
frequencies = [count_by_word[w] for w in vectorizer.get_feature_names_out()]

df = pd.DataFrame({'word': vectorizer.get_feature_names_out(),
                   'tf_idf': X.sum(0).getA1(),
                   'idf': vectorizer.idf_,
                   'freqs': frequencies}).set_index('word').sort_values('tf_idf', ascending=False)
print(df)

# Prints:
            tf_idf       idf  freqs
word                               
document  1.473630  1.287682      2
check     0.622766  1.693147      1
like      0.622766  1.693147      1
aim       0.447214  1.693147      1
capture   0.447214  1.693147      1
corpus    0.447214  1.693147      1
key       0.447214  1.693147      1
words     0.447214  1.693147      1

If you only want the top n words by tf_idf, you can do this:

top_n = 3
print(df[:top_n])

# Prints:
            tf_idf       idf  freqs
word                               
document  1.473630  1.287682      2
check     0.622766  1.693147      1
like      0.622766  1.693147      1
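
Slicing with `df[:top_n]` relies on the DataFrame already being sorted. As an alternative sketch, `DataFrame.nlargest` picks the top rows by a column regardless of the current order. The small `df` below is a hypothetical stand-in built from a few rows of the output above, deliberately left unsorted.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built above, in unsorted order
df = pd.DataFrame(
    {'tf_idf': [0.447214, 0.622766, 1.473630, 0.447214],
     'idf': [1.693147, 1.693147, 1.287682, 1.693147],
     'freqs': [1, 1, 2, 1]},
    index=pd.Index(['aim', 'check', 'document', 'capture'], name='word'),
)

top_n = 2
# nlargest returns the top_n rows with the highest tf_idf, sorted descending
top = df.nlargest(top_n, 'tf_idf')
print(top)
```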
