How do you get the frequency of the terms returned by tfidf.get_feature_names_out()?

Question

After fitting a TfidfVectorizer, I am looking at the features it generated:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

But I would also like to get the frequency of each term.

Answer 1

One way to "count the number of sentences in which a particular word appears" is to use sklearn.feature_extraction.text.CountVectorizer.

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

from sklearn.feature_extraction.text import CountVectorizer

# since we're counting sentences and not words, use binary=True
cv = CountVectorizer(binary=True)

X = cv.fit_transform(corpus)

print(cv.vocabulary_)  # all the words in the corpus with their column index
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

# show occurrences (not count) of vocabulary words in sentences (each line/row) in corpus
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 1 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]

# So, for example the word "this" is at column index 8 in the matrix above

# How many sentences in the corpus have the word "this"?
print(X[:, cv.vocabulary_["this"]].sum())  # sum the sparse column directly
# 4

# How many sentences in the corpus have the word "document"?
print(X[:, cv.vocabulary_["document"]].sum())  # sum the sparse column directly
# 3
