
How do you get the frequency of the terms generated by tfidf.get_feature_names_out()?

After fitting with tfidf, I am looking at the generated features:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

But I also want to get the frequency of each term.

One way to count the number of sentences in which a particular word appears is to use sklearn.feature_extraction.text.CountVectorizer:

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

from sklearn.feature_extraction.text import CountVectorizer

# since we're counting sentences and not words, use binary=True
cv = CountVectorizer(binary=True)

X = cv.fit_transform(corpus)

print(cv.vocabulary_)  # all the words in the corpus with their column index
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

# show occurrences (not count) of vocabulary words in sentences (each line/row) in corpus
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 1 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]

# So, for example the word "this" is at column index 8 in the matrix above

# How many sentences in the corpus have the word "this"?
# (sum the 0/1 entries of that word's column)
print(X[:, cv.vocabulary_["this"]].sum())
# 4

# How many sentences in the corpus have the word "document"?
print(X[:, cv.vocabulary_["document"]].sum())
# 3
