In NLP, using tf-idf, how to find the frequency of a specific word in a corpus (containing a large number of documents) in Python

How can I find the frequency of an individual word in the corpus using tf-idf? Below is my sample code; I now want to print the frequency of a word. How can I achieve this?

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(X.toarray())
print(vectorizer.vocabulary_.get('document'))

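Note that vectorizer.vocabulary_ does not hold counts: it maps each word to its column index in X.

print(vectorizer.vocabulary_)

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

The counts themselves are in X. Sum each column to get the total number of occurrences of each word, then use vocabulary_ to find the column of the word you want. For 'document' this gives 3 occurrences out of 20 tokens, i.e. a relative frequency of 0.15:

import numpy as np
vocab = vectorizer.vocabulary_
counts = np.asarray(X.sum(axis=0)).ravel()  # total occurrences per word, without densifying X
tot = counts.sum()
frequency = {w: counts[i] / tot for w, i in vocab.items()}
print(frequency['document'])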
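
Since the question mentions tf-idf: a tf-idf weight is not a frequency but a per-document score, so a word has one value per document rather than a single corpus-wide number. Below is a minimal sketch that returns both for one word, assuming default settings for CountVectorizer and TfidfVectorizer. The helper name word_stats is made up for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]


def word_stats(corpus, word):
    """Hypothetical helper: raw count, relative frequency and
    per-document tf-idf weights of `word` in `corpus`."""
    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(corpus)  # sparse document-term counts

    # vocabulary_ maps a word to its column index (None if unseen)
    col = count_vec.vocabulary_.get(word)
    if col is None:
        return 0, 0.0, np.zeros(len(corpus))

    total = int(counts[:, col].sum())     # occurrences across all documents
    rel_freq = total / counts.sum()       # share of all tokens in the corpus

    # Same tokenization defaults, but values are tf-idf weights
    tfidf_vec = TfidfVectorizer()
    tfidf = tfidf_vec.fit_transform(corpus)
    weights = tfidf[:, tfidf_vec.vocabulary_[word]].toarray().ravel()

    return total, rel_freq, weights


count, freq, weights = word_stats(corpus, 'document')
print(count, freq)
print(weights)
```

weights holds one entry per document and is zero for the third document, which does not contain the word. Only single columns are densified, so this should stay cheap on a large corpus.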
