
Topic modeling - running LDA in sklearn: how to compute the word cloud?

I trained an LDA model in sklearn to build a topic model, but I have no idea how to compute the keyword word cloud for each of the obtained topics.

Here is my LDA model:

vectorizer = CountVectorizer(analyzer='word',
                             min_df=3,                          # ignore words appearing in fewer than 3 documents
                             max_df=6000,                       # ignore words appearing in more than 6000 documents
                             stop_words='english',
                             lowercase=False,
                             token_pattern='[a-zA-Z0-9]{3,}',   # keep tokens of 3+ alphanumeric characters
                             max_features=50000,
                            )
data_vectorized = vectorizer.fit_transform(data_lemmatized) # data_lemmatized is all my processed document text

best_lda_model = LatentDirichletAllocation(batch_size=128, doc_topic_prior=0.1,
                      evaluate_every=-1, learning_decay=0.7,
                      learning_method='online', learning_offset=10.0,
                      max_doc_update_iter=100, max_iter=10,
                      mean_change_tol=0.001, n_components=10, n_jobs=None,
                      perp_tol=0.1, random_state=None, topic_word_prior=0.1,
                      total_samples=1000000.0, verbose=0)

best_lda_model.fit(data_vectorized)                  # fit the model before transforming
lda_output = best_lda_model.transform(data_vectorized)

I know that best_lda_model.components_ gives the topic-word weights and that vectorizer.get_feature_names() gives all the words in the vocabulary, but I am not sure how to combine the two into a word cloud per topic.

Many thanks in advance!

You have to iterate through the model's components_ attribute, which has shape [n_components, n_features]: the first dimension indexes the topics and the second holds the score of each word in the vocabulary for that topic. So you first need to find the indices of the most relevant words for each topic, and then use the vocab list, obtained from get_feature_names(), to retrieve the words themselves.

import numpy as np

# vocabulary: index -> word, aligned with the columns of components_
vocab = vectorizer.get_feature_names()

# dictionary to store the top words for each topic, and how many words to retrieve
words = {}
n_top_words = 10

for topic, component in enumerate(best_lda_model.components_):
    # [::-1] sorts the word scores in descending order
    indices = np.argsort(component)[::-1][:n_top_words]

    # store the words most relevant to the topic
    words[topic] = [vocab[i] for i in indices]
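
To get an actual word cloud per topic rather than just a list of words, one option is to feed each topic's word weights into the third-party wordcloud package. This is a minimal sketch, assuming wordcloud and matplotlib are installed, and reusing vocab, n_top_words and best_lda_model from the snippet above:

from wordcloud import WordCloud   # third-party package: pip install wordcloud
import matplotlib.pyplot as plt

for topic, component in enumerate(best_lda_model.components_):
    # indices of the n_top_words highest-weighted words for this topic
    indices = np.argsort(component)[::-1][:n_top_words]

    # word -> weight mapping, which generate_from_frequencies expects
    frequencies = {vocab[i]: component[i] for i in indices}

    wc = WordCloud(background_color='white', width=800, height=400)
    wc.generate_from_frequencies(frequencies)

    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('Topic {}'.format(topic))
    plt.show()

The size of each word in the cloud is then proportional to its weight in that topic's row of components_.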
