
How can I get the dominant topic for all documents?

I created an LDA model that identifies 15 topics. When I run the code to get the dominant topic for all the documents, it gives me 10 topics instead of 15. How can I get the dominant topic for every document based on all 15 topics of the LDA model?

LDA model
    import gensim

    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=15,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=20,
                                                alpha="auto",
                                                per_word_topics=True)

Code to find the dominant topic for all documents:

import pandas as pd

def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Collect one row per document: dominant topic, its weight, and its keywords
    rows = []
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # Sort this document's topics by probability, highest first
        row = sorted(row, key=lambda x: x[1], reverse=True)
        # The first entry is the dominant topic
        topic_num, prop_topic = row[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])

    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the original text to the end of the output; df1 and df2 (defined
    # elsewhere in the script) carry the 'id' and 'datum' columns used below
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents, df1, df2], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text', 'id', 'datum']
#df_dominant_topic.head(20)

#save
df_dominant_topic.to_csv('data/dominant_topic.csv', sep=',')

Have you tried either initializing the model with minimum_probability=0.0, or explicitly calling get_document_topics() (the method on which […]-indexing relies) with minimum_probability=0.0, so that your topic results aren't clipped to just those with a probability larger than the default minimum_probability=0.01?
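Concretely, the model from the question could be initialized with the threshold disabled, or the full distribution can be requested per document without retraining. A sketch, assuming corpus and id2word exist as in the question:

```python
# Sketch: same settings as in the question, plus minimum_probability=0.0
# so every document's topic list always contains all 15 topics.
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=15,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=20,
                                            alpha="auto",
                                            minimum_probability=0.0,
                                            per_word_topics=True)

# Or, without retraining, request the unclipped distribution for one document:
doc_topics = lda_model.get_document_topics(corpus[0], minimum_probability=0.0)
```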

Note that show_topic() also has a default parameter topn=10, which will only return the top 10 related words unless you supply a larger value.
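Once the per-document distribution is no longer clipped, the dominant topic is simply the highest-probability pair. A minimal, self-contained sketch of that selection step, where the made-up doc_topics list stands in for the output of get_document_topics(bow, minimum_probability=0.0):

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Stand-in for a full, unclipped distribution over (here) 4 topics;
# with minimum_probability=0.0 the real list would cover all 15.
doc_topics = [(0, 0.05), (1, 0.62), (2, 0.03), (3, 0.30)]

topic_id, prob = dominant_topic(doc_topics)
print(topic_id, prob)  # 1 0.62
```

After selecting topic_id this way, show_topic(topic_id, topn=20) would return the 20 strongest words for that topic instead of the default 10.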
