How can I get the dominant topic for all documents?
I created an LDA model that identifies 15 topics. When I run the code to get the dominant topic for all the documents, it gives me 10 topics instead of 15. How can I get the dominant topic for all documents based on the 15 topics of the LDA model?
LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=15,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=20,
                                            alpha="auto",
                                            per_word_topics=True)
Code to find the dominant topic for all documents:
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                    ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents, df1, df2], axis=1)
    return sent_topics_df
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text', 'id', 'datum']
#df_dominant_topic.head(20)
#save
df_dominant_topic.to_csv('data/dominant_topic.csv', sep=',')
Have you tried either initializing the model with minimum_probability=0.0, or explicitly calling get_document_topics() (the method on which […]-indexing relies) with minimum_probability=0.0, so that your topic results aren't clipped to just those with a probability larger than the default minimum_probability=0.01?
Note that show_topic() also has a default parameter topn=10, which will only display the top 10 related words unless you supply a larger value.