
How to get topic probabilities from the LDA model using gensim?

import gensim

# `tokens` is a gensim Dictionary built from the tokenized `texts`
data1 = [tokens.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus=data1, id2word=tokens, num_topics=10, random_state=100,
    update_every=1, chunksize=10, passes=10, alpha='auto',
    per_word_topics=True)
print(*ldamodel.print_topics(), sep="\n")
lda = ldamodel[data1]
l = [ldamodel.get_document_topics(item) for item in data1]
print(l)

When executing get_document_topics(), it produces an output of hundreds of lines (as shown in the picture). I don't know what it means. I actually want the probabilities of the topics. Which method should I use to get the topic probabilities?

[Screenshot: output of get_document_topics()]

Those are the topic probabilities. Your line...

l=[ldamodel.get_document_topics(item) for item in data1]

...essentially says, "give me a list, where each entry in that list is the topic-probabilities for the corresponding entry in data1".

So, the very first item in that returned list...

[(0, 0.974673)]

...means that your very first document is assigned a 97.4673% chance of being in topic #0.
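If you'd rather work with a fixed-length probability vector per document, here's a minimal sketch, reusing l and the num_topics=10 from the question. (Note the sparse pairs omit topics below a small probability cutoff, so missing topics are filled with 0.0.)

num_topics = 10
dense = []
for doc_topics in l:
    # start with 0.0 for every topic, then fill in the reported pairs
    row = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = prob
    dense.append(row)
print(dense[0])  # e.g. [0.974673, 0.0, ..., 0.0]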

If you want the probabilities for just a single document, say the document in slot 6, you'd instead run:

doc_6_topics = ldamodel.get_document_topics(data1[6])

So your existing code is already reporting the per-document topic probabilities. If your true need is "How do I get these into another format for another purpose?", you should edit/expand your question with more details about why the existing return value doesn't meet your needs, what would meet them, and what you're trying to do next.
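One common variant: by default, get_document_topics() drops topics whose probability falls below a small cutoff, which is why most documents show only one or two pairs. A minimal sketch forcing the full distribution for one document (minimum_probability is a standard gensim parameter; the variable names are reused from the question):

# ask for every topic's probability, not just those above the default cutoff
doc_6_full = ldamodel.get_document_topics(data1[6], minimum_probability=0.0)
for topic_id, prob in doc_6_full:
    print(f"topic {topic_id}: {prob:.4f}")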

Separate notes:

  • It'd be better to share the raw, formatted text of what you're seeing rather than screenshots – see some reasons here

  • It's a little concerning that, in the excerpt of output shown, your early documents all wind up in topic #0. If in fact your training data is "clumpy", with all related documents in a row, it can be helpful to shuffle them before model training, so that documents of any particular topic might appear anywhere, instead of "all at the front" or "all at the back" – see the sketch after this list
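A minimal sketch of that shuffle step, assuming texts is the list of tokenized documents from the question (the seed is only there to make the shuffle reproducible):

import random

random.seed(100)       # reproducible shuffle; any seed (or none) works
random.shuffle(texts)  # break up runs of same-topic documents
data1 = [tokens.doc2bow(text) for text in texts]  # rebuild the corpus in shuffled order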
