
lda.collapsed.gibbs.sampler model and top words ranking

I have a model generated by the function lda.collapsed.gibbs.sampler from the lda package, and I need to know the "relevance" of the top words. When I call

    top.topic.words(result$topics, 10, by.score=TRUE)

I get a list of the top 10 words for each topic, but I'd like to see what percentage of the topic those 10 words represent. I guess the information exists, because there is a "score", but I'm not really familiar with the statistical methods behind the Gibbs sampler.

Thanks in advance!

I think something like this may be what you want:

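    # For each topic (row of result$topics), print the cumulative share of
    # that topic's assignments covered by its highest-count words.
    for (ii in seq_len(nrow(result$topics))) {
      print(
        head(
          cumsum(
            # word counts for topic ii, largest first
            sort(result$topics[ii, ], decreasing = TRUE)
          ),
          n = 20  # number of top words to show; use 10 for a top-10 list
        ) / result$topic_sums[ii]  # divide by the topic's total assignments
      )
    }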

Let's break it down. If what you want is the fraction of Gibbs assignments, that is easy. The LDA routine returns the number of assignments to each (word, topic) pair in result$topics, with one row per topic. So all you have to do is sort each row of result$topics in decreasing order to get the top words (this is essentially what top.topic.words does if you set by.score=FALSE). Once a row is sorted, you can compare, for each topic, the count for a given word against the count for the entire topic. To do that I divide by result$topic_sums, which contains the total number of assignments to each topic. Finally, I use cumsum so you can see the running total weight of the top words in that topic: the value at position 10 is the fraction of the topic covered by its top 10 words.
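To make the arithmetic concrete, here is a minimal sketch on a made-up count matrix. The matrix `topics`, the vector `topic_sums`, and the helper `top_word_coverage` are illustrative assumptions standing in for `result$topics` and `result$topic_sums`; none of them come from the lda package, and the counts are invented.

```r
# Toy stand-in for result$topics: one row per topic, one column per word,
# entries are numbers of Gibbs assignments (invented for illustration).
topics <- matrix(
  c(50, 30, 15,  5,
    10, 10, 60, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("apple", "banana", "cherry", "date"))
)

# Stand-in for result$topic_sums: total assignments per topic.
topic_sums <- rowSums(topics)

# Hypothetical helper (not part of the lda package): for each topic, the
# cumulative fraction of assignments covered by its n highest-count words.
top_word_coverage <- function(topics, topic_sums, n = 10) {
  lapply(seq_len(nrow(topics)), function(k) {
    sorted <- sort(topics[k, ], decreasing = TRUE)  # largest counts first
    head(cumsum(sorted), n) / topic_sums[k]         # running share of topic k
  })
}

coverage <- top_word_coverage(topics, topic_sums, n = 3)
print(coverage)
```

For the first toy topic the running shares are 0.5, 0.8 and 0.95, so its top three words account for 95% of its assignments. Note that this ranks words by raw count, as by.score=FALSE does. Words selected with by.score=TRUE may differ, so their shares would need to be looked up by name rather than read off the sorted row.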
