
Output of lda.collapsed.gibbs.sampler command from R lda package

I don't understand this part of the output from the lda.collapsed.gibbs.sampler command. Why are the counts for the same word different across topics? For example, why does the word "test" have a count of 4 in topic 2 while topic 8 gets 37? Shouldn't the count of the same word in different topics be the same integer, or 0?

Or have I misunderstood something, and these numbers don't stand for the number of occurrences of the word in the topic?

$topics
      tests-loc fail  test testmultisendcookieget
 [1,]         0    0     0                      0
 [2,]         0    0     4                      0
 [3,]         0    0     0                      0
 [4,]         0    1     0                      0
 [5,]         0    0     0                      0
 [6,]         0    0     0                      0
 [7,]         0    0     0                      0
 [8,]         0    0    37                      0
 [9,]         0    0     0                      0
[10,]         0    0     0                      0
[11,]         0    0     0                      0
[12,]         0    2     0                      0
[13,]         0    0     0                      0
[14,]         0    0     0                      0
[15,]         0    0     0                      0
[16,]         0    0     0                      0
[17,]         0    0     0                      0
[18,]         0    0     0                      0
[19,]         0    0     0                      0
[20,]         0    0     0                      0
[21,]         0    0     0                      0
[22,]         0  361  1000                      0
[23,]         0    0     0                      0
[24,]         0    0     0                      0
[25,]         0    0     0                      0
[26,]         0    0     0                      0
[27,]         0    0     0                      0
[28,]         0 1904 12617                      0
[29,]         0    0     0                      0
[30,]         0    0     0                      0
[31,]         0    0     0                      0
[32,]         0 1255  3158                      0
[33,]         0    0     0                      0
[34,]         0    0     0                      0
[35,]         0    0     0                      0
[36,]         1    0     0                      1
[37,]         0    1     0                      0
[38,]         0    0     0                      0
[39,]         0    0     0                      0
[40,]         0    0     0                      0
[41,]         0    0     0                      0
[42,]         0    0     0                      0
[43,]         0    0     0                      0
[44,]         0    0     0                      0
[45,]         0    2     0                      0
[46,]         0    0     0                      0
[47,]         0    0     0                      0
[48,]         0    0     4                      0
[49,]         0    0     0                      0
[50,]         0    1     0                      0

Here is the code that I ran.

library(lda)

# Read documents in LDA-C format and the vocabulary.
data  <- read.documents(filename = "data.ldac")
vocab <- read.vocab(filename = "words.csv")

K <- 100               # number of topics
num.iterations <- 100  # Gibbs sampling sweeps
alpha <- 1             # Dirichlet prior on document-topic distributions
eta <- 1               # Dirichlet prior on topic-word distributions

result <- lda.collapsed.gibbs.sampler(data, K, vocab, num.iterations,
                                      alpha, eta, initial = NULL,
                                      burnin = NULL,
                                      compute.log.likelihood = FALSE,
                                      trace = 0L, freeze.topics = FALSE)

options(max.print = 100000000)
result

PS. Sorry for the long post and my bad English.

The topic distributions in LDA are just that: multinomial distributions. These correspond to the rows of the matrix you have above. The probability of seeing a word in any given topic is not constrained to be a fixed value (or zero) across topics. That is, the word 'test' can have a 3% chance of occurring in one topic and a 1% chance of occurring in another.
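To make that concrete, here is a quick sketch using the result object from the code in the question (the row indices are taken from the output above; indexing a column by the word "test" works because the lda package names the columns of $topics after vocab):

# Each row of result$topics is one topic's word-count vector, so the
# same word can carry a different count in every topic.
result$topics[c(2, 8, 22, 28), "test"]             # 4, 37, 1000, 12617
# Unsmoothed estimate of P('test' | topic 8): its count in that row
# divided by the topic's total number of word assignments.
result$topics[8, "test"] / sum(result$topics[8, ])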

N.B. If you want to convert the matrix to probabilities, just row-normalize and add the smoothing constant from your prior. The function here just returns the raw number of assignments in the last Gibbs sampling sweep.
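A minimal sketch of that conversion, assuming the result and eta objects from the code in the question:

# Add the prior's smoothing constant eta to every count, then divide
# each row by its total so each topic's word probabilities sum to 1.
topic.word.probs <- (result$topics + eta) / rowSums(result$topics + eta)
rowSums(topic.word.probs)  # every row is now a proper distribution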
