Interpretation of Spark MLlib LDA results
I ran LDA on Spark for a set of documents and observed that the values of topicsMatrix, which represents the topic distribution over terms, are greater than 1, for example 548.2201, 685.2436, 138.4013... What do these values mean? Are they logarithms of the distribution, or something else? How can I convert them to probability distribution values? Thanks in advance.
In both models (i.e. DistributedLDAModel and LocalLDAModel) the topicsMatrix method will, I believe, return the expected word-topic count matrix (approximately; there is a bit of regularization due to the Dirichlet prior on topics). To check this, you can take that matrix and sum up each of its columns. The resulting vector (one entry per topic) should hold the expected word count assigned to each topic, so its entries should add up to roughly the total word count over all your documents. In any case, to obtain the topics (probability distributions over the words in your dictionary) you need to normalize the columns of the matrix returned by topicsMatrix so that each one sums to 1.
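Written out, the normalization described above looks like this. The symbols are my own shorthand rather than anything from the Spark documentation: N_{wk} is the entry of topicsMatrix for word w and topic k, V is the vocabulary size, and \hat{\beta}_{wk} is the resulting probability of word w under topic k.

```latex
\hat{\beta}_{wk} = \frac{N_{wk}}{\sum_{w'=1}^{V} N_{w'k}},
\qquad
\sum_{w=1}^{V} \hat{\beta}_{wk} = 1 \quad \text{for every topic } k .
```

For instance, if a topic's column held the made-up counts 6, 3 and 1, the corresponding probabilities would be 0.6, 0.3 and 0.1.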
I haven't tested it out fully, but something like this should work to normalize the columns of the matrix returned by topicsMatrix:
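import breeze.linalg.{DenseMatrix => BDM, sum}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

def normalizeColumns(m: Matrix): DenseMatrix = {
  // Matrix.toArray returns a fresh column-major copy, which is the layout Breeze expects.
  // (Spark's own Matrix-to-Breeze conversion is not part of the public API, so copy instead.)
  val bm = new BDM[Double](m.numRows, m.numCols, m.toArray)
  var j = 0
  while (j < bm.cols) {
    val col = bm(::, j) // view onto column j, i.e. one topic
    col /= sum(col)     // divide in place so the column sums to 1
    j += 1
  }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}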
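To see what the normalization does on concrete numbers, here is a small Spark-free sketch. Everything in it is an illustration rather than part of the answer's code: the object name ColumnNormalizationSketch, the helper normalizeColumnMajor and the 3-word, 2-topic count matrix are all made up, and the flat array is assumed to be in the same column-major layout that Matrix.toArray returns.

```scala
object ColumnNormalizationSketch {

  /** Normalizes each column of a column-major matrix so that it sums to 1.
    * `values` holds numRows * numCols entries, one column after another.
    * A column whose sum is 0 would produce NaN entries; that case is not handled here.
    */
  def normalizeColumnMajor(values: Array[Double], numRows: Int, numCols: Int): Array[Double] = {
    require(values.length == numRows * numCols, "values must have numRows * numCols entries")
    // Sum of each column: entry (i, j) lives at index j * numRows + i.
    val columnSums = Array.tabulate(numCols) { j =>
      (0 until numRows).map(i => values(j * numRows + i)).sum
    }
    // Divide every entry by the sum of the column it belongs to.
    Array.tabulate(values.length)(idx => values(idx) / columnSums(idx / numRows))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical counts: topic 0 = (6, 3, 1), topic 1 = (2, 2, 4).
    val counts = Array(6.0, 3.0, 1.0, 2.0, 2.0, 4.0)
    val probs = normalizeColumnMajor(counts, numRows = 3, numCols = 2)
    // Print one bracketed group per topic.
    println(probs.grouped(3).map(_.mkString("[", ", ", "]")).mkString(" "))
  }
}
```

Each group of numRows consecutive entries in the result is one topic's distribution over the vocabulary, so with a real model the same idea would apply to model.topicsMatrix.toArray together with numRows and numCols.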
Normalize the columns of the matrix returned by topicsMatrix in pure Scala:
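import org.apache.spark.mllib.linalg.Matrix

def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): scala.Predef.Map[String, Array[Double]] = {
  // The incoming word-topic matrix is in column-major order and its columns are unnormalized.
  val m = wordTopMat.numRows // vocabulary size
  val n = wordTopMat.numCols // number of topics
  // Sum of each column, i.e. the total weight of each topic.
  val columnSums: Array[Double] = Range(0, n).map(j => Range(0, m).map(i => wordTopMat(i, j)).sum).toArray
  // Transposing and grouping by n yields one array per word with that word's weight in every topic;
  // dividing by the column sums turns each weight into the word's probability under that topic.
  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))
  // Key each word's per-topic probabilities by the word itself.
  wordProbs.zipWithIndex.map({ case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }).toMap
}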
https://github.com/apache/incubator-spot/blob/v1.0-incubating/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala#L237