Interpretation of Spark MLLib LDA results

I ran LDA in Spark on a set of documents and observed that the values of topicsMatrix, which represents the topic distribution over terms, are greater than 1, e.g. 548.2201, 685.2436, 138.4013... What do these values mean? Are they the logarithmic values of the distribution, or something else? How can I convert these values to probability distribution values? Thanks in advance.

In both models (i.e. DistributedLDAModel and LocalLDAModel) the topicsMatrix method will, I believe, return (approximately; there is a bit of regularization due to the Dirichlet prior on topics) the expected word-topic count matrix. To check this you can take that matrix and sum up all of its columns. The resulting vector (of length vocabulary-size) should be approximately equal to the word counts over all your documents. In any case, to obtain the topics (probability distributions over the words in your dictionary) you need to normalize the columns of the matrix returned by topicsMatrix so that each sums to 1.

I haven't tested this out fully, but something like the following should work to normalize the columns of the matrix returned by topicsMatrix:

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

def normalizeColumns(m: Matrix): DenseMatrix = {
  // m.toArray is column-major, which is the layout Breeze's DenseMatrix
  // constructor expects. (Matrices.toBreeze is package-private in MLlib,
  // so build the Breeze matrix from the public toArray instead.)
  val bm = new BDM(m.numRows, m.numCols, m.toArray)
  // Accumulate the column sums as a row vector.
  val columnSums = BDV.zeros[Double](bm.cols).t
  var i = bm.rows
  while (i > 0) { i -= 1; columnSums += bm(i, ::) }
  // Divide each column by its sum so that each column sums to 1.
  i = bm.cols
  while (i > 0) { i -= 1; bm(::, i) /= columnSums(i) }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}
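The same column normalization can be sketched without the Spark and Breeze dependencies. Below is a minimal, dependency-free version operating directly on a column-major array (the layout Spark's DenseMatrix and topicsMatrix use); the object name NormalizeColumnsDemo and the 3-word, 2-topic count matrix are made up for illustration:

```scala
object NormalizeColumnsDemo {
  // Normalize each column of a column-major (rows x cols) array so it sums to 1.
  def normalizeColumns(data: Array[Double], rows: Int, cols: Int): Array[Double] = {
    val out = new Array[Double](data.length)
    var j = 0
    while (j < cols) {
      var sum = 0.0
      var i = 0
      while (i < rows) { sum += data(j * rows + i); i += 1 }
      i = 0
      while (i < rows) { out(j * rows + i) = data(j * rows + i) / sum; i += 1 }
      j += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 3-word x 2-topic expected-count matrix, column-major.
    val counts = Array(548.2201, 685.2436, 138.4013, 100.0, 250.0, 650.0)
    val probs = normalizeColumns(counts, rows = 3, cols = 2)
    // After normalization, each column (topic) sums to approximately 1.
    println(probs.grouped(3).map(_.sum).mkString(", "))
  }
}
```

Each normalized column is then a probability distribution over the vocabulary for one topic, which is what the raw counts like 548.2201 were missing.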

Normalize the columns of the matrix returned by topicsMatrix in pure Scala:

def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): scala.Predef.Map[String, Array[Double]] = {

  // The incoming word-topic matrix is in column-major order and the columns are unnormalized.
  val m = wordTopMat.numRows
  val n = wordTopMat.numCols
  val columnSums: Array[Double] = Range(0, n).map(j => Range(0, m).map(i => wordTopMat(i, j)).sum).toArray

  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))

  wordProbs.zipWithIndex.map({ case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }).toMap
}
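To see the shape of the result without a Spark dependency, here is a small sketch of the same computation on a plain column-major array: sum each column, then divide each word's row by the per-topic column sums and key the result by the word. The object name FormatDemo, the 2x2 counts, and the two-word vocabulary map are hypothetical:

```scala
object FormatDemo {
  // data: m words x n topics, column-major, unnormalized expected counts.
  // Returns word -> per-topic probabilities for that word's row.
  def wordTopicProbs(data: Array[Double], m: Int, n: Int,
                     wordMap: Map[Int, String]): Map[String, Array[Double]] = {
    val columnSums = Array.tabulate(n)(j => (0 until m).map(i => data(j * m + i)).sum)
    (0 until m).map { i =>
      val row = Array.tabulate(n)(j => data(j * m + i) / columnSums(j))
      wordMap(i) -> row
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val counts = Array(1.0, 3.0, 2.0, 2.0)       // 2 words x 2 topics, column-major
    val wordMap = Map(0 -> "spark", 1 -> "lda")  // hypothetical vocabulary
    val probs = wordTopicProbs(counts, 2, 2, wordMap)
    // Per-topic probabilities for the word "spark".
    println(probs("spark").mkString(", "))
  }
}
```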

https://github.com/apache/incubator-spot/blob/v1.0-incubating/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala#L237
