Interpretation of Spark MLLib LDA results

I ran LDA in Spark on a set of documents and observed that the values of topicsMatrix, which represents the topic distribution over terms, are greater than 1, e.g. 548.2201, 685.2436, 138.4013... What do these values mean? Are they the logarithmic values of the distribution, or something else? How can I convert these values to probability distribution values? Thanks in advance.

In both models (i.e. DistributedLDAModel and LocalLDAModel) the topicsMatrix method will, I believe, return (approximately; there is a bit of regularization due to the Dirichlet prior on topics) the expected word-topic count matrix. To check this you can take that matrix and sum up all of its columns. The resulting vector (of length vocabulary-size) should be approximately equal to the word counts over all your documents. In any case, to obtain the topics (probability distributions over the words in your dictionary) you need to normalize the columns of the matrix returned by topicsMatrix so that each sums to 1.

I haven't tested this out fully, but something like the following should work to normalize the columns of the matrix returned by topicsMatrix:

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

def normalizeColumns(m: Matrix): DenseMatrix = {
  // m.toArray is column-major, which is the layout Breeze's DenseMatrix
  // constructor expects. (Matrices.toBreeze is package-private in MLlib,
  // so build the Breeze matrix from the public toArray instead.)
  val bm = new BDM(m.numRows, m.numCols, m.toArray)
  // Accumulate the column sums as a row vector.
  val columnSums = BDV.zeros[Double](bm.cols).t
  var i = bm.rows
  while (i > 0) { i -= 1; columnSums += bm(i, ::) }
  // Divide each column by its sum so that each column sums to 1.
  i = bm.cols
  while (i > 0) { i -= 1; bm(::, i) /= columnSums(i) }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}
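The same column normalization can be sketched without the Spark and Breeze dependencies. Below is a minimal, dependency-free version operating directly on a column-major array (the layout Spark's DenseMatrix and topicsMatrix use); the object name NormalizeColumnsDemo and the 3-word, 2-topic count matrix are made up for illustration:

```scala
object NormalizeColumnsDemo {
  // Normalize each column of a column-major (rows x cols) array so it sums to 1.
  def normalizeColumns(data: Array[Double], rows: Int, cols: Int): Array[Double] = {
    val out = new Array[Double](data.length)
    var j = 0
    while (j < cols) {
      var sum = 0.0
      var i = 0
      while (i < rows) { sum += data(j * rows + i); i += 1 }
      i = 0
      while (i < rows) { out(j * rows + i) = data(j * rows + i) / sum; i += 1 }
      j += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 3-word x 2-topic expected-count matrix, column-major.
    val counts = Array(548.2201, 685.2436, 138.4013, 100.0, 250.0, 650.0)
    val probs = normalizeColumns(counts, rows = 3, cols = 2)
    // After normalization, each column (topic) sums to approximately 1.
    println(probs.grouped(3).map(_.sum).mkString(", "))
  }
}
```

Each normalized column is then a probability distribution over the vocabulary for one topic, which is what the raw counts like 548.2201 were missing.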

Normalize the columns of the matrix returned by topicsMatrix in pure Scala:

def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): scala.Predef.Map[String, Array[Double]] = {

  // The incoming word-topic matrix is in column-major order and the columns are unnormalized.
  val m = wordTopMat.numRows
  val n = wordTopMat.numCols
  val columnSums: Array[Double] = Range(0, n).map(j => Range(0, m).map(i => wordTopMat(i, j)).sum).toArray

  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))

  wordProbs.zipWithIndex.map({ case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }).toMap
}
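To see the shape of the result without a Spark dependency, here is a small sketch of the same computation on a plain column-major array: sum each column, then divide each word's row by the per-topic column sums and key the result by the word. The object name FormatDemo, the 2x2 counts, and the two-word vocabulary map are hypothetical:

```scala
object FormatDemo {
  // data: m words x n topics, column-major, unnormalized expected counts.
  // Returns word -> per-topic probabilities for that word's row.
  def wordTopicProbs(data: Array[Double], m: Int, n: Int,
                     wordMap: Map[Int, String]): Map[String, Array[Double]] = {
    val columnSums = Array.tabulate(n)(j => (0 until m).map(i => data(j * m + i)).sum)
    (0 until m).map { i =>
      val row = Array.tabulate(n)(j => data(j * m + i) / columnSums(j))
      wordMap(i) -> row
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val counts = Array(1.0, 3.0, 2.0, 2.0)       // 2 words x 2 topics, column-major
    val wordMap = Map(0 -> "spark", 1 -> "lda")  // hypothetical vocabulary
    val probs = wordTopicProbs(counts, 2, 2, wordMap)
    // Per-topic probabilities for the word "spark".
    println(probs("spark").mkString(", "))
  }
}
```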

https://github.com/apache/incubator-spot/blob/v1.0-incubating/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala#L237
