Why is reporting the log perplexity of an LDA model so slow in Spark mllib?
I am fitting an LDA model in Spark mllib using the OnlineLDAOptimizer. It only takes ~200 seconds to fit 10 topics on 9M documents (tweets).
val numTopics = 10
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
  .setK(numTopics)
  .setMaxIterations(2)
  .setDocConcentration(-1)   // use default symmetric document-topic prior
  .setTopicConcentration(-1) // use default symmetric topic-word prior

val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
val elapsed = (System.nanoTime() - startTime) / 1e9

/**
 * Print results
 */
// Print training time
println(s"Finished training LDA model. Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")
numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model. Summary:
Training time (sec) 202.640775542
However, when I request the log perplexity of this model (it looks like I need to cast it back to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity out so I can optimize k, the number of topics.)
ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.
In general, calculating the perplexity is not a straightforward matter: https://stats.stackexchange.com/questions/18167/how-to-calculate-perplexity-of-a-holdout-with-latent-dirichlet-allocation
Also, setting the number of topics by only looking at perplexity might not be the right approach: https://www.quora.com/What-are-good-ways-of-evaluating-the-topics-generated-by-running-LDA-on-a-corpus
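As an aside (my reading of the mllib API docs, so treat it as an assumption): logPerplexity returns an upper bound on the negative log-likelihood per token, so ordinary per-token perplexity is just its exponential. A minimal sketch using the value from the question's run:

```scala
// Convert the reported log perplexity into plain per-token perplexity.
// logPerplexity is (assumed to be) -logLikelihoodBound / tokenCount,
// so perplexity = exp(logPerplexity).
val logPerplexity = 7.006006572908673 // value from the question's run
val perplexity = math.exp(logPerplexity)
println(f"per-token perplexity ≈ $perplexity%.1f") // roughly 1.1e3
```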
LDAModels learned with the online optimizer are of type LocalLDAModel anyway, so no conversion is happening. I calculated perplexity on both local and distributed models: both take quite some time. Looking at the code, they have nested map calls over the whole dataset.
Calling:
docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)
for (9M * nonzero BOW entries) times can take quite some time. The code is from https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala, line 312.
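To see why each non-zero entry is expensive: logSumExp is the usual max-shifted log-sum-exp over a k-dimensional vector, so the bound costs O(k) work per non-zero bag-of-words entry, times 9M documents. A self-contained sketch of that primitive (plain Scala on Seq, not the Breeze-vector version mllib actually uses):

```scala
// Numerically stable log-sum-exp: log(sum_i exp(x_i)).
// Shifting by the max avoids overflow for large inputs.
def logSumExp(x: Seq[Double]): Double = {
  val m = x.max
  m + math.log(x.map(v => math.exp(v - m)).sum)
}

println(logSumExp(Seq(0.0, 0.0)))       // log(2) ≈ 0.693
println(logSumExp(Seq(1000.0, 1000.0))) // stable: 1000 + log(2), no overflow
```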
Training LDA is fast in your case because you train for just 2 iterations with 9M/mbf update calls.
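A rough back-of-envelope comparison (the miniBatchFraction value 0.05 below is a made-up assumption, since the question doesn't state mbf): training only touches a fraction of the corpus per iteration, while logPerplexity scans every document once.

```scala
// Hypothetical numbers to illustrate the cost gap; mbf = 0.05 is an assumption.
val docs = 9e6
val mbf = 0.05
val iterations = 2
val docsSeenInTraining   = docs * mbf * iterations // 900,000 document updates
val docsSeenInPerplexity = docs                    // all 9,000,000 documents
println(docsSeenInPerplexity / docsSeenInTraining) // perplexity scans 10x more docs
```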
By the way: the default for docConcentration is Vectors.dense(-1), not just an Int.
By the way, number 2: thanks for this question. I was having trouble running my algorithm on a cluster, just because I had this perplexity calculation in it and didn't know it causes so much trouble.