Why is reporting the log perplexity of an LDA model so slow in Spark mllib?
I am fitting an LDA model in Spark mllib using the OnlineLDAOptimizer. It only takes ~200 seconds to fit 10 topics on 9M documents (tweets).
val numTopics = 10
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
  .setK(numTopics)
  .setMaxIterations(2)
  .setDocConcentration(-1)   // use default symmetric document-topic prior
  .setTopicConcentration(-1) // use default symmetric topic-word prior

val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
val elapsed = (System.nanoTime() - startTime) / 1e9

/**
 * Print results
 */
// Print training time
println(s"Finished training LDA model. Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")
numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model. Summary:
Training time (sec) 202.640775542
However, when I request the log perplexity of this model (it looks like I need to cast it back to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity out so I can optimize k, the number of topics.)
ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.
In general, calculating the perplexity is not a straightforward matter: https://stats.stackexchange.com/questions/18167/how-to-calculate-perplexity-of-a-holdout-with-latent-dirichlet-allocation
Also, setting the number of topics by only looking at perplexity might not be the right approach: https://www.quora.com/What-are-good-ways-of-evaluating-the-topics-generated-by-running-LDA-on-a-corpus
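As an aside (my reading of the mllib API docs, so treat it as an assumption): logPerplexity returns an upper bound on the negative log-likelihood per token, so ordinary per-token perplexity is just its exponential. A minimal sketch using the value from the question's run:

```scala
// Convert the reported log perplexity into plain per-token perplexity.
// logPerplexity is (assumed to be) -logLikelihoodBound / tokenCount,
// so perplexity = exp(logPerplexity).
val logPerplexity = 7.006006572908673 // value from the question's run
val perplexity = math.exp(logPerplexity)
println(f"per-token perplexity ≈ $perplexity%.1f") // roughly 1.1e3
```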
LDAModels learned with the online optimizer are of type LocalLDAModel anyway, so no conversion is happening. I calculated perplexity on both local and distributed models: both take quite some time. Looking at the code, they have nested map calls over the whole dataset.
Calling:
docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)
for (9M * nonzero BOW entries) times can take quite some time. The code is from https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala, line 312.
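To see why each non-zero entry is expensive: logSumExp is the usual max-shifted log-sum-exp over a k-dimensional vector, so the bound costs O(k) work per non-zero bag-of-words entry, times 9M documents. A self-contained sketch of that primitive (plain Scala on Seq, not the Breeze-vector version mllib actually uses):

```scala
// Numerically stable log-sum-exp: log(sum_i exp(x_i)).
// Shifting by the max avoids overflow for large inputs.
def logSumExp(x: Seq[Double]): Double = {
  val m = x.max
  m + math.log(x.map(v => math.exp(v - m)).sum)
}

println(logSumExp(Seq(0.0, 0.0)))       // log(2) ≈ 0.693
println(logSumExp(Seq(1000.0, 1000.0))) // stable: 1000 + log(2), no overflow
```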
Training LDA is fast in your case because you train for just 2 iterations with 9M/mbf update calls.
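A rough back-of-envelope comparison (the miniBatchFraction value 0.05 below is a made-up assumption, since the question doesn't state mbf): training only touches a fraction of the corpus per iteration, while logPerplexity scans every document once.

```scala
// Hypothetical numbers to illustrate the cost gap; mbf = 0.05 is an assumption.
val docs = 9e6
val mbf = 0.05
val iterations = 2
val docsSeenInTraining   = docs * mbf * iterations // 900,000 document updates
val docsSeenInPerplexity = docs                    // all 9,000,000 documents
println(docsSeenInPerplexity / docsSeenInTraining) // perplexity scans 10x more docs
```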
By the way: the default for docConcentration is Vectors.dense(-1), not just an Int.
By the way, number 2: thanks for this question. I was having trouble running my algorithm on a cluster, just because I had this perplexity calculation in it and didn't know it causes so much trouble.