
Why is reporting the log perplexity of an LDA model so slow in Spark mllib?

I am fitting an LDA model in Spark mllib, using the OnlineLDAOptimizer. It only takes ~200 seconds to fit 10 topics on 9M documents (tweets).

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

val numTopics = 10
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
  .setK(numTopics)
  .setMaxIterations(2)
  .setDocConcentration(-1) // use default symmetric document-topic prior
  .setTopicConcentration(-1) // use default symmetric topic-word prior
val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
val elapsed = (System.nanoTime() - startTime) / 1e9 // training time in seconds

/**
 * Print results
 */
// Print training time
println(s"Finished training LDA model.  Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")

numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model.  Summary:
Training time (sec) 202.640775542

However, when I request the log perplexity of this model (it looks like I need to cast it back to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity out so I can optimize k, the number of topics; a sketch of the selection loop I have in mind follows the output below.)

ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.
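
For reference, a minimal sketch of the model-selection loop I have in mind, scoring each candidate k on a small held-out sample rather than all 9M documents (the names `heldOut` and `candidateKs` are illustrative, not from the code above):

import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

// Score each candidate k on ~1% of the corpus; logPerplexity does
// O(documents * nonzeros * k) work, so sampling makes each call far
// cheaper while the scores remain comparable across values of k.
val heldOut = countVectors.sample(withReplacement = false, fraction = 0.01, seed = 42L).cache()

val candidateKs = Seq(5, 10, 20, 40)
val results = candidateKs.map { k =>
  val model = new LDA()
    .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
    .setK(k)
    .setMaxIterations(2)
    .run(countVectors)
    .asInstanceOf[LocalLDAModel]
  k -> model.logPerplexity(heldOut)
}
results.foreach { case (k, p) => println(s"k=$k\tlog perplexity=$p") }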

LDAModels learned with the online optimizer are of type LocalLDAModel anyway, so there is no conversion happening. I calculated perplexity on both local and distributed models: they take quite some time. I mean, looking at the code, they have nested map calls over the whole dataset.

Calling:

docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)

for (9M * nonzero BOW entries) times can take quite some time; a back-of-envelope estimate follows below. The code is from: https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala, line 312.
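
To put a rough number on it, here is a back-of-envelope sketch, assuming (hypothetically) an average of 50 nonzero terms per tweet; the actual average is not given in the post:

// Each nonzero bag-of-words entry triggers one logSumExp over all k topics,
// i.e. O(k) work, so the bound touches roughly numDocs * nnzPerDoc * k terms.
val numDocs   = 9e6
val nnzPerDoc = 50.0 // assumed average nonzeros per document (tweet)
val k         = 10.0
println(f"~${numDocs * nnzPerDoc * k}%.2e inner operations for one logPerplexity call")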

Training LDA is fast in your case because you train for just 2 iterations, and each update call only sees a miniBatchFraction (mbf) of the 9M documents.

By the way, the default for docConcentration is Vectors.dense(-1) and not just an Int.
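
For illustration, the explicit vector form would look like this (a sketch; -1 simply tells Spark to fall back to its default symmetric prior):

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Passing the prior explicitly as a Vector; setDocConcentration also has a
// Double overload, which is why passing -1 in the question still compiles.
val lda = new LDA().setDocConcentration(Vectors.dense(-1))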

By the way, number 2: thanks for this question. I had trouble with my algorithm running on a cluster, just because I had this stupid perplexity calculation in it and didn't know it causes so much trouble.
