
Strange perplexity values of LDA model trained with MALLET

I have trained an LDA model with MALLET on parts of the Stack Overflow data dump and did a 70/30 split for training and test data.

But the perplexity values are strange: they are lower for the test set than for the training set. How is this possible? I thought the model should fit the training data better.

I have already double-checked my perplexity calculations, but I cannot find an error. Do you have any idea what the reason could be?
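For reference, perplexity is usually derived from the per-token log likelihood as

$$\text{perplexity} = \exp\!\left(-\frac{1}{N}\sum_{d}\log p(w_d)\right) = \exp\!\left(-\,\text{LL/token}\right),$$

where $N$ is the total number of tokens and $\log p(w_d)$ is the log likelihood of document $d$. A higher LL/token therefore corresponds to a lower perplexity. (The standard definition; I am assuming this is the calculation being checked.)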

Thank you in advance!

[screenshot: perplexity / LL-per-token values for the training and test sets]

Edit:

Instead of taking the LL/token value for the training set from the console output, I have now run the evaluator on the training set as well. The values seem plausible now.

[screenshot: LL/token values for both sets recomputed with the evaluator]
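For anyone checking the same thing, here is a minimal sketch of applying MALLET's held-out evaluator to both splits through the Java API, assuming the instance lists and the trained model have already been serialized. The file names, the class name, and the particle count of 10 are illustrative, not taken from the original setup:

```java
import java.io.File;

import cc.mallet.topics.MarginalProbEstimator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class HeldOutLikelihood {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths to the serialized instance lists and the trained model
        InstanceList training = InstanceList.load(new File("train.mallet"));
        InstanceList testing  = InstanceList.load(new File("test.mallet"));
        ParallelTopicModel model = ParallelTopicModel.read(new File("lda.model"));

        // Left-to-right estimator of the marginal word probability, summed over topics
        MarginalProbEstimator evaluator = model.getProbEstimator();

        // 10 particles, no resampling, no per-document output stream (illustrative settings)
        double llTrain = evaluator.evaluateLeftToRight(training, 10, false, null);
        double llTest  = evaluator.evaluateLeftToRight(testing, 10, false, null);

        System.out.println("training total log likelihood: " + llTrain);
        System.out.println("test     total log likelihood: " + llTest);
    }
}
```

The returned value is the total marginal log likelihood of the documents; dividing it by the total token count of each split gives the LL/token figure, which can then be plugged into the perplexity formula above.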

That makes sense. The LL/token number is giving you the probability of both topic assignments and the observed words, whereas the held-out probability is giving you the marginal probability of just the observed words, summed over topics.
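In symbols (a sketch of the distinction described above, with $z^{*}$ denoting one particular topic assignment):

$$\frac{1}{N}\log P(w, z^{*}) \;\le\; \frac{1}{N}\log \sum_{z} P(w, z) = \frac{1}{N}\log P(w)$$

Since the marginal sums probability mass over all assignments, the held-out LL/token is never smaller than the joint LL/token under one assignment, which is why comparing the console value for the training set against the evaluator value for the test set made the test set look better.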
