

Latent Dirichlet Allocation where number of topics is unknown

I'm looking for a technique similar to LDA, but I don't know how many "mixtures" (topics) would be optimal. Is there anything that can do this?

There are two ways of going about this: one hacky but easy, the other better motivated but more complex. Starting with the former, one could simply try a range of k (number of topics) and compare the likelihoods of the observed data under each. Depending on your situation, you would probably want to penalize larger numbers of topics, or you could explicitly place a prior distribution over k (e.g., a normal distribution centered on the subjectively expected number of topics). In either case, you would simply select the k that maximizes the (penalized) likelihood.
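A minimal sketch of this "try a range of k" search, assuming scikit-learn is available; the toy corpus, the k range, and all variable names are placeholders, not from the original answer. Note that on training data alone the likelihood tends to keep rising with k, which is exactly why the answer suggests a penalty or a prior over k.

```python
# Hypothetical sketch: fit LDA for several values of k and compare
# (approximate) log-likelihoods, keeping the best-scoring k.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy document-term matrix: 20 documents over a 10-word vocabulary.
X = rng.integers(0, 5, size=(20, 10))

best_k, best_ll = None, -np.inf
for k in range(2, 6):  # candidate numbers of topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    ll = lda.score(X)  # approximate log-likelihood of the data under the model
    # A penalty term (e.g., subtracting c * k) could be added here, as the
    # answer suggests, to discourage large k on training data.
    if ll > best_ll:
        best_k, best_ll = k, ll

print(best_k)
```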

The more principled approach is to use Bayesian nonparametrics, and in particular Dirichlet processes in the case of topic models. Have a look at this paper. I do believe there is an implementation available here, though I haven't looked into it much.
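To give a feel for the nonparametric idea, here is a toy Chinese restaurant process draw (the standard metaphor for Dirichlet process clustering): the number of clusters is not fixed in advance but grows as the data arrive. This is an illustrative sketch, not the implementation referenced above; `alpha` and `n` are arbitrary example values.

```python
# Hypothetical sketch of the Chinese restaurant process (CRP): each new
# "customer" (observation) joins an existing "table" (cluster/topic) with
# probability proportional to its size, or opens a new table with
# probability proportional to the concentration parameter alpha.
import numpy as np

def crp(n, alpha, rng):
    """Sample table assignments for n customers under CRP(alpha)."""
    counts = []       # number of customers at each existing table
    assignments = []  # table index assigned to each customer
    for i in range(n):
        total = i + alpha
        # Probabilities for each existing table, plus one for a new table.
        probs = [c / total for c in counts] + [alpha / total]
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)  # open a new table (a new topic appears)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, len(counts)

rng = np.random.default_rng(0)
assignments, n_tables = crp(100, alpha=2.0, rng=rng)
print(n_tables)  # number of clusters emerged from the data, not preset
```

In a Dirichlet-process topic model, this same rich-get-richer mechanism governs topic assignments, so inference determines the number of topics rather than the user.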

As Byron said, the simplest way to do this is to compare likelihoods for different values of k. However, if you take care to consider the probability of some held-out data (i.e., data not used to induce the model), this naturally penalises overfitting, so you don't need to penalize for k. A simple way to do this is to take your training data and split it into a training set and a dev set, do a search over a range of plausible k values, induce a model from the training set for each k, and then compute the dev-set probability given the induced model.
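The held-out procedure above can be sketched as follows, again assuming scikit-learn; dev-set perplexity (a monotone transform of the held-out likelihood, lower is better) stands in for the held-out probability, and the toy data and k range are placeholders.

```python
# Hypothetical sketch: split the corpus into train/dev, fit LDA on train
# for each candidate k, and pick the k with the lowest dev-set perplexity.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 12))  # toy document-term matrix
X_train, X_dev = train_test_split(X, test_size=0.25, random_state=0)

perplexities = {}
for k in range(2, 7):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    # Perplexity on held-out data: overfit models score worse here,
    # so no explicit penalty on k is needed.
    perplexities[k] = lda.perplexity(X_dev)

best_k = min(perplexities, key=perplexities.get)
print(best_k)
```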

It's worth mentioning that computing the likelihood exactly under LDA is intractable, so you're going to need to use approximate inference. This paper goes into this in depth, but if you use a standard LDA package (I'd recommend MALLET: http://mallet.cs.umass.edu/ ), it should have this functionality already.

The non-parametric version is indeed the correct way to go, but inference in non-parametric models is computationally expensive, so I would hesitate to pursue it unless the above doesn't work.

