
Topics and Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative model that produces a list of topics. Each topic is represented by a distribution over words. Assume each topic is represented by its top 40 words.

Given a new document, how can I determine which topics make up this new document without needing to re-run LDA again? In other words, how can I use the estimated topics to infer the topic mixture of a new, unseen document?

Update:

For estimation we do the following (I ignored the hyperparameters for simplicity):

for (int iter = 0; iter < 1000; iter++) {
    for (int token = 0; token < numTokens; token++) {
        double[] values = new double[numTopics];
        double total = 0.0d;
        for (int topic = 0; topic < numTopics; topic++) {
            // P(w|z): relative frequency of this word within the topic
            double probabilityOfWordGivenTopic =
                (double) topicCount[topic][word[token]] / numTokenInTopic[topic];
            // P(z|d): relative frequency of the topic within the token's document
            double probabilityOfTopicGivenDocument =
                (double) docCount[doc[token]][topic] / docLength[doc[token]];
            double pz = probabilityOfWordGivenTopic * probabilityOfTopicGivenDocument;
            values[topic] = pz;
            total += pz;
        }
        // values[] is then normalized by total and a new topic is sampled for this token
    }
}
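This loop only fills values[]; the actual Gibbs update then samples a new topic from values[] and moves the token's counts accordingly. A rough sketch of that omitted step, meant to sit at the end of the token loop above (topicAssignment, holding each token's current topic, is a hypothetical array added for illustration; a full collapsed Gibbs sampler would also subtract the token's own counts before computing values[]):

        // draw a new topic from the unnormalized weights in values[]
        double u = Math.random() * total;
        int newTopic = 0;
        double running = 0.0d;
        for (int topic = 0; topic < numTopics; topic++) {
            running += values[topic];
            if (running >= u) { newTopic = topic; break; }
        }
        // move this token's counts from its old topic to the sampled one
        int oldTopic = topicAssignment[token];   // hypothetical bookkeeping array
        topicCount[oldTopic][word[token]]--;  numTokenInTopic[oldTopic]--;  docCount[doc[token]][oldTopic]--;
        topicCount[newTopic][word[token]]++;  numTokenInTopic[newTopic]++;  docCount[doc[token]][newTopic]++;
        topicAssignment[token] = newTopic;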

Thank you.

In the inference step, you basically need to assign topics to the words of the new document.

For seen words, use your estimated model to assign the probabilities. Taking your example, since you have 40 topics, you have already learnt the word-topic distribution (the phi matrix) during the LDA estimation. Now, for a word seen during training, say w, take the w-th column vector of this matrix, which is of size 40. This vector gives you the class membership probabilities of word w in each topic. Say, for example, this vector is (.02, .01, ..., .004), which means that P(w|t_1) = .02, and so on.

In the new document, wherever you see this word w, draw a sample from this distribution and assign a topic to that occurrence. Clearly, this word w is more likely to be assigned to its true (technically speaking, estimated) topic class learnt during the estimation process.

For OOV words (i.e. words which you haven't seen during training), one common practice is to use a uniform distribution, i.e. in your example, use a probability of 1/40 to assign topics to them.
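Putting the last three paragraphs together, here is a minimal, self-contained sketch of this "folding-in" idea (the phi matrix, the word ids of the new document, and the -1 OOV marker are all made-up assumptions for illustration, not part of any particular library):

    import java.util.Random;

    public class SimpleLdaInference {
        // phi[k][w] = P(w | topic k), learnt during LDA estimation
        static int sampleTopic(double[][] phi, int wordId, Random rng) {
            int numTopics = phi.length;
            double[] p = new double[numTopics];
            double total = 0.0d;
            for (int k = 0; k < numTopics; k++) {
                // seen word: use the estimated P(w|k); OOV word (wordId < 0): uniform 1/numTopics
                p[k] = (wordId >= 0) ? phi[k][wordId] : 1.0 / numTopics;
                total += p[k];
            }
            // draw one topic from the (unnormalized) distribution p
            double u = rng.nextDouble() * total;
            double running = 0.0d;
            for (int k = 0; k < numTopics; k++) {
                running += p[k];
                if (running >= u) return k;
            }
            return numTopics - 1;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            // toy phi matrix: 2 topics, vocabulary of 3 words
            double[][] phi = { {0.7, 0.2, 0.1}, {0.1, 0.3, 0.6} };
            int[] newDoc = {0, 2, -1};            // word ids of the new document; -1 marks an OOV word
            int[] topicOfToken = new int[newDoc.length];
            for (int n = 0; n < newDoc.length; n++) {
                topicOfToken[n] = sampleTopic(phi, newDoc[n], rng);
            }
            System.out.println(java.util.Arrays.toString(topicOfToken));
        }
    }

Note that this sketch samples from P(w|t) alone; the Gibbs-sampling based inference in the snippet below also takes the new document's own topic counts into account.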

Edit

A code snippet extracted from JGibbsLDA follows:

        for (int m = 0; m < newModel.M; ++m){
            for (int n = 0; n < newModel.data.docs[m].length; n++){
                // (newz_i = newz[m][n]
                // sample from p(z_i|z_-1,w)
                int topic = infSampling(m, n);
                newModel.z[m].set(n, topic);
            }
        }//end foreach new doc

The main step in the inference sampling is to assign the probabilities for a word w. Note that this probability depends partly on the estimated model counts (trnModel.nw[w][k] in the code) and partly on the new assignment counts (newModel.nw[_w][k]). For OOV words, trnModel.nw[w][k] is set to 1/K. This probability doesn't depend on P(w|d). Instead, P(w|d) is just a posterior probability computed after the topic assignments are done via Gibbs sampling.

    // do multinomial sampling via cummulative method
    for (int k = 0; k < newModel.K; k++){
        newModel.p[k] = (trnModel.nw[w][k] + newModel.nw[_w][k] + newModel.beta)/(trnModel.nwsum[k] +  newModel.nwsum[k] + Vbeta) *
                (newModel.nd[m][k] + newModel.alpha)/(newModel.ndsum[m] + Kalpha);
    }
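For completeness, once newModel.p[] has been filled as above, the "cumulative method" mentioned in the comment typically finishes the draw along these lines (a sketch of the standard trick, not a verbatim quote from the library):

        // cumulate the (unnormalized) multinomial parameters
        for (int k = 1; k < newModel.K; k++) {
            newModel.p[k] += newModel.p[k - 1];
        }
        // scaled sample, because p[] was never normalized
        double u = Math.random() * newModel.p[newModel.K - 1];
        int topic;
        for (topic = 0; topic < newModel.K; topic++) {
            if (newModel.p[topic] > u)
                break;
        }
        // 'topic' is the new assignment for token n of document m

The per-document topic mixture P(t|d) that answers the original question is then read off the final counts once the sampling iterations are done, essentially (newModel.nd[m][k] + alpha) / (newModel.ndsum[m] + Kalpha), which is the same expression already used in the sampling formula above.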
