
Topics and Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative model that produces a set of topics, where each topic is represented by a distribution over words. Assume each topic is represented by its top 40 words.

Given a new document, how can I determine which topics make up this new document without needing to re-run LDA? In other words, how can I use the estimated topics to infer the topics of a new, unseen document?

Update:

For estimation we do the following (I ignored the hyperparameters for simplicity):

for (int iter = 0; iter < 1000; iter++) {
    for (int token = 0; token < numTokens; token++) {
        double[] values = new double[numTopics];
        double total = 0.0d;
        for (int topic = 0; topic < numTopics; topic++) {
            // P(w|z): how often this word occurs in the topic, normalized by the topic's size
            double probabilityOfWordGivenTopic =
                (double) topicCount[topic][word[token]] / numTokenInTopic[topic];
            // P(z|d): how prevalent the topic is in this token's document
            double probabilityOfTopicGivenDocument =
                (double) docCount[doc[token]][topic] / docLength[doc[token]];
            double pz = probabilityOfWordGivenTopic * probabilityOfTopicGivenDocument;
            values[topic] = pz;
            total += pz;
        }
        // a new topic for this token is then sampled from values/total, and the
        // counts (topicCount, numTokenInTopic, docCount) are updated accordingly
    }
}

Thank you.

In the inference step, you basically need to assign topics to the words of the new document.

For seen words, use your estimated model to assign the probabilities. Taking your example, since you have 40 topics, you have already learnt the word-topic distribution (the phi matrix) during LDA estimation. Now, for a word w seen during training, take the w-th column vector of this matrix, which has 40 entries. This vector gives you the probability of word w under each topic. Say, for example, this vector is (.02, .01, .... .004), which means that P(w|t_1) = .02, and so on.

In the new document, wherever you see this word w, draw a sample from this (normalized) distribution and assign that topic to it. Clearly, this word w is more likely to be assigned to its true (technically speaking, estimated) topic class learnt during estimation.

For OOV words (i.e. words which you haven't seen during training), one common practice is to use a uniform distribution, i.e. in your example assign each topic with probability 1/40.
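
A minimal sketch of this per-word assignment step, assuming a learnt matrix phi[k][w] = P(w|t_k) and using wordId < 0 to mark OOV words (these names are placeholders, not from the original post):

    import java.util.Random;

    class TopicAssigner {
        // phi[k][w] = P(word w | topic k), learnt during LDA estimation
        static int assignTopic(double[][] phi, int numTopics, int wordId, Random rng) {
            if (wordId < 0) {
                // OOV word: fall back to a uniform draw over the topics (probability 1/numTopics each)
                return rng.nextInt(numTopics);
            }
            // seen word: normalize the column phi[k][wordId] over topics and sample from it
            double total = 0.0;
            for (int k = 0; k < numTopics; k++) total += phi[k][wordId];
            double u = rng.nextDouble() * total;
            double cum = 0.0;
            for (int k = 0; k < numTopics; k++) {
                cum += phi[k][wordId];
                if (u <= cum) return k;
            }
            return numTopics - 1; // guard against floating-point round-off
        }
    }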

Edit

A code snippet extracted from JGibbsLDA follows:

    for (int m = 0; m < newModel.M; ++m){
        for (int n = 0; n < newModel.data.docs[m].length; n++){
            // newz_i = newz[m][n]
            // sample from p(z_i|z_-i,w)
            int topic = infSampling(m, n);
            newModel.z[m].set(n, topic);
        }
    } // end for each new doc

The main step in the inference sampling is to assign topic probabilities for a word w. Note that this probability depends partly on the estimated model (trnModel.nw[w][k] in the code) and partly on the new assignments (newModel.nw[_w][k]). For OOV words, trnModel.nw[w][k] is set to 1/K. This probability does not depend on P(w|d); instead, P(w|d) is just a posterior probability computed after the topic assignments are done via Gibbs sampling.

    // do multinomial sampling via cumulative method
    for (int k = 0; k < newModel.K; k++){
        newModel.p[k] = (trnModel.nw[w][k] + newModel.nw[_w][k] + newModel.beta) / (trnModel.nwsum[k] + newModel.nwsum[k] + Vbeta) *
                (newModel.nd[m][k] + newModel.alpha) / (newModel.ndsum[m] + Kalpha);
    }
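
The snippet above only fills newModel.p. The remaining part of the cumulative method (sketched here rather than quoted from JGibbsLDA) accumulates these unnormalized probabilities and draws a topic:

    // accumulate the unnormalized probabilities (cumulative method)
    for (int k = 1; k < newModel.K; k++){
        newModel.p[k] += newModel.p[k - 1];
    }
    // draw a uniform value scaled by the total mass and pick the first bucket it falls into
    double u = Math.random() * newModel.p[newModel.K - 1];
    int topic;
    for (topic = 0; topic < newModel.K - 1; topic++){
        if (newModel.p[topic] > u)
            break;
    }
    // 'topic' is the sampled assignment for this token; the count matrices
    // (newModel.nw, newModel.nd, newModel.nwsum, newModel.ndsum) are then updated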
