
Latent Dirichlet Allocation Solution Example

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop the intuition behind LDA. However, I still don't have a complete understanding of the various calculations that go into it. Can someone show me the calculations using a very small corpus (say 3-5 sentences and 2-3 topics)?

Edwin Chen (who works at Twitter btw) has an example in his blog. 5 sentences, 2 topics:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

Then he does some "calculations"

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B

And takes guesses at the topics:

  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
    • at which point, you could interpret topic A to be about food
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
    • at which point, you could interpret topic B to be about cute animals
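To see the mechanics in practice, here is a minimal sketch of fitting a two-topic LDA to these five sentences with scikit-learn (my choice of library — Chen's post does not use it). Topic numbering is arbitrary and the learned proportions will not match Chen's numbers exactly on such a tiny corpus; this only shows the workflow.

```python
# Fit a 2-topic LDA to Chen's five sentences (scikit-learn is an
# assumption; results on a corpus this small are noisy).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

# Bag-of-words counts with English stop words removed
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# Two topics, as in Chen's example
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is that sentence's topic mix

for sentence, mix in zip(sentences, doc_topics):
    print(f"{mix.round(2)}  {sentence}")
```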

Your question is: how did he come up with those numbers? Which words in these sentences carry "information"?

  • broccoli, bananas, smoothie, breakfast, munching, eat
  • chinchilla, kitten, cute, adopted, hamster

Now let's go sentence by sentence, counting the words from each topic:

  • food 3, cute 0 --> food
  • food 5, cute 0 --> food
  • food 0, cute 3 --> cute
  • food 0, cute 2 --> cute
  • food 2, cute 2 --> 50% food + 50% cute

So my numbers differ slightly from Chen's. Maybe he counts the word "piece" in "piece of broccoli" toward food.
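The tallies above can be reproduced in a few lines. The topic word lists here are my reading of the answer's "information" words, extended with the variants that actually occur in the sentences (ate, banana, kitten, and spinach — an assumption that makes the counts match):

```python
# Reproduce the hand tallies above. The word sets are assumptions:
# the answer's "information" words plus variants occurring in the text.
food = {"eat", "ate", "broccoli", "bananas", "banana",
        "spinach", "smoothie", "breakfast", "munching"}
cute = {"chinchillas", "kittens", "kitten", "cute", "adopted", "hamster"}

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

tallies = []
for s in sentences:
    words = s.lower().rstrip(".").split()
    f = sum(w in food for w in words)
    c = sum(w in cute for w in words)
    tallies.append((f, c))
    print(f"food {f}, cute {c}")
# prints: food 3, cute 0 / food 5, cute 0 / food 0, cute 3
#         food 0, cute 2 / food 2, cute 2
```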


We made two calculations in our heads:

  • to look at the sentences and come up with 2 topics in the first place. LDA does this by considering each sentence as a "mixture" of topics and guessing the parameters of each topic.
  • to decide which words are important. Strictly speaking, LDA does not compute "term-frequency/inverse-document-frequency"; in practice stop words are removed beforehand, and the informative words emerge from the topic-word counts during sampling.

LDA Procedure

Step 1: Go through each document and randomly assign each word in the document to one of K topics (K is chosen beforehand).

Step 2: This random assignment gives topic representations of all the documents and word distributions of all the topics, albeit not very good ones.

So, to improve upon them: For each document d, go through each word w and compute:

  • p(topic t | document d): proportion of words in document d that are assigned to topic t

  • p(word w | topic t): proportion of assignments to topic t, over all documents, that come from word w

Step 3: Reassign word w a new topic t', where we choose topic t' with probability

  • p(topic t' | document d) * p(word w | topic t')

This generative model predicts the probability that topic t' generated word w. We iterate this last step multiple times for each document in the corpus to reach a steady state.
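The three steps above can be sketched as a toy collapsed Gibbs sampler. The smoothing hyperparameters alpha and beta are my assumptions (the answer's probabilities correspond to the unsmoothed proportions, which would break on zero counts):

```python
# Toy collapsed Gibbs sampler for LDA implementing Steps 1-3.
# alpha/beta smoothing is an assumption not spelled out in the text.
import random

def lda_gibbs(docs, K, iters=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Step 1: randomly assign each word occurrence to one of K topics
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndk = [[0] * K for _ in docs]                    # doc -> topic counts
    nkw = [{w: 0 for w in vocab} for _ in range(K)]  # topic -> word counts
    nk = [0] * K                                     # topic totals
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Steps 2-3, repeated until (roughly) steady state
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1  # unassign w
                # p(topic k | doc d) * p(word w | topic k), smoothed
                weights = [(ndk[di][k] + alpha) *
                           (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[di][wi] = t                                # reassign
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return z, ndk, nkw
```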

Solved calculation

Let's say you have two documents.

Doc i: "The bank called about the money."

Doc ii: "The bank said the money was approved."

After removing stop words, capitalization, and punctuation, the unique words in the corpus are: bank, called, money, said, approved.

[missing image: initial word-topic assignments and count tables]

[missing image: counts after random topic initialization]

Next, we randomly select a word from doc i (the word "bank", with topic assignment 1), remove its topic assignment, and calculate the probability of its new assignment.


For topic k=1: [missing image: probability calculation for topic 1]

For topic k=2: [missing image: probability calculation for topic 2]

Now we calculate the product of those two probabilities: [missing image: product of the two probabilities]

Topic 2 is a better fit than topic 1 for both the document and the word (its product, the area, is greater), so the new assignment for the word "bank" is topic 2.

Now we update the counts to reflect the new assignment. [missing image: updated count tables]

We then repeat the same reassignment step, iterating through every word of the whole corpus. [missing image: assignments after further iterations]
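Since the figures are missing from this copy of the page, here is the same product computed with hypothetical counts (my assumption, not the original numbers): after removing the current assignment of "bank", suppose doc i has one remaining word in each topic, "bank" has been assigned to topic 2 once elsewhere in the corpus, and each topic holds three words in total.

```python
# Numeric stand-in for the missing figures: the product
# p(topic | doc) * p(word | topic) for the word "bank" in doc i.
# All counts and hyperparameters are hypothetical assumptions.
alpha, beta, V = 0.1, 0.01, 5  # smoothing terms; V = vocabulary size

def score(n_dk, n_kw, n_k):
    # (words of doc d in topic k + alpha) * (count of w in topic k + beta)
    # divided by (total words in topic k + V * beta)
    return (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)

# With "bank" unassigned: doc i has 1 word in each topic; "bank" occurs
# once under topic 2 elsewhere; each topic holds 3 words overall.
p1 = score(n_dk=1, n_kw=0, n_k=3)  # topic 1
p2 = score(n_dk=1, n_kw=1, n_k=3)  # topic 2

new_topic = 1 if p1 > p2 else 2
print(p1, p2, new_topic)  # topic 2 wins, matching the worked example
```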

I am learning LDA from blogs and articles available online. The links in both the question and the answer are broken. Is there another link where I can get an understanding of LDA in layman's terms?
