
How does LDA give consistent results?

The popular topic model Latent Dirichlet Allocation (LDA), when used to extract topics from a corpus, returns different topics with different probability distributions over the dictionary words.

Whereas Latent Semantic Indexing (LSI) gives the same topics and the same distributions on every run.

In practice, LDA is widely used to extract topics. How does LDA maintain consistency if it returns a different topic distribution every time a classification is made?

Consider this simple example. A sample of documents is taken, where D represents a document:

D1: Linear Algebra techniques for dimensionality reduction
D2: dimensionality reduction of a sample database
D3: An introduction to linear algebra
D4: Measure of similarity and dissimilarity of different web documents
D5: Classification of data using database sample
D6: overfitting due lack of representative samples
D7: handling overfitting in descision tree
D8: proximity measure for web documents
D9: introduction to web query classification
D10: classification using LSI 

Each line represents a document. The LDA model is used to generate topics from the above corpus. Gensim is used for LDA; batch LDA is performed, with the number of topics set to 4 and the number of passes set to 20.
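For reference, a minimal sketch of how such a run might look in Gensim (the variable names and the plain whitespace tokenization are illustrative assumptions; stop-word removal is omitted):

    from gensim import corpora
    from gensim.models import LdaModel

    documents = [
        "Linear Algebra techniques for dimensionality reduction",
        "dimensionality reduction of a sample database",
        "An introduction to linear algebra",
        "Measure of similarity and dissimilarity of different web documents",
        "Classification of data using database sample",
        "overfitting due lack of representative samples",
        "handling overfitting in descision tree",
        "proximity measure for web documents",
        "introduction to web query classification",
        "classification using LSI",
    ]

    # Tokenize, then build the dictionary and the bag-of-words corpus.
    texts = [doc.lower().split() for doc in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Batch LDA (update_every=0) with 4 topics and 20 passes over the corpus.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4,
                   passes=20, update_every=0)
    for topic in lda.print_topics(num_topics=4, num_words=10):
        print(topic)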

Batch LDA is now performed on the original corpus, and the topics generated after 20 passes are:

topic #0: 0.045*query + 0.043*introduction + 0.042*similarity + 0.042*different + 0.041*reduction + 0.040*handling + 0.039*techniques + 0.039*dimensionality + 0.039*web + 0.039*using

topic #1: 0.043*tree + 0.042*lack + 0.041*reduction + 0.040*measure + 0.040*descision + 0.039*documents + 0.039*overfitting + 0.038*algebra + 0.038*proximity + 0.038*query

topic #2: 0.043*reduction + 0.043*data + 0.042*proximity + 0.041*linear + 0.040*database + 0.040*samples + 0.040*overfitting + 0.039*lsi + 0.039*introduction + 0.039*using

topic #3: 0.046*lsi + 0.045*query + 0.043*samples + 0.040*linear + 0.040*similarity + 0.039*classification + 0.039*algebra + 0.039*documents + 0.038*handling + 0.037*sample

Now batch LDA is performed on the same original corpus again and the topics generated in that case are: 现在批量LDA再次在同一原始语料库上执行,在这种情况下生成的主题是:

topic #0: 0.041*data + 0.041*descision + 0.041*linear + 0.041*techniques + 0.040*dimensionality + 0.040*dissimilarity + 0.040*database + 0.040*reduction + 0.039*documents + 0.038*proximity

topic #1: 0.042*dissimilarity + 0.041*documents + 0.041*dimensionality + 0.040*tree + 0.040*proximity + 0.040*different + 0.038*descision + 0.038*algebra + 0.038*similarity + 0.038*techniques

topic #2: 0.043*proximity + 0.042*data + 0.041*database + 0.041*different + 0.041*tree + 0.040*techniques + 0.040*linear + 0.039*classification + 0.038*measure + 0.038*representative

topic #3: 0.043*similarity + 0.042*documents + 0.041*algebra + 0.041*web + 0.040*proximity + 0.040*handling + 0.039*dissimilarity + 0.038*representative + 0.038*tree + 0.038*measure

The word distribution in each topic is not the same in the two cases. In fact, the word distributions are never the same.

So how does LDA work effectively if, unlike LSI, it does not produce the same word distribution in its topics?

I think there are two issues here. First, LDA training is not deterministic the way LSI training is; the common training algorithms for LDA are sampling methods. If results over multiple training runs are wildly different, that's either a bug, the wrong settings, or plain bad luck. You can try multiple runs of LDA training if you're trying to optimize some objective function.
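For instance, a hedged sketch of keeping the best of several training runs, using Gensim's log_perplexity on the training corpus as the objective purely for illustration (the corpus and dictionary are assumed to be the ones built in the question's example):

    from gensim.models import LdaModel

    # Train several models with different seeds and keep the best-scoring one.
    best_model, best_score = None, float('-inf')
    for seed in range(5):
        model = LdaModel(corpus=corpus, id2word=dictionary,
                         num_topics=4, passes=20, random_state=seed)
        score = model.log_perplexity(corpus)  # per-word likelihood bound; higher is better
        if score > best_score:
            best_model, best_score = model, score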

Then, as for clustering, querying, and classification: once you have a trained LDA model, you can apply that model to other documents in a deterministic way. Different LDA models will give you different results, but from the one LDA model that you've labeled as your final model, you'll always get the same result.
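As a small illustration (reusing the lda model and dictionary from the training sketch above; the query document is made up), applying the fixed model to an unseen document looks like this:

    # Infer the topic mixture of an unseen document with the already-trained model.
    new_doc = "dimensionality reduction for web documents"
    new_bow = dictionary.doc2bow(new_doc.lower().split())
    print(lda.get_document_topics(new_bow))  # list of (topic_id, probability) pairs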

If LDA uses randomness in both the training and inference steps, it will generate different topics every time. See this link: LDA model generates different topics everytime i train on the same corpus

There are three solutions to this problem:

  1. set a random_seed = 123
  2. pickle - you can save your trained model to a file and re-load it whenever you like without the topics changing. You can even transfer this file to another machine and use it there. We create a file name for the pre-trained model, open the file, dump the model as a pickle, and close the file handle. Loading the saved LDA Mallet wrapper from the pickle:

     import pickle

     # Save the trained LDA Mallet wrapper to a file.
     LDAMallet_file = 'Your Model'
     LDAMallet_pkl = open(LDAMallet_file, 'wb')
     pickle.dump(ldamallet, LDAMallet_pkl)
     LDAMallet_pkl.close()

     # Load the saved LDA Mallet wrapper back in.
     LDAMallet_pkl = open(LDAMallet_file, 'rb')
     ldamallet = pickle.load(LDAMallet_pkl)
     LDAMallet_pkl.close()
     print("Loaded LDA Mallet wrap --", ldamallet)

    Check out the documentation: https://docs.python.org/3/library/pickle.html

    Get it? pickle, because it preserves ;)

  3. joblib - same idea as pickle, but works better with large arrays (see the sketch after this list).
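A minimal joblib sketch of point 3, assuming the trained ldamallet model from point 2 above (the file name is arbitrary):

    import joblib

    # Persist the trained model so later runs reuse exactly the same topics.
    joblib.dump(ldamallet, 'lda_model.joblib')

    # ...later, possibly in another script or on another machine...
    ldamallet_loaded = joblib.load('lda_model.joblib')
    print(ldamallet_loaded.print_topics())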

I hope this helps :)

I am not entirely sure I understand the problem, but to make it precise: you are saying that LDA produces a different topic distribution on each run for the same set of data.

First, LDA uses randomness to obtain those probability distributions, so for each run you will get different topic weights and words, but you can control this randomness.

gensim.models.ldamodel.LdaModel(
    corpus, num_topics=number_of_topics, id2word=dictionary, passes=15, random_state=1)

Note the use of random_state: if you fix this number, you can easily reproduce the output.
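As a quick sanity check (a sketch, assuming the corpus and dictionary built in the question's example and the 4-topic setting used there), two trainings with the same random_state produce identical topics:

    from gensim.models import LdaModel

    # Two separate trainings with the same fixed seed yield the same topics.
    lda_a = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=15, random_state=1)
    lda_b = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=15, random_state=1)
    assert lda_a.print_topics() == lda_b.print_topics()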
