
Reproducible LDA Model in Scikit-learn

I am using LDA for topic modelling.

from sklearn.decomposition import LatentDirichletAllocation

Using a set of 10 files, I made the model. Now, I try to cluster it into 3.

Similar to the code below:

'''

import numpy as np  
data = []
a1 = " a word in groupa doca"
a2 = " a word in groupa docb"
a3 = "a word in groupb docc"
a4 = "a word in groupc docd"
a5 = "a word in groupc doce"
data = [a1,a2,a3,a4,a5]
del a1,a2,a3,a4,a5

NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)


from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

NUM_TOPICS = 2

vectorizer = CountVectorizer(min_df=0.001, max_df=0.99998,
                             stop_words='english', lowercase=True,
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

# Build a Latent Dirichlet Allocation model
# (n_topics was renamed n_components in scikit-learn 0.19)
lda_model = LatentDirichletAllocation(n_topics=NUM_TOPICS,
                                      max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)

vocab = vectorizer.get_feature_names()  
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())

# group document indices by their most probable topic
TOPICWISEDOCUMENTS = [[] for _ in range(NUM_TOPICS)]
for iDocIndex, text in enumerate(data):
    x = list(lda_model.transform(vectorizer.transform([text]))[0])
    maxIndex = x.index(max(x))
    TOPICWISEDOCUMENTS[maxIndex].append(iDocIndex)

print(TOPICWISEDOCUMENTS)

'''


Whenever I run the script, I get different clusters, even for the same set of input data.

In other words, the LDA is not reproducible.

How can I make it reproducible?

For reproducibility in scikit-learn, set the random_state parameter everywhere it appears in your code.

In your case, that is LatentDirichletAllocation(...).

Use this:

lda_model = LatentDirichletAllocation(n_topics=NUM_TOPICS,
                                      max_iter=10,
                                      learning_method='online',
                                      random_state=42)

Check this link:

If you want to make your whole script reproducible and don't want to search for every place to put random_state, you can set a global numpy random seed.

import numpy as np
np.random.seed(42)

See this: http://scikit-learn.org/stable/faq.html#how-do-i-set-a-random-state-for-an-entire-execution
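As that FAQ explains, when random_state is left unset, scikit-learn falls back to numpy's global random state, so reseeding it right before each fit makes the fit repeatable. A minimal sketch of this (toy documents of my own, not from the thread, and using n_components, the current name of the n_topics parameter):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, for illustration only
docs = ["alpha beta gamma", "beta gamma delta", "gamma delta epsilon"]
X = CountVectorizer().fit_transform(docs)

def fit_from_global_seed():
    # reseed numpy's global RandomState immediately before fitting;
    # with random_state unset, scikit-learn draws from this global
    # state, so every fit starts from the same point
    np.random.seed(42)
    lda = LatentDirichletAllocation(n_components=2, max_iter=10,
                                    learning_method='online')
    return lda.fit_transform(X)

Z1 = fit_from_global_seed()
Z2 = fit_from_global_seed()
print(np.allclose(Z1, Z2))
```

Note the caveat: seeding once at the top of a script makes whole-script reruns repeatable, but it does not make two fits *within* one run identical, because each fit consumes the global state. That may be why the global-seed approach appears ineffective below.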

lda_model = LatentDirichletAllocation(n_topics=NUM_TOPICS,
                                      max_iter=10,
                                      learning_method='online',
                                      random_state=42)

Worked...!!!

Thanks a lot.

Also, I had tried this:

import numpy as np
np.random.seed(42)

But it was not effective.

Thanks for the resolution.
