How to interpret LDA components (using sklearn)?

Question

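I used Latent Dirichlet Allocation (the sklearn implementation) to analyze about 500 abstracts of scientific articles, and I got topics containing their most important words (in German). My problem is interpreting the values associated with those most important words. I assumed that the probabilities of all words in a topic would add up to 1, but that is not the case.

How can I interpret these values? For example, I would like to be able to explain why topic #20 has much higher word values than the other topics. Does their absolute height have something to do with Bayesian probability? Is that topic more common in the corpus? I have not yet managed to connect these values with the math behind LDA.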

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# stop_ger and stemmer_sklearn are the asker's own objects (not shown here)
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())

tf = tf_vectorizer.fit_transform(texts)

n_topics = 10
# The n_topics argument was renamed to n_components in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50., random_state=0)

lda.fit(tf)

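def print_top_words(model, feature_names, n_top_words):
    # Each row of components_ holds the per-word values of one topic
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % (topic_id + 1))
        # argsort is ascending, so the reversed slice lists the largest values first
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
              + ' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))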

n_top_words = 4
# Use get_feature_names() instead on scikit-learn versions older than 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

Topic Nr.1: demenzforsch 1.31 | fotus 1.21 | umwelteinfluss 1.16 | forschungsergebnis 1.04 |
Topic Nr.2: fur 1.47 | zwisch 0.94 | uber 0.81 | kontext 0.8 |
...
Topic Nr.20: werd 405.12 | fur 399.62 | sozial 212.31 | beitrag 177.95 |

Answer 1

From the documentation:

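components_: Variational parameters for topic word distribution. Since the complete conditional for topic word distribution is a Dirichlet, components_[i, j] can be viewed as a pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as a distribution over the words for each topic after normalization: model.components_ / model.components_.sum(axis=1)[:, np.newaxis].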
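The normalization quoted above can be sketched with plain NumPy. The array below is a made-up stand-in for model.components_ (two topics, three words); it is not the asker's fitted model, and the numbers only mimic the scale of the output in the question.

```python
import numpy as np

# Hypothetical pseudo-counts standing in for model.components_
# (rows = topics, columns = words); values are illustrative only.
components = np.array([
    [1.31, 1.21, 1.16],
    [405.12, 399.62, 212.31],
])

# Divide every row by its own sum so each topic becomes a
# probability distribution over the vocabulary.
topic_word = components / components.sum(axis=1)[:, np.newaxis]

print(topic_word.round(3))
print(topic_word.sum(axis=1))  # every row now sums to 1
```

After this step the two rows are on the same scale, even though their raw values differ by two orders of magnitude.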

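So the values can be viewed as a distribution if you normalize over the component to evaluate the importance of each term in the topic. AFAIU you cannot use the pseudocounts to compare the importance of two topics in the corpus, as they are a smoothing factor applied to the term-topic distribution.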
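If the goal is to see how common a topic is in the corpus, one possible approach (a suggestion, not part of the answer above) is to look at the document-topic distributions returned by transform() rather than at components_. The sketch below uses an invented toy corpus and assumes scikit-learn 0.19 or later for the n_components argument.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus, invented for this sketch.
docs = [
    "social research on ageing",
    "social policy and research",
    "gene expression in cells",
    "cells and gene regulation",
]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# transform() returns one normalized topic distribution per document.
doc_topic = lda.transform(X)

# Averaging over documents gives a rough share of the corpus per topic.
topic_share = doc_topic.mean(axis=0)
print(topic_share)
```

This average weights every document equally regardless of its length, so it is only a rough indicator of topic prevalence.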

Answer 1
1 2018-02-20 17:22:23
