How do I find the coherence score of LSA and LDA models in sklearn?

Question

I want to compare the coherence scores of LSA and LDA models.

LSA model

lsa_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=40, random_state=5000)

lsa_top=lsa_model.fit_transform(vect_text)

LDA model

lda_model=LatentDirichletAllocation(n_components=20,learning_method='online',random_state=42,max_iter=1)

Can someone help me compute the coherence scores for these two models?

Thanks in advance!

Answer 1

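I am using sklearn's TfidfVectorizer together with TruncatedSVD to find the best number of topics for my corpus. I could not find a built-in coherence measure for TruncatedSVD, so I had to implement my own. The code is based on this article: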

http://qpleple.com/topic-coherence-to-evaluate-topic-models/

I decided to stick with the UMass intrinsic measure because it is relatively easy to implement. The supporting functions are:

import math

def get_umass_score(dt_matrix, i, j):
    # Binarize: 1 where a document contains the word, 0 otherwise
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    # A document contains both words exactly when the two indicators sum to 2
    col_ij = col_i + col_j
    col_ij = (col_ij == 2).astype(int)
    # Di: documents containing word i; Dij: documents containing both words
    Di, Dij = col_i.sum(), col_ij.sum()
    return math.log((Dij + 1) / Di)

def get_topic_coherence(dt_matrix, topic, n_top_words):
    # Pair each word weight with its column index, then keep the heaviest words
    indexed_topic = zip(topic, range(0, len(topic)))
    topic_top = sorted(indexed_topic, key=lambda x: -x[0])[0:n_top_words]
    coherence = 0
    for j_index in range(0, len(topic_top)):
        # Pair word j with every higher-ranked word i (i_index < j_index)
        for i_index in range(0, j_index):
            i = topic_top[i_index][1]
            j = topic_top[j_index][1]
            coherence += get_umass_score(dt_matrix, i, j)
    return coherence

def get_average_topic_coherence(dt_matrix, topics, n_top_words):
    # Mean coherence over all topics (rows of the components_ matrix)
    total_coherence = 0
    for i in range(0, len(topics)):
        total_coherence += get_topic_coherence(dt_matrix, topics[i], n_top_words)
    return total_coherence / len(topics)
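As a quick sanity check of what get_umass_score computes for one word pair, here is a small hand-checkable sketch in plain Python. The toy matrix and the helper name `umass_pair` are made up for illustration and are not part of the code above; the sketch only mirrors the formula log((D(w_i, w_j) + 1) / D(w_i)), where D counts the documents that contain the word(s).

```python
import math

# Toy document-term matrix: 4 documents (rows) x 3 words (columns).
# The counts are invented purely for illustration.
toy_matrix = [
    [2, 1, 0],
    [1, 0, 0],
    [3, 2, 1],
    [0, 0, 4],
]

def umass_pair(matrix, i, j):
    """Return log((D(w_i, w_j) + 1) / D(w_i)) for columns i and j."""
    d_i = sum(1 for row in matrix if row[i] > 0)                   # docs containing word i
    d_ij = sum(1 for row in matrix if row[i] > 0 and row[j] > 0)   # docs containing both words
    return math.log((d_ij + 1) / d_i)

# Word 0 appears in 3 documents; words 0 and 1 co-occur in 2 of them: log(3 / 3)
score_01 = umass_pair(toy_matrix, 0, 1)
# Words 0 and 2 co-occur in only 1 document: log(2 / 3)
score_02 = umass_pair(toy_matrix, 0, 2)
print(score_01, score_02)
```

Pairs that rarely share a document get a more negative score, and a topic's coherence is the sum of these pair scores over its top words.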

Usage:

for n_topics in range(5, 1000, 50):
    svd = TruncatedSVD(n_components=n_topics, n_iter=7, random_state=42)
    svd.fit(tfidf_matrix)
    avg_coherence = get_average_topic_coherence(tfidf_matrix, svd.components_, 10)
    print(str(n_topics) + " " + str(avg_coherence))

Output:

5 -72.44478726897546
55 -86.18040144608892
105 -88.9175058514422
155 -90.3841147807378
205 -91.83948259181923
255 -92.01751480271953 < best
305 -90.73603639282118
355 -89.85740639388695
405 -89.41916273620417
455 -87.66472648614531
505 -85.06725618307024
555 -81.1419066684933
605 -77.03963739283286
655 -73.04509144854593
705 -69.84849596265884
755 -68.01357538571055
805 -67.48039395600706
855 -67.53091204608572
905 -67.23467504644942
955 -66.86079451952988

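I treated a lower UMass coherence as better, so in my case 255 topics works best for my corpus. (UMass coherence is more commonly read the other way, with scores closer to zero indicating more coherent topics, so decide which convention you are following before picking a value.) I used the 10 most relevant words per topic; you can choose a different number. The scores will change, but the optimal number of topics (SVD components) typically stays the same.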

I am using TF-IDF vectors, but this coherence measure should work with any term-frequency-based approach (e.g. bag-of-words).
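Since the question also asks about LDA, here is a minimal sketch of the same UMass idea applied to LatentDirichletAllocation on bag-of-words counts. It is an illustration under assumptions, not a tested recipe: the toy corpus, the function name `umass_topic_coherence`, and the choice of 5 top words are all made up here, and the co-occurrence counts are computed with one matrix product instead of pair by pair.

```python
import math

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def umass_topic_coherence(dt_matrix, topic, n_top_words=10):
    """UMass coherence of one topic (hypothetical helper, not a sklearn API)."""
    top = np.argsort(topic)[::-1][:n_top_words]  # indices of the heaviest words, best first
    # 1 where a document contains the word, restricted to the top words
    binary = (csr_matrix(dt_matrix)[:, top] > 0).astype(int)
    # co_counts[a, b] = number of documents containing both word a and word b
    co_counts = (binary.T @ binary).toarray()
    score = 0.0
    for j in range(1, len(top)):
        for i in range(j):
            # co_counts[i, i] is the document frequency of the higher-ranked word
            score += math.log((co_counts[i, j] + 1) / co_counts[i, i])
    return score


# Toy corpus, invented for illustration only
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks and bonds",
    "pets need food and care",
    "bond markets move with stocks",
]
bow = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                random_state=42, max_iter=10)
lda.fit(bow)

# lda.components_ has one row of word weights per topic, like svd.components_
scores = [umass_topic_coherence(bow, topic, n_top_words=5) for topic in lda.components_]
print(sum(scores) / len(scores))
```

Passing `svd.components_` from a fitted TruncatedSVD instead of `lda.components_` should give a score for LSA computed the same way, which is what makes the two models comparable on one corpus.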
