
How do I find Coherence Score for LSA and LDA for SkLearn Models?

I want to compare coherence scores for LSA and LDA models.

LSA model:

lsa_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=40, random_state=5000)

lsa_top=lsa_model.fit_transform(vect_text)

LDA model:

lda_model=LatentDirichletAllocation(n_components=20,learning_method='online',random_state=42,max_iter=1) 
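Neither sklearn estimator exposes a coherence score directly; one common workaround (an assumption here, not part of the original question) is to pull the top-weighted words per topic out of the fitted model's `components_` and score those word lists externally. A minimal sketch with invented weights and vocabulary:

```python
# Toy stand-in for lda_model.components_ / lsa_model.components_
# (an n_topics x n_vocab weight matrix); weights and vocab are invented.
components = [[0.1, 0.9, 0.3, 0.7],
              [0.8, 0.2, 0.6, 0.4]]
vocab = ["apple", "banana", "cherry", "date"]

def top_words(components, vocab, n_top=2):
    """Return the n_top highest-weighted words for each topic row."""
    return [[vocab[i] for i in
             sorted(range(len(row)), key=row.__getitem__, reverse=True)[:n_top]]
            for row in components]

topics = top_words(components, vocab)
# topics -> [['banana', 'date'], ['apple', 'cherry']]
```

If gensim is available, these word lists can then be scored with `CoherenceModel(topics=topics, texts=..., dictionary=..., coherence='c_v')`; that route is a suggestion, not something the question already uses.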

Can someone please help me calculate the coherence scores of these two models?

Thank you in advance!

I'm using sklearn's TfidfVectorizer combined with TruncatedSVD to find the best topics for my corpus. I could not find a built-in coherence measure for TruncatedSVD, so I had to implement my own. The code is based on this article:

http://qpleple.com/topic-coherence-to-evaluate-topic-models/

I've decided to stick to the intrinsic UMass measure, since it is relatively easy to implement. The support methods are:

import math

def get_umass_score(dt_matrix, i, j):
    # Binarize the document-term matrix: 1 if the term occurs in the document.
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    # Documents containing both terms.
    col_ij = col_i + col_j
    col_ij = (col_ij == 2).astype(int)
    Di, Dij = col_i.sum(), col_ij.sum()
    # UMass pair score: log((D(i, j) + 1) / D(i)); the +1 smoothing
    # avoids log(0). Assumes term i occurs in at least one document.
    return math.log((Dij + 1) / Di)

def get_topic_coherence(dt_matrix, topic, n_top_words):
    # Pair each word weight with its column index, then keep the
    # n_top_words highest-weighted words.
    indexed_topic = zip(topic, range(0, len(topic)))
    topic_top = sorted(indexed_topic, key=lambda x: 1 - x[0])[0:n_top_words]
    coherence = 0
    # Sum the UMass score over every pair i < j of top words.
    for j_index in range(0, len(topic_top)):
        for i_index in range(0, j_index):  # range(0, j_index - 1) would skip adjacent pairs
            i = topic_top[i_index][1]
            j = topic_top[j_index][1]
            coherence += get_umass_score(dt_matrix, i, j)
    return coherence

def get_average_topic_coherence(dt_matrix, topics, n_top_words):
    total_coherence = 0
    for i in range(0, len(topics)):
        total_coherence += get_topic_coherence(dt_matrix, topics[i], n_top_words)
    return total_coherence / len(topics)
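As a sanity check on the pair score, here is the same UMass computation spelled out on a tiny invented document-term matrix (4 documents, 2 terms), without the helper functions:

```python
import math

# Invented occurrence counts: term 0 appears in all four documents,
# term 1 in the first two.
dt_matrix = [[3, 1],
             [1, 2],
             [2, 0],
             [5, 0]]

Di  = sum(1 for row in dt_matrix if row[0] > 0)                 # D(w0) = 4
Dij = sum(1 for row in dt_matrix if row[0] > 0 and row[1] > 0)  # D(w0, w1) = 2
score = math.log((Dij + 1) / Di)  # log(3/4), a slightly negative pair score
```

The binarization step means only the sparsity pattern of the matrix matters, not the magnitudes of the counts or weights.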

Usage:

from sklearn.decomposition import TruncatedSVD

for n_topics in range(5, 1000, 50):
    svd = TruncatedSVD(n_components=n_topics, n_iter=7, random_state=42)
    svd.fit(tfidf_matrix)
    # Score the learned components against the same document-term matrix.
    avg_coherence = get_average_topic_coherence(tfidf_matrix, svd.components_, 10)
    print(str(n_topics) + " " + str(avg_coherence))

Output:

5 -72.44478726897546
55 -86.18040144608892
105 -88.9175058514422
155 -90.3841147807378
205 -91.83948259181923
255 -92.01751480271953 < best
305 -90.73603639282118
355 -89.85740639388695
405 -89.41916273620417
455 -87.66472648614531
505 -85.06725618307024
555 -81.1419066684933
605 -77.03963739283286
655 -73.04509144854593
705 -69.84849596265884
755 -68.01357538571055
805 -67.48039395600706
855 -67.53091204608572
905 -67.23467504644942
955 -66.86079451952988

The lower the UMass coherence, the better. In my case 255 topics is the best fit for my corpus. I used the 10 most relevant words per topic; you can use your own number. You will get different numbers, but the optimal number of topics (SVD components) will generally be the same.
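Selecting the component count is then a one-liner over the collected scores. With invented numbers standing in for the loop's output, and following the lower-is-better convention used above:

```python
# Hypothetical subset of (n_topics -> avg UMass coherence) results.
results = {5: -72.44, 255: -92.02, 505: -85.07, 955: -66.86}
best_n_topics = min(results, key=results.get)  # most negative score wins
```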

I'm using TF-IDF vectors, but this coherence measure should work with any term-frequency based approach (e.g. BOW).
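To illustrate that claim, the same pair score can be computed straight from raw bag-of-words counts, since the `dt_matrix > 0` binarization discards the weighting anyway. A self-contained sketch on a toy corpus (documents invented for illustration):

```python
import math
from collections import Counter

docs = ["cat sat mat", "cat sat", "dog sat mat"]  # toy corpus
counts = [Counter(d.split()) for d in docs]       # per-document BOW counts

def umass(term_i, term_j):
    """UMass pair score from raw counts: log((D(i, j) + 1) / D(i))."""
    di  = sum(1 for c in counts if c[term_i] > 0)
    dij = sum(1 for c in counts if c[term_i] > 0 and c[term_j] > 0)
    return math.log((dij + 1) / di)

umass("cat", "sat")  # "sat" co-occurs in both of the two "cat" documents
```

A TF-IDF matrix of the same corpus would give identical scores, because both matrices have the same nonzero positions.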
