
Why does Biterm Topic Model (BTM) return a coherence score of -100?

I am using the biterm.cbtm library to train a topic model on about 2500 short posts. When BTM finishes, I get the following 10 topics, along with the topic coherence values shown in this picture: https://ibb.co/Kqy992H

I am trying to understand what those negative coherence values mean and why they are so low. I have read a lot of related research and could not find one paper that explains the range of the coherence value. Also, most of the papers were about the LDA coherence value, as BTM is not well documented.

Does anyone know the range/meaning of the coherence values I am getting? Why is the coherence between -76 and -111?

You can see my code below:

from sklearn.feature_extraction.text import CountVectorizer
from biterm.cbtm import oBTM
from biterm.utility import vec_to_biterms, topic_summuary  # helper functions
import numpy as np
from numpy import array
import pandas as pd
import pickle
import re
import warnings
import pyLDAvis
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

warnings.filterwarnings('ignore')  # Ignore all warnings that arise here to enhance clarity


def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        # Strip URLs.
        docs[idx] = re.sub(r'(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*', '', docs[idx])
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        if len(docs[idx]) < 50:
            continue
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    # Remove words that are only three characters or fewer.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
    return docs


colnames = ['post']
with open('cleantext.p', 'rb') as handle:
    dict = pickle.load(handle)
dict['text'] = list(filter(None.__ne__, dict['text']))
print("Total posts: " + str(len(dict['text'])))

p_df = pd.DataFrame.from_dict(dict)
docs = array(p_df['text'])
print("ALL DOCUMENTS: " + str(len(docs)))
docs = docs_preprocessor(docs)

# Write the preprocessed posts to disk, dropping very short ones
# and any remaining digits.
outfile = open("posts.txt", "w+")
total_docs = 0
for sentence in docs:
    if len(sentence) < 3:
        continue
    total_docs += 1
    for word in sentence:
        result = ''.join([i for i in word if not i.isdigit()])
        outfile.write(result + " ")
    outfile.write("\n")
outfile.close()
print("Total docs: " + str(total_docs))

print("Reading sentences. . .")
texts = open('posts.txt', 'r').read().splitlines()

vec = CountVectorizer(stop_words='english')
print("Building Vectors. . .")
X = vec.fit_transform(texts).toarray()
print("Building Vocabulary. . .")
vocab = np.array(vec.get_feature_names())
biterms = vec_to_biterms(X)

print("BTM modelling. . .")
btm = oBTM(num_topics=10, V=vocab)
print("\n\n Train Online BTM..")
btm.fit(biterms, iterations=100)
topics = btm.transform(biterms)

print("\n\n Topic coherence..")
topic_summuary(btm.phi_wz.T, X, vocab, 10)

# I am getting a weird error about pyLDAvis here. Why?
print("\n\n Visualize Topics..")
vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
pyLDAvis.save_html(vis, 'btm.html')

I guess we can refer to how coherence is interpreted for LDA, since the formula should be the same. :)
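For intuition: the coherence that biterm's topic_summuary prints appears to be the UMass-style measure of Mimno et al. (2011), a sum over pairs of a topic's top words of log((co-document frequency + 1) / document frequency). Each term is the log of a ratio that is almost always below 1, so every topic's score is negative by construction; the question is only how negative. A minimal sketch of that computation, assuming X is the document-term count matrix from the code above and top_idx is a hypothetical array of one topic's top-word column indices:

import numpy as np

def umass_coherence(X, top_idx):
    """UMass-style topic coherence: sum over top-word pairs (v_m, v_l)
    of log((D(v_m, v_l) + 1) / D(v_l)), where D counts documents."""
    occurs = X[:, top_idx] > 0  # per-document occurrence of each top word
    score = 0.0
    for m in range(1, len(top_idx)):
        for l in range(m):
            co_doc = np.sum(occurs[:, m] & occurs[:, l])  # D(v_m, v_l)
            doc = np.sum(occurs[:, l])                    # D(v_l)
            score += np.log((co_doc + 1) / doc)           # log of a ratio, usually <= 1
    return score

With the default 10 top words there are 45 such pairs, so an average log ratio of only about -2 already yields a score of -90. Values between -76 and -111 are therefore unremarkable; the number is only meaningful relative to other topics or models scored the same way, with values closer to 0 indicating top words that co-occur more often.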

You may take a look at the interpretation here :) Negative Values: Evaluate Gensim LDA with Topic Coherence
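If you want to sanity-check the scale independently, gensim's CoherenceModel can score the same top-word lists: coherence='u_mass' also returns negative numbers of roughly this magnitude, while coherence='c_v' maps to a roughly [0, 1] range that some people find easier to compare. A rough sketch, assuming hypothetical names texts_tokenized (the token lists produced by docs_preprocessor) and topic_words (a list of the top-10 word lists per topic, all present in the dictionary):

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Build a gensim dictionary and bag-of-words corpus from the same tokens.
dictionary = Dictionary(texts_tokenized)
corpus = [dictionary.doc2bow(doc) for doc in texts_tokenized]

# UMass coherence: negative, like BTM's numbers.
cm = CoherenceModel(topics=topic_words, corpus=corpus,
                    dictionary=dictionary, coherence='u_mass')
print(cm.get_coherence())

# c_v coherence: roughly in [0, 1], often easier to interpret.
cm_cv = CoherenceModel(topics=topic_words, texts=texts_tokenized,
                       dictionary=dictionary, coherence='c_v')
print(cm_cv.get_coherence())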

