Inefficiency of topic modelling for text clustering

I tried doing text clustering using LDA, but it isn't giving me distinct clusters. Below is my code:

#Import libraries
from gensim import corpora, models
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS
from itertools import chain

#stop words
stoplist = list(STOPWORDS)
new = ['education','certification','certificate','certified']
stoplist.extend(new)
stoplist.sort()

#read data (raw string avoids treating the Windows path backslash as an escape)
dat = pd.read_csv(r'D:\data_800k.csv', encoding='latin').Certi.tolist()
#remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist] for document in dat]
#dictionary
dictionary = corpora.Dictionary(texts)
#corpus
corpus = [dictionary.doc2bow(text) for text in texts]
#train model (minimum_probability=0 keeps every topic's score for each document)
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=25, workers=4, minimum_probability=0)
#print topics
lda.print_topics(num_topics=25, num_words=7)
#get per-document topic distributions
lda_corpus = lda[corpus]
#flatten all (topic, score) pairs into a single list of scores
scores = list(chain(*[[score for topic_id, score in doc]
                      for doc in lda_corpus]))


#threshold = mean score; with minimum_probability=0 each document's
#scores sum to 1, so this mean works out to 1/num_topics = 1/25 = 0.04
threshold = sum(scores)/len(scores)
print(threshold)  # 0.039999999971137644

#cluster1: documents whose score for topic 0 exceeds the threshold
cluster1 = [j for i, j in zip(lda_corpus, dat) if i[0][1] > threshold]

#cluster2: documents whose score for topic 1 exceeds the threshold
cluster2 = [j for i, j in zip(lda_corpus, dat) if i[1][1] > threshold]

The problem is that there are overlapping elements: documents in cluster1 tend to also be present in cluster2, and so on.

I also tried increasing the threshold manually to 0.5, but it gives me the same issue.
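To see why a shared cutoff produces overlapping clusters, one can count how many topics each document clears; a quick diagnostic sketch, reusing lda_corpus and threshold from the code above:

#count, per document, how many of the 25 topics exceed the threshold
from collections import Counter
topics_over_cutoff = Counter(sum(1 for topic_id, score in doc if score > threshold)
                             for doc in lda_corpus)
print(topics_over_cutoff)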

That is just realistic.

Neither documents nor words are usually uniquely assignable to a single cluster.

If you were to manually label some data, you would also quickly find documents that cannot be clearly labeled as one cluster or the other. So it is good if the algorithm doesn't pretend there is a clean unique assignment.
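If a single label per document is nevertheless required, a common workaround is to assign each document to its most probable topic rather than thresholding; a minimal sketch, reusing lda_corpus and dat from the question:

#hard assignment: each document goes to its single most probable topic
clusters = {}
for doc_topics, text in zip(lda_corpus, dat):
    best_topic = max(doc_topics, key=lambda t: t[1])[0]
    clusters.setdefault(best_topic, []).append(text)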
