
How to cluster documents under topics using latent semantic analysis (LSA)

I have been working on latent semantic analysis (LSA) and followed this example: https://radimrehurek.com/gensim/tut2.html

It covers clustering terms under topics, but I could not find any way to cluster documents under topics.

In that example it says: "According to LSI, 'tree', 'graph' and 'minors' are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic."

How can we relate those five documents to their related topic with Python code?

You can find my Python code below. I would appreciate any help.

from gensim import corpora, models

#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

texts = [[word for word in text if word not in tokens_once] for text in texts]

dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]

# extract 2 LSI topics from the tf-idf corpus; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=2)

corpus_lsi = lsi[corpus_tfidf]


# print each of the extracted topics
for i in range(lsi.num_topics):
    print(lsi.print_topic(i))

for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

corpus_lsi holds a list of 9 vectors, one per document. Each vector stores at its i-th index how strongly that document is associated with topic i. If you want to assign each document to exactly one topic, pick the topic index with the highest value in the vector.
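The assignment step described above can be sketched in plain Python. This assumes `corpus_lsi` yields, for each document, a list of `(topic_id, weight)` pairs, which is the shape gensim's transformed-corpus iterator produces; the weights below are made-up illustrative values, not real LSI output:

```python
# Each document is a list of (topic_id, weight) pairs, the same shape
# that iterating gensim's corpus_lsi yields.
# The weights here are illustrative placeholders, not real LSI values.
corpus_lsi = [
    [(0, 0.07), (1, 0.57)],   # doc 0: strongest on topic 1
    [(0, -0.01), (1, 0.72)],  # doc 1: strongest on topic 1
    [(0, 0.62), (1, 0.05)],   # doc 2: strongest on topic 0
]

def dominant_topic(doc_vec):
    """Return the topic id carrying the largest weight for one document."""
    return max(doc_vec, key=lambda pair: pair[1])[0]

assignments = [dominant_topic(doc) for doc in corpus_lsi]
print(assignments)  # [1, 1, 0]
```

Note that LSI coordinates can be negative, so some practitioners compare absolute values instead of raw values; the answer above compares raw values, and that is what this sketch does.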

