
How to cluster documents under topics using latent semantic analysis (LSA)

I've been working on latent semantic analysis (LSA) and applied this example: https://radimrehurek.com/gensim/tut2.html

It covers clustering terms under topics, but I couldn't find anything about how to cluster documents under topics.

In that example, it says: 'It appears that according to LSI, "trees", "graph" and "minors" are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic.'

How can we relate those documents to their topics in Python code?

You can find my Python code below. I would appreciate any help.

from gensim import corpora, models, similarities

#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

texts = [[word for word in text if word not in tokens_once] for text in texts]

dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]

# extract 2 LSI topics from the tf-idf corpus; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=2)

corpus_lsi = lsi[corpus_tfidf]


# print the words that contribute most to each extracted topic
for i in range(lsi.num_topics):
    print(lsi.print_topic(i))

for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

corpus_lsi is a list of 9 vectors, one per document. Each vector stores at its i-th index how strongly that document relates to topic i. If you just want to assign each document to one topic, choose the topic index with the highest value in its vector.
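For example, assuming the corpus_lsi variable from the code above, a minimal sketch of that assignment could look like this (the docs_per_topic grouping is just one way to present the result; since LSI weights can be negative, the absolute value is used here to measure the strength of association):

from collections import defaultdict

# group document indices by the topic they are most strongly associated with
docs_per_topic = defaultdict(list)
for doc_id, doc in enumerate(corpus_lsi):
    # doc is a list of (topic_id, weight) pairs for this document
    best_topic = max(doc, key=lambda item: abs(item[1]))[0]
    docs_per_topic[best_topic].append(doc_id)

for topic_id, doc_ids in sorted(docs_per_topic.items()):
    print("topic %d: documents %s" % (topic_id, doc_ids))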
