
How to cluster documents under topics using latent semantic analysis (LSA)

I have been working on latent semantic analysis (LSA) and followed this example: https://radimrehurek.com/gensim/tut2.html

It mentions "clustering under topics", but I couldn't find any way to actually cluster documents under topics.

In that example it says: "It appears that according to LSI, 'trees', 'graph' and 'minors' are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic."

How can we relate those documents to their dominant topics with Python code?

You can find my Python code below. I would appreciate any help.

from gensim import corpora, models, similarities

#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

texts = [[word for word in text if word not in tokens_once] for text in texts]

dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]

# extract 2 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=2)

corpus_lsi = lsi[corpus_tfidf]


for i in range(lsi.num_topics):
    print(lsi.print_topic(i))

for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

corpus_lsi is a list of 9 vectors, one per document. Each vector stores at its i-th index how strongly that document is associated with topic i. If you want to assign each document to exactly one topic, pick the topic index with the highest value in that document's vector.
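The assignment step above can be sketched in plain Python. Each document in corpus_lsi is a list of (topic_id, weight) pairs; the weights below are illustrative values of the kind the tutorial produces, not output of an actual gensim run, and dominant_topic is a helper name chosen here for the example:

```python
# Hypothetical (topic_id, weight) pairs standing in for corpus_lsi,
# shaped like the output of the gensim tutorial (9 documents, 2 topics).
corpus_lsi = [
    [(0, 0.066), (1, 0.520)],   # doc 0
    [(0, 0.197), (1, 0.761)],   # doc 1
    [(0, 0.090), (1, 0.724)],   # doc 2
    [(0, 0.076), (1, 0.632)],   # doc 3
    [(0, 0.102), (1, 0.574)],   # doc 4
    [(0, 0.703), (1, -0.161)],  # doc 5
    [(0, 0.877), (1, -0.168)],  # doc 6
    [(0, 0.910), (1, -0.141)],  # doc 7
    [(0, 0.617), (1, 0.054)],   # doc 8
]

def dominant_topic(doc_vector):
    # Pick the topic whose coefficient is largest in absolute value;
    # LSI coefficients can be negative, so compare magnitudes.
    return max(doc_vector, key=lambda pair: abs(pair[1]))[0]

assignments = [dominant_topic(doc) for doc in corpus_lsi]
print(assignments)  # -> [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

With these weights the first five documents fall under the second topic (index 1) and the remaining four under the first (index 0), matching the tutorial's description.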
