
Which gensim corpora class should I use to load an LDA transformed corpus? - Python

How do I load an LDA-transformed corpus with Python's gensim? Here is what I have tried:

from gensim import corpora, models
import numpy.random
numpy.random.seed(10)

doc0 = [(0, 1), (1, 1)]
doc1 = [(0,1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0,doc1,doc2,doc3]
dictionary = corpora.Dictionary(corpus)

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
corpus_tfidf.save('x.corpus_tfidf')

# To load the tf-idf corpus I saved, I used corpora.MmCorpus.load()
corpus_tfidf = corpora.MmCorpus.load('x.corpus_tfidf')

lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda[corpus]
corpus_lda.save('x.corpus_lda')

for i,j in enumerate(corpus_lda):
  print j, corpus[i]

The code above outputs:

[(0, 0.54259038344543631), (1, 0.45740961655456358)] [(0, 1), (1, 1)]
[(0, 0.56718063124157458), (1, 0.43281936875842542)] [(0, 1)]
[(0, 0.54255407573666647), (1, 0.45744592426333358)] [(0, 1), (1, 1)]
[(0, 0.75229707773868093), (1, 0.2477029222613191)] [(0, 3), (1, 1)]

# [(<topic_number_from x.corpus_lda model>, 
#   <probability of this topic for this document>), 
#  (<topic# from lda model>, <prob of this top for this doc>)] [<document[i] from corpus>]

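If I want to load the saved LDA-transformed corpus, which gensim class should I use to load it?

I tried corpora.MmCorpus.load(), but it does not give me the same corpus output as shown above: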

>>> lda_corpus = corpora.MmCorpus.load('x.corpus_lda')
>>> for i,j in enumerate(lda_corpus):
...   print j, corpus[i]
... 
[(0, 0.55087839240547309), (1, 0.44912160759452685)] [(0, 1), (1, 1)]
[(0, 0.56715974584850259), (1, 0.43284025415149735)] [(0, 1)]
[(0, 0.54275680271070581), (1, 0.45724319728929413)] [(0, 1), (1, 1)]
[(0, 0.75233330695720912), (1, 0.24766669304279079)] [(0, 3), (1, 1)]

There are several problems in your code.

To save a corpus in MatrixMarket format, use:

corpora.MmCorpus.serialize('x.corpus_lda', corpus_lda)

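See the gensim documentation for details.

You are training on corpus_tfidf, but then you transform only lda[corpus] (without tfidf). Use either tfidf or plain bag-of-words, but use it consistently.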

After trying all the possible corpora.XCorpus classes (http://radimrehurek.com/gensim/apiref.html), I tried loading with BleiCorpus, and it appears to produce the same output, only with fewer decimal digits.

>>> from gensim import corpora, models
>>> import numpy.random
>>> numpy.random.seed(10)
>>> 
>>> doc0 = [(0, 1), (1, 1)]
>>> doc1 = [(0,1)]
>>> doc2 = [(0, 1), (1, 1)]
>>> doc3 = [(0, 3), (1, 1)]
>>> corpus = [doc0,doc1,doc2,doc3]
>>> dictionary = corpora.Dictionary(corpus)
>>> 
>>> tfidf = models.TfidfModel(corpus)
>>> corpus_tfidf = tfidf[corpus]
>>> 
>>> lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=3)
>>> corpus_lda = lda[corpus]
>>> corpus_lda.save('x.corpus_lda')
>>> 
>>> for i,j in enumerate(corpus_lda):
...   print j, corpus[i]
... 
[(0, 0.15441373560695118), (1, 0.56498524668290762), (2, 0.28060101771014123)] [(0, 1), (1, 1)]
[(0, 0.59512220481946487), (1, 0.22817873367464175), (2, 0.17669906150589348)] [(0, 1)]
[(0, 0.52219543266162705), (1, 0.15449347037173339), (2, 0.32331109696663957)] [(0, 1), (1, 1)]
[(0, 0.83364632205849853), (1, 0.086514534997754619), (2, 0.079839142943746944)] [(0, 3), (1, 1)]
>>>
>>> lda_corpus = corpora.BleiCorpus.load('x.corpus_lda')
>>> for i,j in enumerate(lda_corpus):
...   print j, corpus[i]
... 
[(0, 0.154413735607), (1, 0.564985246683), (2, 0.280601017710)] [(0, 1), (1, 1)]
[(0, 0.595122204819), (1, 0.228178733675), (2, 0.176699061506)] [(0, 1)]
[(0, 0.522195432662), (1, 0.154493470372), (2, 0.323311096967)] [(0, 1), (1, 1)]
[(0, 0.833646322058), (1, 0.086514534998), (2, 0.079839142944)] [(0, 3), (1, 1)]
