Gensim.Similarity: adding documents or live training

Question

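A little background on this project. I have a corpus of entries, each with an identifier and text, e.g. {name: "sports-football", text: "Content related to football sports"}.

I need to find the right match in this corpus for a given text input. I have managed this to some extent using Gensim similarity queries with LDA and LSI models.

How do I update the Gensim Similarity index with new documents? The idea is to keep training the model during the live phase.

These are the steps I followed.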

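QueryText = "Guardiola has moved Lionel Messi to the No. 9 role so that he doesn't have to drop deep, and I think Aguero will often drop back into a deeper position too."

Note: some of the code below is only pseudocode.

The index was created using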

`similarities.Similarity(indexpath, model,topics)`

Create a dictionary
dictionary = Dictionary(QueryText)
Create a corpus
corpus = Corpus(QueryText, dictionary)
Create an LDA model
LDAModel = ldaModel(corpus, dictionary)

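Update the existing dictionary, model, and index

Update the existing dictionary

existing_dictionary.add_documents(dictionary)

Update the existing LDA model

existing_lda_model.update(corpus)

Update the existing similarity index

existing_index.add_documents(LDAModel[corpus])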

The update seems to work, apart from the following warning.

gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2 perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words

Now let's run a similarity query for the query text.

vec_bow = dictionary.doc2bow(QueryText) 
vec_model = existing_lda_model[vec_bow] 
sims = existing_index[vec_model]

However, it fails with the following error.

Similarity index with 723 documents in 1 shards (stored under \Files\models\lda_model)
Similarity index with 725 documents in 0 shards (stored under \Files\models\lda_model)
\lib\site-packages\gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
  perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-8fe711724367> in <module>()
     45 trigram = Trigram.apply_trigram_model(queryText, bigram, trigram)
     46 vec_bow = dictionry.doc2bow(trigram)
---> 47 vec_model =  lda_model[vec_bow]
     48 print(vec_model)
     49 

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in __getitem__(self, bow, eps)
   1103             `(topic_id, topic_probability)` 2-tuples.
   1104         """
-> 1105         return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
   1106 
   1107     def save(self, fname, ignore=('state', 'dispatcher'), separately=None, *args, **kwargs):

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in get_document_topics(self, bow, minimum_probability, minimum_phi_value, per_word_topics)
    944             return self._apply(corpus, **kwargs)
    945 
--> 946         gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
    947         topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    948 

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
    442             Elogthetad = Elogtheta[d, :]
    443             expElogthetad = expElogtheta[d, :]
--> 444             expElogbetad = self.expElogbeta[:, ids]
    445 
    446             # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.

IndexError: index 718 is out of bounds for axis 1 with size 713
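
The last line of the traceback says that term id 718 was looked up in a matrix with only 713 columns. That suggests the bag-of-words vector came from a dictionary larger than the one the model was trained with. A small check along these lines could confirm this before querying. The helper name is hypothetical; with gensim, the second argument would be the model's `num_terms` attribute.

```python
def out_of_vocabulary_ids(bow, num_terms):
    """Return the term ids in a bag-of-words vector that the model has no column for."""
    return [term_id for term_id, _count in bow if term_id >= num_terms]


# Hypothetical bag-of-words vector of (term_id, count) pairs.
bow = [(3, 1), (712, 2), (718, 1)]
bad_ids = out_of_vocabulary_ids(bow, num_terms=713)
print(bad_ids)  # [718]
```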

I would really appreciate any help. Looking forward to your replies.

Answer 1

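The later error (AssertionError: mismatch between supplied and computed number of non-zeros, raised for the sparse matrix) most likely comes from the problem the warning points to: perwordbound overflows, and the matrices computed from its undefined value make the update fail.

I suggest updating the model with larger batches rather than a single query. There may be a disproportion between the number of words already in the model and the relatively small number of words you are trying to update it with. With floating-point numbers, this can cause subtle errors.

Again, try updating the model with batches whose size is proportional to the model's source data (e.g. 1/10 or 1/20 of its size).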
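
One way to follow this advice is to buffer incoming documents and only call `update` once a batch is full. This is a minimal sketch; `BatchedUpdater`, `RecordingModel`, and the `divisor` parameter are hypothetical names, and the only assumption about the model is that it has an `update(corpus)` method, as gensim's `LdaModel` does.

```python
class BatchedUpdater:
    """Buffer incoming documents and update the model once a batch is full."""

    def __init__(self, model, base_corpus_size, divisor=10):
        self.model = model
        # Batch size is 1/divisor of the original training data, at least 1.
        self.batch_size = max(1, base_corpus_size // divisor)
        self.buffer = []

    def add(self, bow):
        """Queue one bag-of-words document; return True if an update ran."""
        self.buffer.append(bow)
        if len(self.buffer) < self.batch_size:
            return False
        self.model.update(self.buffer)
        self.buffer = []
        return True


class RecordingModel:
    """Stand-in for a real model; it only records the batches it receives."""

    def __init__(self):
        self.batches = []

    def update(self, corpus):
        self.batches.append(list(corpus))


model = RecordingModel()
updater = BatchedUpdater(model, base_corpus_size=30, divisor=10)  # batch size 3
results = [updater.add([(i, 1)]) for i in range(7)]
print(results)
```

With a real model, documents still sitting in the buffer are not yet reflected in the model, so queries that depend on them have to wait for the next flush.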

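Edit, based on this thread:

Melissa Roemmele wrote:

FYI, I also ran into this error when I tried to create an LSI index for a bag-of-words corpus without first converting it to tf-idf. I could build the LSI model on the bag of words, but building the index for it gave me the error.

You may want to try applying tf-idf to QueryText before passing it to the model.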

Solution 1 posted: 2018-01-23 22:29:09
