Gensim LDA model topic diff resulting in nan
I am pretty new to topic modeling and Gensim, so I am still trying to understand many of the concepts.
I am trying to run Gensim's LDA model on my corpus, which contains around 25,446,114 tweets. I created a streaming corpus and an id2word dictionary using Gensim. I am using num_topics = 100 and chunksize = 85000 (loading 85,000 tweets at a time).
I am using Gensim 3.5.0 and NumPy 1.15.3.
Here is the link to the corpus and id2word dictionary: https://drive.google.com/drive/folders/1FrJ8gJbiDqp3VC5syOjRVcQPcESdYOYa?usp=sharing
I don't know what I am doing wrong or how to solve this. The topic diff first hits inf and then nan, and I start getting the same topic. Please help!
Here is the code:
import pprint
import logging
import gensim
logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)
corpus = gensim.corpora.MmCorpus('disasterTweets.mm')
id2word = gensim.corpora.Dictionary.load('disasterTweets.dict')
id2word.filter_tokens(bad_ids=[id2word.token2id['eofeofeof']])
print('eofeofeof' in id2word.token2id)
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       chunksize=85000,
                                       num_topics=100)
pprint.pprint(lda_model.print_topics())
Here are the errors I am receiving:
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:1023: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:690: RuntimeWarning: overflow encountered in add
  sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:700: RuntimeWarning: invalid value encountered in multiply
  sstats *= self.expElogbeta
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:690: RuntimeWarning: overflow encountered in add
  sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:700: RuntimeWarning: invalid value encountered in multiply
  sstats *= self.expElogbeta
Process ForkPoolWorker-30:
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamulticore.py", line 333, in worker_e_step
    worker_lda.do_estep(chunk)  # TODO: auto-tune alpha?
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 725, in do_estep
    gamma, sstats = self.inference(chunk, collect_sstats=True)
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 662, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 287500 is out of bounds for axis 1 with size 287500
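The three warnings above (divide by zero in log, overflow in add, invalid value in multiply) match the usual floating-point route from inf to nan. The snippet below is only a plain-Python sketch of that arithmetic, not Gensim's code: the numbers are made up, and `mean_abs` is a hypothetical stand-in for whatever summary the topic diff uses.

```python
import math

# Illustrative values only; none of these numbers come from the actual model.
big = 1.7e308                 # close to the largest float64 (about 1.8e308)
overflowed = big + big        # exceeds the float64 range, so it becomes inf
poisoned = overflowed * 0.0   # inf * 0 has no defined value, so it becomes nan


def mean_abs(values):
    """Mean of absolute values, a hypothetical stand-in for a diff summary."""
    return sum(abs(v) for v in values) / len(values)


# NumPy's np.log(0.0) returns -inf together with a divide-by-zero warning;
# float("-inf") stands in for that result here.
diff_with_inf = mean_abs([float("-inf"), 0.5])

# Once a single nan is present, every summary computed from it is nan too.
diff_with_nan = mean_abs([poisoned, 0.5])

print(overflowed, poisoned, diff_with_inf, diff_with_nan)
```

This would be consistent with the diff reporting inf first and nan afterwards, but it does not show which input caused the first overflow.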
From what I understood from reading issue 217 on the Gensim GitHub Issues page, this appears to be a bug, and some people there reported that the problem was resolved by changing some of the parameters. Please check that thread first to see whether the suggestions there solve your problem.
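Separately from issue 217, the IndexError in the traceback (index 287500 with size 287500) says that some document contained a token id equal to the model's vocabulary size. A minimal sketch of a check for that condition follows. It assumes the corpus iterates as lists of (token_id, count) pairs, and the helper names `max_token_id` and `ids_fit_vocabulary` are hypothetical, not part of Gensim.

```python
def max_token_id(bow_corpus):
    """Return the largest token id in a bag-of-words corpus, or -1 if empty."""
    largest = -1
    for doc in bow_corpus:
        for token_id, _count in doc:
            if token_id > largest:
                largest = token_id
    return largest


def ids_fit_vocabulary(bow_corpus, vocab_size):
    """True when every token id is a valid index into a vocabulary of vocab_size."""
    return max_token_id(bow_corpus) < vocab_size


# Toy corpus: two documents, token ids 0 through 4.
toy_corpus = [[(0, 1), (2, 3)], [(1, 2), (4, 1)]]

print(ids_fit_vocabulary(toy_corpus, 5))  # ids 0..4 fit a vocabulary of 5
print(ids_fit_vocabulary(toy_corpus, 4))  # id 4 is out of range for size 4
```

With the question's objects, the call would presumably be `ids_fit_vocabulary(corpus, len(id2word))`. I have not run this against the linked data. If it returns False, one possibility worth ruling out is that the dictionary was changed (for example, by removing a token) after the corpus was serialized, so the two no longer agree on ids.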