
LSI using gensim in Python

I'm using Python's gensim library to do latent semantic indexing. I followed the tutorials on the website, and it works pretty well. Now I'm trying to modify it a bit: I want to re-run the LSI model each time a document is added.

Here is my code:

from gensim import corpora, models

stoplist = set('for a of the and to in'.split())
num_factors = 3
corpus = []

for i in range(len(urls)):
    print "Importing", urls[i]
    doc = getwords(urls[i])
    cleandoc = [word for word in doc.lower().split() if word not in stoplist]
    if i == 0:
        dictionary = corpora.Dictionary([cleandoc])
    else:
        dictionary.addDocuments([cleandoc])
    newVec = dictionary.doc2bow(cleandoc)
    corpus.append(newVec)
    # rebuild the tf-idf and LSI models on every iteration
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
    corpus_lsi = lsi[corpus_tfidf]

getwords is a function I wrote that returns the contents of a website as a string. Again, it works if I wait until I've processed all of the documents before doing tf-idf and LSI, but that's not what I want. I want to do it on each iteration. Unfortunately, I get this error:

    Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "streamlsa.py", line 51, in <module>
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 303, in __init__
    self.addDocuments(corpus)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 365, in addDocuments
    self.printTopics(5) # TODO see if printDebug works and remove one of these..
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 441, in printTopics
    self.printTopic(i, topN = numWords)))
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 433, in printTopic
    return ' + '.join(['%.3f*"%s"' % (1.0 * c[val] / norm, self.id2word[val]) for val in most])
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/corpora/dictionary.py", line 52, in __getitem__
    return self.id2token[tokenid] # will throw for non-existent ids
KeyError: 1248

Usually the error pops up on the second document. I think I understand what it's telling me (the dictionary indices are bad); I just can't figure out WHY. I've tried lots of different things and nothing seems to work. Does anyone know what's going on?

Thanks!

This was a bug in gensim: the reverse id->word mapping gets cached, but the cache wasn't invalidated after addDocuments().

It was fixed in this commit in 2011: https://github.com/piskvorky/gensim/commit/b88225cfda8570557d3c72b0820fefb48064a049

OK, so I found a solution, albeit not an optimal one.

If you make a dictionary with corpora.Dictionary and then immediately add documents with dictionary.addDocuments, everything works fine.

But if you use the dictionary in between these two calls (by calling dictionary.doc2bow, or by attaching the dictionary to an LSI model via id2word), then your dictionary becomes 'frozen' and can't be updated. You can call dictionary.addDocuments and it will tell you it's updated, and it will even report the new dictionary size, e.g.:

INFO:dictionary:built Dictionary(6627 unique tokens) from 8 documents (total 24054 corpus positions)

But when you reference any of the new indices, you get an error. I'm not sure whether this is a bug or intended behavior (for whatever reason), but at the very least the fact that gensim reports successfully adding the document to the dictionary is surely a bug.

First I tried putting the dictionary calls in separate functions, where only a local copy of the dictionary should be modified. It still breaks. (In hindsight this makes sense: Python passes object references into functions rather than copies, so the 'local' dictionary was the same object all along.)

My next step was to try passing a copy of the dictionary, using copy.copy. This works, but obviously adds some overhead. It does, however, let you maintain a working copy of your corpus and dictionary. The biggest drawback for me is that this solution doesn't let me use filterTokens to remove words that appear only once in the corpus, because that would entail modifying the dictionary.

My other solution is simply to rebuild everything (the corpus, the dictionary, and the LSI and tf-idf models) on each iteration. With my small sample dataset this gives slightly better results, but it isn't scalable to very large datasets without running into memory problems. Still, for now this is what I'm doing.

If any experienced gensim users have a better (and more memory friendly) solution so that I won't run into problems with larger datasets, please let me know!

In doc2bow you can set allow_update=True, and it will automatically update your dictionary on each call to doc2bow.

http://radimrehurek.com/gensim/corpora/dictionary.html
