
Latent Semantic Indexing with gensim

To try the latent semantic indexing (LSI) method from gensim, I want to begin with a small, classic example like this:

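import logging, gensim, bz2
# Load the id -> word mapping and the TF-IDF corpus created beforehand
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# Train an LSI model with 400 topics on that corpus
lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
# etc.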

My question is: how do I get the corpus iterator 'wiki_en_tfidf.mm'? Do I have to download it from somewhere? I have searched the Internet but did not find anything. Any help would be appreciated.

The first page of search results includes a link to:

https://radimrehurek.com/gensim/wiki.html

which says "First let's load the corpus iterator and dictionary, created in the second step above."

Step 2 is

  2. Convert the articles to plain text (process Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don't even need to uncompress the whole archive to disk. There is a script included in gensim that does just that, run:

    $ python -m gensim.scripts.make_wiki
