Python LSI using gensim not working

I am trying to classify emails based on the subject line, and I need an LSI representation in order to train the classifier. I compute tf-idf and then try to build the LSI model. However, it does not do any processing or write to any file at all. My code is below:

from itertools import islice
from gensim import corpora, models

# read the list of subjects used as features
with open('subject1000.csv') as myfile:
    head = list(islice(myfile, 500))  # only 500 subjects for training

# write those 500 subjects to their own file so they can be streamed later
with open('subject500.csv', 'w') as f500:
    for h in head:
        f500.write(h)
texts = (line.lower().split() for line in head)  # tokenized subjects

dictionary = corpora.Dictionary(texts)  # build the dictionary from all the words used
dictionary.compactify()
print(dictionary)  # checkpoint - went from 2215 unique tokens to 1418 for 500 subjects

# corpus streaming
class MyCorpus(object):
    def __iter__(self):
        # one document per line (open('subject1000.csv') would stream the full file instead)
        with open('subject500.csv') as subjects:
            for line in subjects:
                # each line becomes a bag-of-words: a list of (token_id, token_count) 2-tuples
                yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()  # object created
print('corpus created')

for vector in corpus:
    print(vector)

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]  # re-initialize the corpus according to the model to get the normalized frequencies
corpora.MmCorpus.serialize('subject500-tfidf', corpus_tfidf)  # store to disk for later use

print('TFIDF complete!')  # check - up to here it's OK

lsi300 = models.LsiModel(corpus_tfidf, num_topics=300, id2word=dictionary)  # train LSI on the tf-idf corpus
corpus_lsi300 = lsi300[corpus_tfidf]
print(corpus_lsi300)  # checkpoint
lsi300.print_topics(10, 5)  # checks
corpora.BleiCorpus.serialize('subjects500-lsi-300', corpus_lsi300)

I get output up to 'TFIDF complete!', but after that the program returns nothing for LSI. I am running the code above on 500 subject lines. Any ideas on what might be going wrong would be very much appreciated! Thanks.

The log output is below:

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens)
INFO:gensim.corpora.dictionary:built Dictionary(1418 unique tokens) from 500 documents (total 3109 corpus positions)
DEBUG:gensim.corpora.dictionary:rebuilding dictionary, shrinking gaps
INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:calculating IDF weights for 500 documents and 1418 features (3081 matrix non-zeros)
INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to subject500-tfidf
INFO:gensim.matutils:saving sparse matrix to subject500-tfidf
INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:saved 500x1418 matrix, density=0.435% (3081/709000)
DEBUG:gensim.matutils:closing subject500-tfidf
DEBUG:gensim.matutils:closing subject500-tfidf
INFO:gensim.corpora.indexedcorpus:saving MmCorpus index to subject500-tfidf.index
INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents
INFO:gensim.models.lsimodel:preparing a new chunk of documents
DEBUG:gensim.models.lsimodel:converting corpus to csc format
INFO:gensim.models.lsimodel:using 100 extra samples and 2 power iterations
INFO:gensim.models.lsimodel:1st phase: constructing (1418, 400) action matrix
INFO:gensim.models.lsimodel:orthonormalizing (1418, 400) action matrix
DEBUG:gensim.matutils:computing QR of (1418, 400) dense matrix
DEBUG:gensim.models.lsimodel:running 2 power iterations
DEBUG:gensim.matutils:computing QR of (1418, 400) dense matrix
DEBUG:gensim.matutils:computing QR of (1418, 400) dense matrix
INFO:gensim.models.lsimodel:2nd phase: running dense svd on (400, 500) matrix

Add logging with

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

and paste either the log or a gist link here.
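
If DEBUG lines like the ones in the question's log are wanted as well, a minimal sketch along these lines might help. It attaches a handler to the parent "gensim" logger, which should cover gensim.models.lsimodel and the other module loggers seen in the log. The in-memory buffer and the sample message are stand-ins for illustration only; a real script would more likely log to the console or to a file.

```python
import io
import logging

# Capture records in memory for this sketch; a real run would more likely
# use logging.basicConfig(...) with a console or file target instead.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(asctime)s : %(levelname)s : %(message)s"))

# gensim's modules log under the "gensim" namespace (e.g. gensim.models.lsimodel),
# so configuring the parent logger also covers its child loggers.
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.DEBUG)
gensim_logger.addHandler(handler)

# Stand-in for a message that gensim itself would emit during training.
logging.getLogger("gensim.models.lsimodel").debug("running 2 power iterations")

captured = log_buffer.getvalue()
print(captured)
```

With level=logging.INFO, as in the snippet above, the DEBUG lines would be filtered out and only the INFO lines would show up.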

I encountered the same problem while going through the gensim tutorials. Using a sample corpus of 2000 documents, I tried to convert it to LSI. Python crashed with the Windows error message "Python stopped working" at the "running dense svd" step, although it worked fine with a small corpus. The problem seems to have been an incorrect installation of SciPy from the then-current win32 binary. After installing Anaconda (a Python distribution that includes NumPy and SciPy), the problem disappeared.
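
One way to check whether the numerical stack itself is at fault, independently of gensim, is to run a dense SVD of the same shape as the last line of the question's log, (400, 500). This is only a rough sketch: it assumes the crash comes from the SciPy/LAPACK routine behind the dense SVD step, and it uses random data, not the asker's corpus. If this small script also hangs or crashes, the installation is the likely culprit.

```python
import numpy as np

try:
    from scipy.linalg import svd  # SciPy's dense SVD, assumed to be the failing piece
except ImportError:
    from numpy.linalg import svd  # fallback so the sketch still runs without SciPy

# Random dense matrix with the shape reported in the log: (400, 500).
rng = np.random.RandomState(0)
matrix = rng.rand(400, 500)

# Thin SVD: u is (400, 400), s has 400 singular values, vt is (400, 500).
u, s, vt = svd(matrix, full_matrices=False)
print(u.shape, s.shape, vt.shape)

# Sanity check: multiplying the factors back together should recover the input.
reconstructed = (u * s).dot(vt)
max_error = np.abs(reconstructed - matrix).max()
print(max_error)
```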

I encountered a similar issue earlier this week: my model was loading correctly, but printing topics wouldn't do anything. I found that it may be a bug in the behavior of print_topics(): if you run it from the command line, its output is muted, whereas if you run it in IPython or explicitly loop through the topics and print them, you should see your results.
