
Python IndexError using gensim for LDA Topic Modeling

Another thread asks a similar question to mine but leaves out reproducible code.

The goal of the script in question is to create a process that is as memory efficient as possible, so I tried to write a class, corpus(), to take advantage of gensim's capabilities. However, I am running into an IndexError that I'm not sure how to resolve when creating lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)).

The documents I am using are the same as those in the gensim tutorial, which I placed into tutorial_example.txt:

$ cat tutorial_example.txt 
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

Error received

$./gensim_topic_modeling.py -mn2 -w'english' -l1 tutorial_example.txt 
Traceback (most recent call last):
  File "./gensim_topic_modeling.py", line 98, in <module>
    lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 306, in __init__
    self.update(corpus)
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 543, in update
    self.log_perplexity(chunk, total_docs=lencorpus)
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 454, in log_perplexity
    perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 630, in bound
    gammad, _ = self.inference([doc])
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 366, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 7 is out of bounds for axis 1 with size 7

Below is the gensim_topic_modeling.py script:

##gensim_topic_modeling.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import re
import codecs
import logging
import fileinput
from operator import *
from itertools import *
from sklearn.cluster import KMeans
from gensim import corpora, models, similarities, matutils
import argparse
from nltk.corpus import stopwords

reload(sys)
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stdin = codecs.getreader('utf-8')(sys.stdin)


##defs

def stop_word_gen():
    nltk_langs=['danish', 'dutch', 'english', 'french', 'german', 'italian','norwegian', 'portuguese', 'russian', 'spanish', 'swedish']
    stoplist = []
    for lang in options.stop_langs.split(","):
        if lang not in nltk_langs:
            sys.stderr.write('\n'+"Language {0} not supported".format(lang)+'\n')
            continue
        stoplist.extend(stopwords.words(lang))
    return stoplist


def clean_texts(texts):
    # remove tokens that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    return [[word for word in text if word not in tokens_once] for text in texts]

##class

class corpus(object):
    """sparse vector matrix and dictionary"""
    def __iter__(self):
        first=True
        for line in fileinput.FileInput(options.input, openhook=fileinput.hook_encoded("utf-8")):
            # assume there's one document per line; tokenizer option determines how to split
            if options.space_tokenizer:
                rl = re.compile('\s+', re.UNICODE).split(unicode(line,'utf-8'))
            else:
                rl = re.compile('\W+', re.UNICODE).split(tagRE.sub(' ',line)) 
            # create dictionary
            tokens=[token.strip().lower() for token in rl if token != '' and token.strip().lower() not in stoplist]
            if first:
                first=False
                self.dictionary=corpora.Dictionary([tokens])
            else:
                self.dictionary.add_documents([tokens])
                self.dictionary.compactify
            yield self.dictionary.doc2bow(tokens)


##main 

if __name__ == '__main__':
    ##parser
    parser = argparse.ArgumentParser(
                description="Topic model from a column of text.  Each line is a document in the corpus")
    parser.add_argument("input", metavar="args")
    parser.add_argument("-l", "--document-frequency-limit", dest="doc_freq_limit", default=1,
                help="Remove all tokens less than or equal to limit (default 1)")
    parser.add_argument("-m", "--create-model", dest="create_model", default=False, action="store_true",
                help="Create and save a model from existing dictionary and input corpus.")
    parser.add_argument("-n", "--number-of-topics", dest="number_of_topics", default=2,
                help="Number of topics (default 2)")
    parser.add_argument("-t", "--space-tokenizer", dest="space_tokenizer", default=False, action="store_true", 
                help="Use alternate whitespace tokenizer")
    parser.add_argument("-w", "--stop-word-languages", dest="stop_langs", default="danish,dutch,english,french,german,italian,norwegian,portuguese,russian,spanish,swedish",
                help="Desired languages for stopword lists")
    options = parser.parse_args()

    ##globals

    stoplist=set(stop_word_gen())  
    tagRE = re.compile(r'<.*?>', re.UNICODE)    # Remove xml/html tags
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename="topic-modeling-log")
    logr = logging.getLogger("topic_model")
    logr.info("#"*15 + " started " + "#"*15)

    ##instance of class 

    checker=corpus()
    logr.info("#"*15 + " SPARSE MATRIX (pre-filter)" + "#"*15)

    ##view sparse matrix and dictionary

    for vector in checker: 
        logr.info(vector)
    logr.info("#"*15 + " DICTIONARY (pre-filter)" + "#"*15)
    logr.info(checker.dictionary)
    logr.info(checker.dictionary.token2id)
    #filter
    checker.dictionary.filter_extremes(no_below=int(options.doc_freq_limit)+1)
    logr.info("#"*15 + " DICTIONARY (post-filter)" + "#"*15)
    logr.info(checker.dictionary)
    logr.info(checker.dictionary.token2id)

    ##Create lda model

    if options.create_model:     
        tfidf = models.TfidfModel(checker,normalize=False)
        print tfidf
        logr.info("#"*15 + " corpus_tfidf " + "#"*15)
        corpus_tfidf = tfidf[checker]
        logr.info("#"*15 + " lda " + "#"*15)
        lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
        logr.info("#"*15 + " corpus_lda " + "#"*15)
        corpus_lda = lda[corpus_tfidf] 

        ##Evaluate topics based on threshold

        scores = list(chain(*[[score for topic,score in topic] \
                      for topic in [doc for doc in corpus_lda]]))
        threshold = sum(scores)/len(scores)
        print "threshold:",threshold
        print
        cluster1 = [j for i,j in zip(corpus_lda,documents) if i[0][1] > threshold]
        cluster2 = [j for i,j in zip(corpus_lda,documents) if i[1][1] > threshold]
        cluster3 = [j for i,j in zip(corpus_lda,documents) if i[2][1] > threshold]

The resulting topic-modeling-log file is below. Thanks in advance for any help!

topic-modeling-log

2014-05-25 02:58:50,482 : INFO : ############### started ###############
2014-05-25 02:58:50,483 : INFO : ############### SPARSE MATRIX (pre-filter)###############
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,483 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,483 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (27, 1), (28, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(9, 1), (26, 1), (30, 1)]
2014-05-25 02:58:50,485 : INFO : ############### DICTIONARY (pre-filter)###############
2014-05-25 02:58:50,485 : INFO : Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : {'minors': 30, 'generation': 22, 'testing': 16, 'iv': 29, 'engineering': 15, 'computer': 2, 'relation': 20, 'human': 3, 'measurement': 18, 'unordered': 25, 'binary': 21, 'abc': 0, 'ordering': 31, 'graph': 26, 'system': 10, 'machine': 6, 'quasi': 32, 'random': 23, 'paths': 28, 'error': 17, 'trees': 24, 'lab': 5, 'applications': 1, 'management': 14, 'user': 12, 'interface': 4, 'intersection': 27, 'response': 8, 'perceived': 19, 'widths': 34, 'well': 33, 'eps': 13, 'survey': 9, 'time': 11, 'opinion': 7}
2014-05-25 02:58:50,486 : INFO : keeping 12 tokens which were in no less than 2 and no more than 4 (=50.0%) documents
2014-05-25 02:58:50,486 : INFO : resulting dictionary: Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : ############### DICTIONARY (post-filter)###############
2014-05-25 02:58:50,486 : INFO : Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : {'minors': 0, 'graph': 1, 'system': 2, 'trees': 3, 'eps': 4, 'computer': 5, 'survey': 6, 'user': 7, 'human': 8, 'time': 9, 'interface': 10, 'response': 11}
2014-05-25 02:58:50,486 : INFO : collecting document frequencies
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,486 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,486 : INFO : PROGRESS: processing document #0
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,486 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,488 : INFO : calculating IDF weights for 9 documents and 34 features (51 matrix non-zeros)
2014-05-25 02:58:50,488 : INFO : ############### corpus_tfidf ###############
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,488 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : ############### lda ###############
2014-05-25 02:58:50,489 : INFO : using symmetric alpha at 0.5
2014-05-25 02:58:50,489 : INFO : using serial LDA version on this node
2014-05-25 02:58:50,489 : WARNING : input corpus stream has no len(); counting documents
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,489 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,489 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,491 : INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50 with a convergence threshold of 0
2014-05-25 02:58:50,491 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,491 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)

This is caused by using a corpus and a dictionary that don't share the same id-to-word mapping. It can happen if you prune your dictionary and call dictionary.compactify() at the wrong time.

A simple example will make it clear. Let's make a dictionary:

from gensim.corpora.dictionary import Dictionary
documents = [
    ['here', 'is', 'one', 'document'],
    ['here', 'is', 'another', 'document'],
]
dictionary = Dictionary()
dictionary.add_documents(documents)

This dictionary now has entries for these words and maps them to integer ids. It's useful to turn documents into vectors of (id, count) tuples, which we'd want to do before passing them into a model:

vectorized_corpus = [dictionary.doc2bow(doc) for doc in documents]

Sometimes you'll want to alter your dictionary. For example, you might want to remove very rare or very common words:

dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
dictionary.compactify()

Removing words creates gaps in the dictionary, but calling dictionary.compactify() reassigns ids to fill in the gaps. That means our vectorized_corpus from above no longer uses the same ids as the dictionary, and if we pass it into a model, we'll get an IndexError.
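To see the mismatch concretely, here is a minimal sketch in plain Python. It does not use gensim at all: build_ids, to_bow and the three toy documents are hypothetical stand-ins I made up to mimic, roughly, what a dictionary does when it assigns ids, prunes rare tokens and then compacts.

```python
def build_ids(docs):
    """Assign integer ids in first-seen order (toy stand-in for a dictionary)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Turn a token list into sorted (id, count) pairs, skipping unknown tokens."""
    counts = {}
    for token in doc:
        if token in token2id:
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


docs = [
    ["lab", "computer", "human"],
    ["computer", "user", "system"],
    ["user", "system", "human"],
]

# 1. Vectorize with the original ids (this is the part done too early).
token2id = build_ids(docs)
stale_corpus = [to_bow(doc, token2id) for doc in docs]

# 2. Prune tokens seen in fewer than two documents, then renumber from 0.
doc_freq = {}
for doc in docs:
    for token in set(doc):
        doc_freq[token] = doc_freq.get(token, 0) + 1
kept = sorted((t for t in token2id if doc_freq[t] >= 2), key=token2id.get)
compacted = {token: new_id for new_id, token in enumerate(kept)}

# 3. The old vectors still carry pre-compaction ids.
vocab_size = len(compacted)
stale_max = max(token_id for doc in stale_corpus for token_id, _ in doc)

# 4. Vectors rebuilt after compaction stay inside the vocabulary.
fresh_corpus = [to_bow(doc, compacted) for doc in docs]
fresh_max = max(token_id for doc in fresh_corpus for token_id, _ in doc)

print("vocabulary size:", vocab_size)
print("largest stale id:", stale_max)
print("largest fresh id:", fresh_max)
```

Any model that sizes its term axis from the compacted vocabulary will be indexed out of range by the stale vectors, which is the same shape of failure as "index 7 is out of bounds for axis 1 with size 7" in the traceback.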

Solution: build your vector representation from the dictionary after making changes and calling dictionary.compactify()!
