How to predict the topic of a new query using a trained LDA model using gensim?

I have trained a corpus for LDA topic modelling using gensim.

Going through the tutorial on the gensim website (this is not the whole code):

question = 'Changelog generation from Github issues?'

temp = question.lower()
for i in range(len(punctuation_string)):
    temp = temp.replace(punctuation_string[i], '')

words = re.findall(r'\w+', temp, flags=re.UNICODE)
important_words = list(filter(lambda x: x not in stoplist, words))
print(important_words)

dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = dictionary.doc2bow(important_words)
print(dictionary)
print(ques_vec)
print(lda[ques_vec])

This is the output that I get:

['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]

I don't know how the last output is going to help me find the possible topic for the question!

Please help!

I have written a function in Python that gives the possible topic for a new query:

def getTopicForQuery(question):
    # Pre-process the query: lower-case it and strip punctuation
    temp = question.lower()
    for i in range(len(punctuation_string)):
        temp = temp.replace(punctuation_string[i], '')

    words = re.findall(r'\w+', temp, flags=re.UNICODE)

    # Drop stop words
    important_words = list(filter(lambda x: x not in stoplist, words))

    # Load the dictionary built from our own corpus during training
    dictionary = corpora.Dictionary.load('questions.dict')

    # Convert the query to bag-of-words and infer its topic distribution
    ques_vec = dictionary.doc2bow(important_words)
    topic_vec = lda[ques_vec]

    # Sort the (topic_id, probability) pairs by probability, descending
    word_count_array = numpy.empty((len(topic_vec), 2), dtype=object)
    for i in range(len(topic_vec)):
        word_count_array[i, 0] = topic_vec[i][0]
        word_count_array[i, 1] = topic_vec[i][1]

    idx = numpy.argsort(word_count_array[:, 1])
    idx = idx[::-1]
    word_count_array = word_count_array[idx]

    # Take the single highest-weighted word of the most probable topic
    final = lda.print_topic(word_count_array[0, 0], 1)
    question_topic = final.split('*')  # format is like "probability*word"

    return question_topic[1]

Before going through this, do refer to this link!

In the initial part of the code, the query is pre-processed so that stop words and unnecessary punctuation can be stripped out.

Then, the dictionary that was built from our own corpus during training is loaded.

We then convert the tokens of the new query to bag-of-words, and the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec], where lda is the trained model as explained in the link referred to above.
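Each entry of topic_vec is a (topic_id, probability) pair, which is exactly the format of the last output shown in the question. As a minimal sketch, the dominant pair can also be picked out directly with max():

topic_vec = lda[ques_vec]
# Each entry is a (topic_id, probability) pair; max() over the
# probability element gives the dominant topic directly
best_topic, best_prob = max(topic_vec, key=lambda pair: pair[1])
print(best_topic, best_prob)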

Inside the function, the distribution is sorted by the probabilities of the topics, and the most probable topic, represented by its single highest-weighted word, is returned as question_topic[1].
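For completeness, a minimal usage sketch follows. The model file name questions.lda, the punctuation_string, and the stoplist below are assumptions; substitute whatever was actually used during training:

import re
import string

import numpy
from gensim import corpora, models

punctuation_string = string.punctuation           # assumed definition
stoplist = set('for a of the and to in'.split())  # assumed stop-word list

# 'questions.lda' is an assumed file name for the trained model
lda = models.LdaModel.load('questions.lda')

print(getTopicForQuery('Changelog generation from Github issues?'))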

Assuming we just need the topic with the highest probability, the following code snippet may be helpful:

def findTopic(testObj, dictionary):
    text_corpus = []
    '''
    For each query (document in the test file), tokenize the
    query, create a feature vector just like how it was done while
    training, and create text_corpus
    '''
    for query in testObj:
        # tokenize() is expected to yield (token, POS-tag) pairs
        temp_doc = tokenize(query.strip())
        current_doc = []

        # Keep only nouns that are not stop words
        for token, pos in temp_doc:
            if token not in stoplist and pos == 'NN':
                current_doc.append(token)

        text_corpus.append(current_doc)
    '''
    For each feature vector text, lda[doc_bow] gives the topic
    distribution, which can be sorted in descending order to print the
    very first topic
    '''
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        print(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)
        print(topics)
        print(topics[0][0])

The tokenize function removes punctuation and domain-specific characters to be filtered out, and gives the list of tokens. Here the dictionary created during training is passed as a parameter of the function, but it can also be loaded from a file.
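The answer does not show tokenize() itself; a plausible stand-in (an assumption, using NLTK's tokenizer and POS tagger) plus a small driver that loads the dictionary from a file might look like this, where the query strings are made up for illustration:

import nltk
from gensim import corpora

def tokenize(text):
    # Hypothetical tokenize(): returns (token, POS-tag) pairs, which is
    # what findTopic() above expects
    return nltk.pos_tag(nltk.word_tokenize(text.lower()))

dictionary = corpora.Dictionary.load('questions.dict')
test_queries = ['Changelog generation from Github issues?']
findTopic(test_queries, dictionary)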

Basically, Anjmesh Pandey suggested a good example code. However, the first word with the highest probability in a topic may not solely represent that topic, because in some cases clustered topics share their most frequently occurring words with other topics, even at the top. Therefore it is enough to return the index of the topic that is most likely to be close to the query:

topic_id = sorted(lda[ques_vec], key=lambda pair: pair[1], reverse=True)[0][0]

The transformation of ques_vec gives you a score per topic; you can then try to understand what an unlabeled topic is about by checking the words that contribute most to it:

# show_topic() returns (word, probability) pairs in current gensim
latent_topic_words = [word for word, score in lda.show_topic(topic_id)]

The show_topic() method returns a list of (word, score) tuples, sorted by each word's contribution to the topic in descending order, and we can roughly understand the latent topic by checking those words and their weights.
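For example, a short sketch to print those words with their weights (topn=10 is show_topic()'s default):

# Inspect the inferred topic's top words and weights
for word, prob in lda.show_topic(topic_id, topn=10):
    print('%.4f  %s' % (prob, word))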
