
LDA Gensim Word -> Topic Ids Distribution instead of Topic -> Word Distribution

I am trying to implement the Topic Tiling algorithm on my trained LDA model. For the algorithm, I need all of the topic IDs that are assigned to a single word in an unseen document. I will then find the most frequent topic ID for that word and assign it as the mode of that word.

I am using the gensim library, so it is very easy to get the topic -> word distribution, where the words are given with their probabilities. However, how do I get the topic(s) that are or were assigned to a single word, meaning the word -> topic distribution?

Example:
s = "Banks are closed on Sunday"

Topic -> Word Dist from Gensim:
TopicTag -> Prob*Word
Topic 0 -> 0.3*Bank, 0.2*are
Topic 1 -> 0.2*closed, 0.1*Sunday
Topic 2 -> 0.4*Sunday, 0.3*on

What I want:
word -> TopicTag(number of times the word was assigned that topic tag)
Banks -> Topic1(2), Topic2(2)
Closed -> Topic0(1),Topic1 (4)
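
The counting step itself can be sketched as below, assuming the per-word topic assignments are already available as (word, topic_id) pairs. How to obtain those pairs from the model is the open question here, so the `assignments` list is made-up data, and the helper names `topic_counts_per_word` and `mode_topic` are hypothetical.

```python
from collections import Counter, defaultdict


def topic_counts_per_word(assignments):
    """Count how often each topic id was assigned to each word."""
    counts = defaultdict(Counter)
    for word, topic_id in assignments:
        counts[word][topic_id] += 1
    return counts


def mode_topic(counter):
    """Return the most frequent topic id; ties go to the smallest id."""
    return min(counter, key=lambda t: (-counter[t], t))


# Hypothetical assignments, e.g. collected over several inference runs.
assignments = [
    ("closed", 1), ("closed", 1), ("closed", 0), ("closed", 1), ("closed", 1),
    ("banks", 1), ("banks", 2), ("banks", 1), ("banks", 2),
]

counts = topic_counts_per_word(assignments)
modes = {word: mode_topic(c) for word, c in counts.items()}
```

The tie-breaking rule (smallest topic id wins) is an arbitrary choice for this sketch; the algorithm may call for a different one.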

Please also note that I am not interested in parsing the Topic -> Word Dist results from Gensim. I am interested in finding an accurate way to obtain the (possibly several) topics that my model assigns to each individual word in an unseen document.

Thanks in advance.

You can get the matrix of topic-word weights (one row per topic, one column per word) from lda_model.state.get_lambda(). See also this mailing list thread: https://groups.google.com/d/msg/gensim/6N9-Y5KVQu0/soFqkEopMWgJ

I am also interested in knowing the answer. However, you can get the Topic -> Word Dist without parsing as follows:

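# ldavar is a trained gensim LdaModel
y = ldavar.state.get_lambda()  # shape: (num_topics, num_terms)
for i in range(y.shape[0]):
    y[i] = y[i] / y[i].sum()  # normalize each topic row so it sums to 1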

Now each row of y gives you the word distribution for one topic.
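
One way to turn such a topic -> word matrix into a word -> topic distribution is to normalize each column instead of each row. The sketch below does this in plain Python on toy numbers. It assumes a uniform prior over topics, which a trained model will generally not have, so treat the result as a rough approximation rather than the model's actual per-word assignment. The function name `word_topic_distribution`, the vocabulary, and the probabilities are all hypothetical.

```python
def word_topic_distribution(topic_word, vocab):
    """Invert P(word | topic) rows into P(topic | word).

    Assumes every topic is equally likely a priori (uniform topic prior).
    """
    num_topics = len(topic_word)
    result = {}
    for j, word in enumerate(vocab):
        column = [topic_word[k][j] for k in range(num_topics)]
        total = sum(column)
        # A word with zero mass in every topic gets no distribution.
        result[word] = [p / total for p in column] if total > 0 else None
    return result


# Toy data: each row is one topic's word distribution and sums to 1.
vocab = ["bank", "are", "closed", "on", "sunday"]
topic_word = [
    [0.50, 0.30, 0.10, 0.05, 0.05],  # topic 0
    [0.10, 0.10, 0.50, 0.10, 0.20],  # topic 1
    [0.05, 0.05, 0.10, 0.30, 0.50],  # topic 2
]

dist = word_topic_distribution(topic_word, vocab)
# Index of the most probable topic for each word.
best_topic = {w: max(range(len(p)), key=p.__getitem__) for w, p in dist.items()}
```

With a real model, the rows would come from the normalized y above and the column order from the model's dictionary. Note that this still describes the word in general, not the word in the context of a particular unseen document.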
