Gensim topic printing errors/issues

All,

This is a re-post of what I responded to over in this thread. I am getting some totally screwy results when trying to print LSI topics in gensim. Here is my code:

try:
    from gensim import corpora, models
except ImportError as err:
    print(err)

class LSI:
    def topics(self, corpus):
        tfidf = models.TfidfModel(corpus)
        corpus_tfidf = tfidf[corpus]
        dictionary = corpora.Dictionary(corpus)
        lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
        print(lsi.show_topics())

if __name__ == '__main__':
    data = '../data/data.txt'
    corpus = corpora.textcorpus.TextCorpus(data)
    LSI().topics(corpus)

This prints the following to the console.

-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)" + ......

I would like to be able to print out the topics like @2er0 did over here, but I am getting results like these. See the output above, and note that the second item printed in each term is a tuple; I have no idea where it came from. data.txt is a text file with several paragraphs in it. That is all.

Any thoughts on this would be fantastic! Adam

To answer why your LSI topics are tuples instead of words, check your input corpus.

Is it created from a list of documents that is converted into a corpus through corpus = [dictionary.doc2bow(text) for text in texts]?

Because if it isn't, and you just read it from a serialized corpus without reading a dictionary, then you won't get the words in the topic outputs.
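To make the point about the dictionary concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not gensim's implementation: build_vocab and to_bow are hypothetical helpers written only for illustration, and gensim may assign ids in a different order. The point is that a corpus only holds (id, count) pairs, so without the id-to-word mapping built from the tokenized texts there are no words to print. In the question the dictionary is built from the corpus itself rather than from tokenized text, which would explain labels like "(5, 1)".

```python
from collections import Counter

def build_vocab(texts):
    """Hypothetical helper: give each distinct token an integer id,
    in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Hypothetical helper: turn one tokenized document into
    sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

texts = [["graph", "minors", "a", "survey"],
         ["graph", "of", "paths", "in", "trees"]]

token2id = build_vocab(texts)
id2token = {i: t for t, i in token2id.items()}

# The corpus entry itself contains only ids and counts.
bow = to_bow(texts[1], token2id)

# The id -> token mapping is what turns those pairs back into words.
words = [(id2token[i], n) for i, n in bow]
print(bow)
print(words)
```

So the dictionary has to come from the tokenized documents, and the same dictionary has to be handed to the model as id2word.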

The code below works and prints out the topics as weighted words:

import gensim as gs

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gs.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = gs.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
lsi.print_topics()

# Print each topic on its own line.
for i in lsi.print_topics():
    print(i)

The above outputs:

-0.331*"system" + -0.329*"a" + -0.329*"survey" + -0.241*"user" + -0.234*"minors" + -0.217*"opinion" + -0.215*"eps" + -0.212*"graph" + -0.205*"response" + -0.205*"time"
-0.330*"minors" + 0.313*"eps" + 0.301*"system" + -0.288*"graph" + -0.274*"a" + -0.274*"survey" + 0.268*"management" + 0.262*"interface" + 0.208*"human" + 0.189*"engineering"
0.282*"trees" + 0.267*"the" + 0.236*"in" + 0.236*"paths" + 0.236*"intersection" + -0.233*"time" + -0.233*"response" + 0.202*"generation" + 0.202*"unordered" + 0.202*"binary"
-0.247*"generation" + -0.247*"unordered" + -0.247*"random" + -0.247*"binary" + 0.219*"minors" + -0.214*"the" + -0.214*"to" + -0.214*"error" + -0.214*"perceived" + -0.214*"relation"
0.333*"machine" + 0.333*"for" + 0.333*"lab" + 0.333*"abc" + 0.333*"applications" + 0.258*"computer" + -0.214*"system" + -0.194*"eps" + -0.191*"and" + -0.188*"testing"

It looks ugly, but this does the job (a purely string-based approach):

# x = lsi.show_topics()
# Hard-coded sample of one formatted topic string:
x = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'

y = []
for term in x.strip().split(" + "):
    weight, token = term.split("*")   # e.g. '-0.804' and '"(5, 1)"'
    first, second = token.split(",")  # e.g. '"(5' and ' 1)"'
    y.append((weight, (first.lstrip('"('), second.strip().rstrip(')"'))))

for i in y:
    print(i)

The above outputs:

('-0.804', ('5', '1'))
('-0.246', ('856', '1'))
('-0.227', ('145', '1'))
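If the chained split calls are hard to follow, the same string parsing can be sketched with a regular expression. This is only an alternative sketch, not part of gensim: parse_topic and TERM are hypothetical names, and the pattern assumes every term looks like weight*"label" with a plain decimal weight, as in the sample string above.

```python
import re

# Matches one term such as -0.804*"(5, 1)" or 0.333*"machine".
# Assumption: weights are plain decimals (no scientific notation).
TERM = re.compile(r'(-?\d+(?:\.\d+)?)\*"([^"]*)"')

def parse_topic(topic):
    """Hypothetical helper: return [(weight, label), ...] for one
    formatted topic string."""
    return [(float(weight), label) for weight, label in TERM.findall(topic)]

x = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'
pairs = parse_topic(x)

for weight, label in pairs:
    print(weight, label)
```

Unlike the split-based version, this keeps the weight as a float and leaves the label intact, so it also handles ordinary word labels such as "system".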

If that does not work, you can try lsi.print_topic(i) instead of lsi.show_topics():

for i in range(len(lsi.show_topics())):
    print(lsi.print_topic(i))
