Python LDA Gensim model with over 20 topics does not print properly

Using the Gensim package (both LDA and Mallet), I noticed that when I create a model with more than 20 topics and call print_topics, it prints at most 20 topics (not the first 20, but an arbitrary 20), and they are out of order.

So my question is: how do I get all of the topics to print? I am unsure whether this is a bug or an issue on my end. I have looked back at my library of LDA models (over 5000, from different data sources), and this happens in every model with more than 20 topics.

Below is sample code with its output. In the output, the topics are not in order (they should be), and some topics, such as topic 3, are missing.

from pprint import pprint

import gensim

# jr_dict_corpus (bag-of-words corpus) and jr_dict (id-to-word dictionary) are built earlier
lda_model = gensim.models.ldamodel.LdaModel(corpus=jr_dict_corpus,
                                           id2word=jr_dict,
                                           num_topics=25,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

pprint(lda_model.print_topics())
# Note: if the model contained 20 topics, the topics would be listed in order, 0-19.
[(21,
  '0.001*"commitment" + 0.001*"study" + 0.001*"evolve" + 0.001*"outlook" + '
  '0.001*"value" + 0.001*"people" + 0.001*"individual" + 0.001*"client" + '
  '0.001*"structure" + 0.001*"proposal"'),
 (18,
  '0.001*"self" + 0.001*"insurance" + 0.001*"need" + 0.001*"trend" + '
  '0.001*"statistic" + 0.001*"propose" + 0.001*"analysis" + 0.001*"perform" + '
  '0.001*"impact" + 0.001*"awareness"'),
 (2,
  '0.001*"link" + 0.001*"task" + 0.001*"collegiate" + 0.001*"universitie" + '
  '0.001*"banking" + 0.001*"origination" + 0.001*"security" + 0.001*"standard" '
  '+ 0.001*"qualifications_bachelor" + 0.001*"greenfield"'),
 (11,
  '0.024*"collegiate" + 0.016*"interpersonal" + 0.016*"prepare" + '
  '0.016*"invite" + 0.016*"aspect" + 0.016*"college" + 0.016*"statistic" + '
  '0.016*"continent" + 0.016*"structure" + 0.016*"project"'),
 (10,
  '0.049*"enjoy" + 0.049*"ambiguity" + 0.017*"accordance" + 0.017*"liberalize" '
  '+ 0.017*"developing" + 0.017*"application" + 0.017*"vacancie" + '
  '0.017*"service" + 0.017*"initiative" + 0.017*"discontinuing"'),
 (20,
  '0.028*"negotiation" + 0.028*"desk" + 0.018*"enhance" + 0.018*"engage" + '
  '0.018*"discussion" + 0.018*"ability" + 0.018*"depth" + 0.018*"derive" + '
  '0.018*"enjoy" + 0.018*"balance"'),
 (12,
  '0.036*"individual" + 0.024*"validate" + 0.018*"greenfield" + '
  '0.018*"capability" + 0.018*"coordinate" + 0.018*"create" + '
  '0.018*"programming" + 0.018*"safety" + 0.010*"evaluation" + '
  '0.002*"reliability"'),
 (1,
  '0.028*"negotiation" + 0.021*"responsibility" + 0.014*"master" + '
  '0.014*"mind" + 0.014*"experience" + 0.014*"worker" + 0.014*"ability" + '
  '0.007*"summary" + 0.007*"proposal" + 0.007*"alert"'),
 (23,
  '0.043*"banking" + 0.026*"origination" + 0.026*"round" + 0.026*"credibility" '
  '+ 0.026*"entity" + 0.018*"standard" + 0.017*"range" + 0.017*"pension" + '
  '0.017*"adapt" + 0.017*"information"'),
 (13,
  '0.034*"priority" + 0.034*"reconciliation" + 0.034*"purchaser" + '
  '0.023*"reporting" + 0.023*"offer" + 0.023*"investor" + 0.023*"share" + '
  '0.023*"region" + 0.023*"service" + 0.023*"manipulate"'),
 (22,
  '0.017*"analyst" + 0.017*"modelling" + 0.016*"producer" + 0.016*"return" + '
  '0.016*"self" + 0.009*"scope" + 0.008*"mind" + 0.008*"need" + 0.008*"detail" '
  '+ 0.008*"statistic"'),
 (9,
  '0.021*"decision" + 0.014*"invite" + 0.014*"balance" + 0.014*"commercialize" '
  '+ 0.014*"transform" + 0.014*"manage" + 0.014*"optionality" + '
  '0.014*"problem_solving" + 0.014*"fuel" + 0.014*"stay"'),
 (7,
  '0.032*"commitment" + 0.032*"study" + 0.016*"impact" + 0.016*"outlook" + '
  '0.011*"operation" + 0.011*"expand" + 0.011*"exchange" + 0.011*"management" '
  '+ 0.011*"conde" + 0.011*"evolve"'),
 (15,
  '0.032*"agility" + 0.019*"feasibility" + 0.019*"self" + 0.014*"deploy" + '
  '0.014*"define" + 0.013*"investment" + 0.013*"option" + 0.013*"control" + '
  '0.013*"action" + 0.013*"incubation"'),
 (5,
  '0.020*"desk" + 0.018*"agility" + 0.016*"vender" + 0.016*"coordinate" + '
  '0.016*"committee" + 0.012*"acquisition" + 0.012*"target" + '
  '0.012*"counterparty" + 0.012*"approval" + 0.012*"trend"'),
 (17,
  '0.022*"option" + 0.017*"working" + 0.017*"niche" + 0.011*"business" + '
  '0.011*"constrain" + 0.011*"meeting" + 0.011*"correspond" + 0.011*"exposure" '
  '+ 0.011*"element" + 0.011*"face"'),
 (0,
  '0.025*"expertise" + 0.025*"banking" + 0.021*"universitie" + '
  '0.017*"spreadsheet" + 0.013*"negotiation" + 0.013*"shipment" + '
  '0.013*"arise" + 0.013*"billing" + 0.013*"assistance" + 0.013*"sector"'),
 (4,
  '0.024*"provide" + 0.017*"consider" + 0.017*"allow" + 0.015*"outlook" + '
  '0.015*"value" + 0.015*"contract" + 0.012*"study" + 0.012*"technology" + '
  '0.012*"scenario" + 0.012*"indicator"'),
 (6,
  '0.058*"impulse" + 0.027*"shall" + 0.027*"shape" + 0.024*"marketer" + '
  '0.017*"availability" + 0.014*"determine" + 0.014*"load" + '
  '0.014*"constantly_change" + 0.014*"instrument" + 0.014*"interface"'),
 (19,
  '0.042*"task" + 0.038*"tariff" + 0.038*"recommend" + 0.024*"example" + '
  '0.023*"future" + 0.021*"people" + 0.021*"math" + 0.021*"capacity" + '
  '0.021*"spirit" + 0.020*"price"')]

Same model as above, but with 20 topics. As you can see, the output is ordered by topic number and contains all of the topics.

lda_model = gensim.models.ldamodel.LdaModel(corpus=jr_dict_corpus,
                                           id2word=jr_dict,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

pprint(lda_model.print_topics())

[(0,
  '0.031*"enjoy" + 0.031*"ambiguity" + 0.028*"accordance" + 0.016*"statistic" '
  '+ 0.016*"initiative" + 0.016*"service" + 0.016*"liberalize" + '
  '0.016*"application" + 0.011*"community" + 0.011*"identifie"'),
 (1,
  '0.016*"transformation" + 0.016*"negotiation" + 0.016*"community" + '
  '0.016*"clock" + 0.011*"marketer" + 0.011*"desk" + 0.011*"mandate" + '
  '0.011*"closing" + 0.011*"initiative" + 0.011*"experience"'),
 (2,
  '0.026*"priority" + 0.026*"reconciliation" + 0.026*"purchaser" + '
  '0.020*"safety" + 0.020*"region" + 0.020*"query" + 0.020*"share" + '
  '0.020*"manipulate" + 0.020*"ibex" + 0.020*"investor"'),
 (3,
  '0.022*"improve" + 0.021*"committee" + 0.021*"affect" + 0.012*"target" + '
  '0.012*"acquisition" + 0.011*"basis" + 0.011*"profitability" + '
  '0.011*"economic" + 0.011*"natural" + 0.011*"profit"'),
 (4,
  '0.024*"provide" + 0.019*"value" + 0.017*"consider" + 0.017*"allow" + '
  '0.015*"scenario" + 0.015*"outlook" + 0.015*"contract" + 0.014*"forecast" + '
  '0.014*"decision" + 0.012*"indicator"'),
 (5,
  '0.037*"desk" + 0.030*"coordinate" + 0.030*"agility" + 0.030*"vender" + '
  '0.023*"counterparty" + 0.023*"immature_emerge" + 0.023*"metric" + '
  '0.022*"approval" + 0.015*"maximization" + 0.015*"undergraduate"'),
 (6,
  '0.053*"impulse" + 0.025*"shall" + 0.025*"shape" + 0.018*"availability" + '
  '0.018*"marketer" + 0.012*"determine" + 0.012*"language" + '
  '0.012*"monitoring" + 0.012*"integration" + 0.012*"month"'),
 (7,
  '0.026*"commitment" + 0.026*"study" + 0.013*"impact" + 0.013*"outlook" + '
  '0.009*"operation" + 0.009*"management" + 0.009*"expand" + 0.009*"exchange" '
  '+ 0.009*"conde" + 0.009*"balance"'),
 (8,
  '0.057*"insurance" + 0.029*"propose" + 0.028*"rule" + 0.026*"self" + '
  '0.023*"product" + 0.023*"asset" + 0.023*"pricing" + 0.023*"amount" + '
  '0.023*"result" + 0.020*"liquidity"'),
 (9,
  '0.012*"universitie" + 0.012*"need" + 0.012*"statistic" + 0.012*"trend" + '
  '0.008*"invite" + 0.008*"commercialize" + 0.008*"transform" + 0.008*"manage" '
  '+ 0.008*"problem_solving" + 0.008*"optionality"'),
 (10,
  '0.024*"background" + 0.024*"curve" + 0.020*"allow" + 0.019*"collect" + '
  '0.019*"basis" + 0.017*"accordance" + 0.013*"improve" + 0.013*"datum" + '
  '0.013*"component" + 0.013*"reliability"'),
 (11,
  '0.054*"task" + 0.049*"tariff" + 0.049*"recommend" + 0.031*"future" + '
  '0.027*"spirit" + 0.027*"capacity" + 0.027*"math" + 0.022*"ensure" + '
  '0.022*"profit" + 0.022*"variable_margin"'),
 (12,
  '0.001*"impulse" + 0.001*"availability" + 0.001*"reliability" + '
  '0.001*"shall" + 0.001*"component" + 0.001*"agent" + 0.001*"marketer" + '
  '0.001*"shape" + 0.001*"assisting" + 0.001*"supply"'),
 (13,
  '0.021*"region" + 0.016*"greenfield" + 0.016*"collegiate" + 0.011*"transfer" '
  '+ 0.011*"remuneration" + 0.011*"organization" + 0.011*"structure" + '
  '0.011*"continent" + 0.011*"project" + 0.011*"prepare"'),
 (14,
  '0.033*"originator" + 0.025*"vender" + 0.025*"expertise" + 0.025*"banking" + '
  '0.019*"evolve" + 0.017*"management" + 0.017*"market" + 0.017*"site" + '
  '0.012*"component" + 0.012*"discontinuing"'),
 (15,
  '0.027*"agility" + 0.022*"mind" + 0.022*"negotiation" + 0.011*"deploy" + '
  '0.011*"define" + 0.011*"ecosystem" + 0.011*"control" + 0.011*"lead" + '
  '0.011*"industry" + 0.011*"option"'),
 (16,
  '0.001*"region" + 0.001*"master" + 0.001*"orginiation" + 0.001*"greenfield" '
  '+ 0.001*"agent" + 0.001*"identifie" + 0.001*"remuneration" + 0.001*"mark" + '
  '0.001*"reviewing" + 0.001*"closing"'),
 (17,
  '0.030*"banking" + 0.018*"option" + 0.018*"round" + 0.018*"credibility" + '
  '0.018*"origination" + 0.018*"entity" + 0.016*"working" + 0.015*"niche" + '
  '0.015*"standard" + 0.012*"coordinate"'),
 (18,
  '0.027*"negotiation" + 0.018*"reporting" + 0.018*"perform" + 0.018*"world" + '
  '0.015*"offer" + 0.015*"manipulate" + 0.011*"query" + 0.010*"control" + '
  '0.010*"working" + 0.009*"self"'),
 (19,
  '0.047*"example" + 0.039*"people" + 0.039*"price" + 0.039*"excel" + '
  '0.039*"excellent" + 0.038*"base" + 0.031*"office" + 0.031*"optimizing" + '
  '0.031*"participate" + 0.031*"package"')]

print_topics shows 20 topics by default. You must pass the num_topics parameter to include more than 20 topics...

print(lda_model.print_topics(num_topics=25, num_words=10))
