Text classification and topic modeling

Question

For a large set of articles, I want to build a topic model that assigns a weight to each topic and, within each topic, a weight to each sub-topic. For example, if I feed in an article that falls in both the Business and Technology domains, the program's output should look something like this:

0.593 Business (0.438 - Marketing, 0.375 - Companies, 0.062 - Office Work)
0.148 Technology (0.500 - Technology by type, 0.250 - High_technology Business Districts, 0.250 - Technology Companies)
0.111 Society (0.333 - Organizations, 0.333 - Technology in Society, 0.333 - Labor)

What are the best open-source language-processing programs that can do this?

Answer 1

You can use the open-source NLTK toolkit for classification.
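As a rough illustration of what classification with NLTK could look like, here is a minimal sketch using its Naive Bayes classifier. The training texts, the labels, and the helper `features` are made up for this example; a real setup would need far more labeled articles and better features.

```python
import nltk

def features(text):
    # Hypothetical bag-of-words features: every lowercase word present maps to True.
    return {word: True for word in text.lower().split()}

# Tiny made-up training set of (text, label) pairs.
train = [
    ("marketing sales companies revenue", "Business"),
    ("companies office marketing budget", "Business"),
    ("software hardware computing chips", "Technology"),
    ("software companies build computing devices", "Technology"),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)

article = "marketing budget revenue"
label = classifier.classify(features(article))

# prob_classify returns a probability distribution over the labels,
# which gives a per-label weight rather than a single hard decision.
dist = classifier.prob_classify(features(article))
probs = {name: dist.prob(name) for name in dist.samples()}
print(label, probs)
```

The per-label probabilities are the closest thing here to the top-level weights in the question; sub-topic weights would need a second classifier per label or a topic model.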

Answer 2

I would give NLTK a try, but scikit-learn is probably a better bet, even though it has a steeper learning curve than NLTK. It's much more configurable.

http://scikit-learn.org/stable/documentation.html
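For the topic-modeling side, one possible route in scikit-learn is Latent Dirichlet Allocation. The sketch below is only an illustration under assumptions: the four documents are invented, and the number of topics (2) is picked by hand. It produces unlabeled, numbered topics plus a per-document weight for each topic.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus; a real run would use the full article set.
docs = [
    "marketing budget for companies and office sales",
    "companies report revenue from marketing campaigns",
    "software chips and hardware for computing devices",
    "technology companies build software and hardware",
]

# LDA works on raw term counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# n_components is the number of topics, chosen by hand here.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # one row per document, one column per topic

# Recover the vocabulary in column order so term indices can be mapped to words.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for topic_id, term_weights in enumerate(lda.components_):
    top = term_weights.argsort()[::-1][:3]
    print("topic", topic_id, [terms[i] for i in top])

for doc_id, weights in enumerate(doc_topic):
    print("doc", doc_id, [round(float(w), 3) for w in weights])
```

Each row of `doc_topic` is a distribution over topics for one article. The topics come out as numbers with their top words, so naming them is still a manual step.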

Answer 3

Several programs handle part of this task; for a start, I recommend Mallet. Note that any topic modeling program gives you the topics in the form you want, i.e.,

 ( 0.438 - Marketing , 0.375 - Companies, 0.062 - Office Work)

but you need to assign the labels (in this example, Business) yourself. Mallet also gives you a decomposition of each text into topics (identified by numbers, not by labels).
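To make the labeling step concrete, here is a small pure-Python sketch of how numbered topic weights for one article might be rolled up into the nested format from the question. Everything in it is hypothetical: the weights in `doc_topics`, the hand-written `labels` mapping, and the function `summarize` are illustrations, not Mallet output or a Mallet API.

```python
from collections import defaultdict

# Hypothetical weights for one article, keyed by topic number,
# as a topic modeling tool might report them.
doc_topics = {0: 0.35, 1: 0.30, 2: 0.05, 3: 0.20, 4: 0.10}

# Labels assigned by hand after inspecting each topic's top words:
# topic number -> (topic, sub-topic).
labels = {
    0: ("Business", "Marketing"),
    1: ("Business", "Companies"),
    2: ("Business", "Office Work"),
    3: ("Technology", "Technology Companies"),
    4: ("Society", "Labor"),
}

def summarize(doc_topics, labels):
    """Group numbered topics under their hand-assigned parent label."""
    grouped = defaultdict(dict)
    for topic_id, weight in doc_topics.items():
        parent, child = labels[topic_id]
        grouped[parent][child] = grouped[parent].get(child, 0.0) + weight
    report = []
    for parent, children in grouped.items():
        total = sum(children.values())
        # Sub-topic weights are expressed as shares of the parent's weight.
        shares = {child: w / total for child, w in children.items()}
        report.append((parent, total, shares))
    # Heaviest parent topic first.
    report.sort(key=lambda item: item[1], reverse=True)
    return report

report = summarize(doc_topics, labels)
for parent, total, shares in report:
    parts = ", ".join(f"{share:.3f} - {child}" for child, share in shares.items())
    print(f"{total:.3f} {parent} ({parts})")
```

Normalizing sub-topic weights within each parent is a design choice made for this sketch; reporting the raw weights instead would work equally well.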

3 answers

Answer 1
0 2015-06-16 13:28:54

Answer 2
0 2015-06-16 14:56:27

Answer 3
0 2015-06-18 09:21:11
