
How to test a text clustering application?

Original post 2014-01-10 01:28:24 5 1 nlp/ data-mining/ text-mining

I am developing an application to cluster documents according to their topics, using the LDA (Latent Dirichlet Allocation) algorithm. The prototype is now ready and has produced some results.

I am looking for a reasonable way to test it. My current approach is to print out the topics together with some of their related documents and evaluate them manually. I can think of the following test points:

The documents within a topic are indeed on that topic.
The topics are substantially different from each other.

Is there a best practice for doing this? Is there an objective metric I can use instead of my subjective evaluation?

1 answer

1. After training, we get the topic-word matrix P(z|w) (more commonly written P(w|z)), in which each row holds the probabilities of the words under one topic. You can print the top N words of every topic and evaluate those, which is easier than evaluating topics by reading their documents.
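A minimal sketch of that top-N listing, assuming the trained model exposes the topic-word matrix as a list of rows (one per topic) aligned with a vocabulary list. The names top_words, topic_word and vocab, and the toy numbers, are hypothetical and not taken from any particular LDA library:

```python
def top_words(topic_word, vocab, n=10):
    """Return the n highest-probability words for each topic.

    topic_word: list of rows, one per topic; row[k] is the weight of vocab[k].
    """
    result = []
    for row in topic_word:
        # Sort word indices by descending probability; ties keep vocabulary order.
        ranked = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        result.append([vocab[k] for k in ranked[:n]])
    return result


# Hypothetical toy model: 2 topics over a 5-word vocabulary.
vocab = ["goal", "match", "team", "stock", "market"]
topic_word = [
    [0.40, 0.30, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.45, 0.35],
]

for t, words in enumerate(top_words(topic_word, vocab, n=3)):
    print(f"topic {t}: {' '.join(words)}")
```

Reading a short word list per topic is usually enough to tell whether a topic is coherent and whether two topics look like near-duplicates.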

2. I think what you are really asking is whether the training has converged. I simply monitor P(z|w): when it is stable, the model has converged for the parameters we chose (alpha, beta, topic_num). When tuning the number of topics, we obtain a stable P(z|w) for each topic_num and choose the topic_num that gives the maximum P(z|w). You can refer to the paper http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf
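The two checks in point 2 could be sketched roughly as follows. This assumes a single scalar score (for example a log-likelihood) is recorded during training as a stand-in for "P(z|w) is stable"; that substitution, the function names has_converged and best_topic_num, the window and tol parameters, and all the numbers are hypothetical:

```python
def has_converged(trace, window=3, tol=1e-3):
    """True if the last `window` changes in the trace are all below `tol` (relative)."""
    if len(trace) <= window:
        return False  # not enough history to judge stability
    recent = trace[-(window + 1):]
    return all(
        abs(b - a) <= tol * max(abs(a), 1e-12)
        for a, b in zip(recent, recent[1:])
    )


def best_topic_num(scores):
    """Pick the topic count whose converged score is highest."""
    return max(scores, key=scores.get)


# Hypothetical score trace recorded every few training iterations.
trace = [-9500.0, -9100.0, -9020.0, -9001.0, -9000.5, -9000.2, -9000.1]
print(has_converged(trace))

# Hypothetical converged scores for three candidate topic counts.
scores = {50: -9300.0, 100: -9000.1, 200: -9150.0}
print(best_topic_num(scores))
```

The relative tolerance keeps the stopping rule independent of corpus size; the right window and tol values depend on how noisy the training procedure is.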

3. As for how to tune alpha and beta, and an efficient way to tune topic_num, Hanna M. Wallach has done a lot of research on that. I simply do it by intuition, since the dataset is too large: http://people.cs.umass.edu/~wallach/