How to test a text clustering application?
I am developing an application to cluster documents according to their topics. I am using the LDA (Latent Dirichlet Allocation) algorithm. The prototype is now ready and has produced some results.
I am looking for a reasonable way to test it. My current approach is to print out each topic together with some of its related documents and evaluate them manually. I can think of the following test points:
Is there a best practice for this? Is there an objective metric I could use instead of my subjective evaluation?
1. After training, we get the topic-word matrix P(z|w) (more commonly written P(w|z)). Each row holds the probabilities of the words under one topic, so you can print the top N words of every topic and evaluate them. This is easier than evaluating topics against documents.
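A minimal sketch of the top-N listing described in point 1. The function name `top_words`, the tiny vocabulary, and the probabilities are made up for illustration; in practice the matrix would come from whatever LDA implementation you trained.

```python
def top_words(topic_word, vocab, n=5):
    """Return the n highest-probability words for each topic (one row per topic)."""
    result = []
    for row in topic_word:
        # Rank word indices by their probability under this topic, highest first.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Hypothetical toy data: 2 topics over a 6-word vocabulary.
vocab = ["gene", "dna", "cell", "stock", "market", "price"]
topic_word = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],
    [0.02, 0.03, 0.05, 0.35, 0.30, 0.25],
]

for k, words in enumerate(top_words(topic_word, vocab, n=3)):
    print(f"topic {k}: {' '.join(words)}")
```

Reading a short word list per topic is usually quicker than reading whole documents, which is why this is a convenient first sanity check.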
2. I think the underlying question here is whether training has converged. I simply monitor P(z|w): when it is stable, the model has converged for the parameters (alpha, beta, topic_num) we chose. When tuning the number of topics, we can obtain a stable P(z|w) for each candidate topic_num and then choose the topic_num that gives the maximum P(z|w). You can refer to the paper http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf
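The stability check in point 2 could be sketched roughly as follows. This assumes you can snapshot the topic-word matrix between training iterations and that topic indices stay aligned within one run; the helper names, the tolerance, and the example scores are all hypothetical.

```python
def max_abs_change(prev, curr):
    """Largest element-wise change between two same-shaped matrices."""
    return max(
        abs(a - b)
        for prev_row, curr_row in zip(prev, curr)
        for a, b in zip(prev_row, curr_row)
    )


def has_converged(prev, curr, tol=1e-4):
    """Treat training as converged once no entry moves by more than tol."""
    return max_abs_change(prev, curr) < tol


def best_topic_num(scores):
    """Pick the topic_num with the highest score.

    scores maps topic_num -> a score recorded after that model converged
    (for example a log-likelihood); higher is assumed to be better.
    """
    return max(scores, key=scores.get)


# Hypothetical snapshots from two consecutive iterations.
before = [[0.50, 0.50], [0.20, 0.80]]
after = [[0.50001, 0.49999], [0.20002, 0.79998]]
print(has_converged(before, after))

# Hypothetical scores for three candidate topic counts.
print(best_topic_num({50: -1200.0, 100: -1100.0, 200: -1150.0}))
```

The threshold is a judgment call: a looser tolerance stops training earlier, a tighter one costs more iterations.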
3. As for how to tune alpha and beta, and an efficient way to tune topic_num, Hanna M. Wallach has done a lot of research on that. I simply do this by intuition, since the dataset is too large. http://people.cs.umass.edu/~wallach/
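If tuning by intuition is not enough, one commonly used objective score for comparing settings of alpha, beta, and topic_num is perplexity on held-out documents (lower is better). The sketch below is not taken from the answer above. It assumes you already have a document-topic matrix `theta` for the held-out documents and a topic-word matrix `phi`; both names and the toy numbers are illustrative.

```python
import math


def heldout_perplexity(docs, theta, phi):
    """Per-word perplexity of held-out documents.

    docs  : list of documents, each a list of word ids
    theta : theta[d][k] = probability of topic k in document d
    phi   : phi[k][w]   = probability of word w under topic k
    """
    log_lik = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Mixture probability of word w in document d.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Hypothetical toy model: 2 topics, 4-word vocabulary, both topics uniform.
docs = [[0, 1, 2, 3]]
theta = [[0.5, 0.5]]
phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
perplexity = heldout_perplexity(docs, theta, phi)
print(perplexity)
```

A model that spreads probability uniformly over the vocabulary scores a perplexity equal to the vocabulary size, so anything useful should land well below that.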