Automatic Topic Labeling Evaluation Metric
I am working on a topic labeling problem on a large dataset of research papers. The idea is that I can give each paper a few relevant labels.
I have two questions.
I know you can do topic modeling in a variety of ways, such as LDA and NMF, but how can you then extract possible labels from those topics?
Also, assuming I have extracted a bunch of labels, how can I mathematically estimate their accuracy? Is there some kind of metric that can determine, say, the variance of the information explained by a label in a document, or something along those lines? How would I evaluate my labels without a large group of humans doing qualitative analysis?
The simplest way is to use the top k words of each topic as its labels. More complicated methods include candidate label generation and candidate label ranking.
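As a minimal sketch of the top-k approach, assuming you already have a topic-word weight matrix (one row per topic, one column per vocabulary word) from whatever model you fit: the function name `top_k_labels` and the toy data are made up for illustration. With scikit-learn, for instance, the `components_` attribute of a fitted LDA or NMF model would play the role of `topic_word_weights`.

```python
def top_k_labels(topic_word_weights, vocabulary, k=3):
    """Return the k highest-weighted words for each topic."""
    labels = []
    for weights in topic_word_weights:
        # Sort word indices by descending weight; ties keep vocabulary order.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        labels.append([vocabulary[i] for i in ranked[:k]])
    return labels


# Hypothetical toy data: 2 topics over a 5-word vocabulary.
vocabulary = ["neural", "network", "protein", "gene", "training"]
topic_word_weights = [
    [0.40, 0.30, 0.01, 0.02, 0.27],
    [0.02, 0.03, 0.50, 0.40, 0.05],
]

labels = top_k_labels(topic_word_weights, vocabulary, k=2)
print(labels)  # [['neural', 'network'], ['protein', 'gene']]
```

To label a paper rather than a topic, one option is to take the labels of the topics that dominate that paper's topic distribution.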
Many related papers discuss this topic:
All the above papers have sections discussing how to evaluate the labels.