
Automatic Topic Labeling Evaluation Metric

Original post: 2020-03-28 00:25:50 | Tags: python, nlp, topic-modeling

I am working on a topic labeling problem over a large dataset of research papers. The idea is to give each paper a few relevant labels.

I have two questions.

I know you can do topic modeling in a variety of ways, such as LDA and NMF, but how do you then extract candidate labels from the resulting topics?

Also, assuming I have extracted a set of labels, how can I estimate their accuracy mathematically? Is there a metric that measures, say, how much of the information in a document a label explains, or something along those lines? How would I evaluate my labels without a large group of humans doing qualitative analysis?

1 Answer

The simplest way is to use the top k words of each topic as its labels. More sophisticated methods work in two stages: candidate label generation followed by candidate label ranking. Several papers address this topic:

Aletras, Nikolaos, and Mark Stevenson. "Labelling topics using unsupervised graph-based methods." ACL, 2014.
Bhatia, Shraey, Jey Han Lau, and Timothy Baldwin. "Automatic labelling of topics with neural embeddings." COLING, 2016.
Hingmire, Swapnil, et al. "Document classification by topic labeling." SIGIR, 2013.
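
As a rough illustration of the simplest approach mentioned above (top k words per topic), here is a minimal sketch. It assumes you already have a topic-word weight matrix and the matching vocabulary. The function name `top_k_labels` and the toy data are hypothetical and only there for illustration. It does not implement the candidate generation and ranking methods from the papers listed.

```python
from typing import List, Sequence


def top_k_labels(
    topic_word_weights: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
    k: int = 3,
) -> List[List[str]]:
    """Return the k highest-weighted words of each topic as its labels."""
    labels = []
    for weights in topic_word_weights:
        if len(weights) != len(vocabulary):
            raise ValueError("each topic needs one weight per vocabulary word")
        # Rank word indices by descending weight; ties keep vocabulary order
        # because Python's sort is stable.
        ranked = sorted(range(len(weights)), key=lambda i: -weights[i])
        labels.append([vocabulary[i] for i in ranked[:k]])
    return labels


# Hypothetical toy data: 2 topics over a 6-word vocabulary.
vocabulary = ["neural", "network", "protein", "gene", "training", "cell"]
topic_word_weights = [
    [0.40, 0.30, 0.01, 0.02, 0.25, 0.02],  # a machine-learning-like topic
    [0.02, 0.03, 0.35, 0.30, 0.05, 0.25],  # a biology-like topic
]

labels = top_k_labels(topic_word_weights, vocabulary, k=3)
print(labels)
```

If you fit LDA or NMF with scikit-learn, the rows of the fitted model's `components_` attribute should be usable as `topic_word_weights`, and the vectorizer's `get_feature_names_out()` as `vocabulary`. To label a paper rather than a topic, one option is to take the labels of its highest-weighted topics.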