
Short text in the context of topic modeling

Original post: 2020-06-09 10:29:10 | Tags: python-3.x, nlp, lda, topic-modeling, nmf

I am working on topic modeling and I am curious what exactly would be short text under this context? For example, if there is a research paper, would the research paper's title and abstract be considered as short text?

1 Answer

I am working on topic modeling and I am curious what exactly would be short text under this context?

The recent survey paper on short text topic modeling (by Qiang et al.) mentions several datasets on which such models are evaluated: search snippets, StackOverflow question titles, tweets, and some others. The documents in these datasets have 5-14 words on average, and 14-37 words at maximum.
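To see where your own corpus falls relative to those figures, you can measure its average and maximum document length. This is a minimal sketch, assuming plain whitespace tokenization (the survey's datasets may have been tokenized differently); `length_stats` and the sample titles are hypothetical, made up for illustration.

```python
from statistics import mean


def length_stats(docs):
    """Return (average, maximum) whitespace-token counts for a list of strings."""
    lengths = [len(doc.split()) for doc in docs]
    return mean(lengths), max(lengths)


# Hypothetical sample documents, invented for illustration only.
titles = [
    "How to merge two dictionaries in Python",
    "Topic modeling for short text",
]

avg_len, max_len = length_stats(titles)
print(f"average={avg_len:.1f} words, max={max_len} words")
```

Running the same function over your titles and over your abstracts separately gives a quick way to compare each against the 5-14 average and 14-37 maximum quoted above.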

For example, if there is a research paper, would the research paper's title and abstract be considered as short text?

Paper abstracts may be considerably longer than that. An abstract usually has 200 or 300 words, or even more.

A second point worth mentioning is that some short text topic modeling techniques assume that each text has exactly one topic (for example, the model in the paper by Yin & Wang). I think an abstract may well contain several topics, so models that assume one topic per document may perform badly on paper abstracts.
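To illustrate the one-topic-per-document assumption, here is a toy sketch of a mixture of unigrams with a uniform prior over topics. It is not the algorithm from the Yin & Wang paper, only the simplest model that shares the assumption. The topic names, the word probabilities and the function `single_topic_posterior` are all hypothetical.

```python
import math

# Hypothetical topic-word probabilities, invented for illustration only.
TOPICS = {
    "nlp": {"topic": 0.4, "text": 0.4, "protein": 0.1, "gene": 0.1},
    "bio": {"topic": 0.1, "text": 0.1, "protein": 0.4, "gene": 0.4},
}


def single_topic_posterior(words, topics=TOPICS):
    """Posterior over topics when the whole document is drawn from ONE topic
    (mixture of unigrams, uniform prior over topics)."""
    log_scores = {
        name: sum(math.log(dist[w]) for w in words)
        for name, dist in topics.items()
    }
    # Subtract the largest log score before exponentiating, for stability.
    top = max(log_scores.values())
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


short_doc = ["topic", "text", "text"]                     # one theme
mixed_doc = ["topic", "text", "text", "protein", "gene"]  # two themes

print(single_topic_posterior(short_doc))
print(single_topic_posterior(mixed_doc))
```

With these made-up numbers, the short document gets about 0.98 probability of belonging to "nlp", and the mixed document about 0.80. Note that the 0.80 is the model's confidence that the whole document came from "nlp", not a statement that 80% of it is about NLP. A model of this kind has no way to say the document covers both themes, which is the concern with abstracts.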