
Simplest way/blackbox to suggest tags for short posts based on an existing (labeled) dataset?

Original post: 2016-12-04 21:23:07 | Tags: algorithm, machine-learning, text-analysis

We have comments of ~50–300 characters, pre-tagged with multiple topics such as “music” and “tech”, as well as particular films, artists, etc.

We want to train an algorithm to auto-tag future comments. We'll manually tweak suggestions to improve accuracy and manually add many more tags (e.g., new artists) over time. Posts will have one or many tags.

What's the simplest way to start this? I'm looking for something as simple as adding content with tag 1, tag 2, ..., training automatically, and then later giving it text to get back a list of suggested tags (preferably with a confidence %).

We will end up with thousands of tags, and potentially 100k+ posts.

I've played around with a few things (naive Bayes, LDA), but I feel there must be something simpler for such a common and simple use case. Perhaps a library or SaaS that makes it this straightforward.

1 Answer

Consider support vector machines, with a preliminary feature-extraction step consisting of stemming, stop-word removal, and n-grams (skip-grams in particular may provide a substantial boost, at the cost of a much larger feature space).
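As a rough illustration of that suggestion, here is a minimal sketch using scikit-learn (my choice of library; the answer does not name one). It trains one linear SVM per tag on TF-IDF features with stop-word removal and word 1- and 2-grams. The toy `posts` data and the `suggest_tags` helper are hypothetical. Stemming and skip-grams are left out because `TfidfVectorizer` does not provide them out of the box; they would need a custom tokenizer or analyzer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the pre-tagged comments.
posts = [
    ("Loved the guitar solo on the new album", ["music"]),
    ("The band played a great live concert last night", ["music"]),
    ("This new phone has an amazing camera and battery", ["tech"]),
    ("Laptop review: fast processor, poor battery life", ["tech"]),
    ("The film soundtrack album features a great band", ["music", "film"]),
    ("A film about a phone hacker, shot on a laptop camera", ["tech", "film"]),
]
texts = [text for text, _ in posts]

# Turn the tag lists into a binary indicator matrix (one column per tag).
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform([tags for _, tags in posts])

# Stop-word removal and word 1-/2-grams, then one linear SVM per tag.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2), sublinear_tf=True),
    OneVsRestClassifier(LinearSVC()),
)
model.fit(texts, Y)


def suggest_tags(text, top_k=3):
    """Return the top_k (tag, score) pairs, highest SVM margin first."""
    scores = model.decision_function([text])[0]
    ranked = sorted(zip(binarizer.classes_, scores), key=lambda p: p[1], reverse=True)
    return [(tag, float(score)) for tag, score in ranked[:top_k]]


print(suggest_tags("great guitar concert by the band"))
```

Note that the scores are SVM margins, not percentages. If a confidence % is needed, one option is to wrap each classifier in scikit-learn's `CalibratedClassifierCV`, which needs more examples per tag than this toy set has. Adding new tags over time means refitting, since each tag gets its own classifier.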