Machine learning algorithm to classify only positive and unlabeled data

Original post: 2014-04-04 20:59:36. Tags: algorithm, machine-learning, weka

I am trying to classify text using only positive examples and unlabeled data. I just want the algorithm to identify the positive data and mark everything else as negative. What would be a good machine learning algorithm for classifying such data? I tried several algorithms in Weka, but almost all of the classifiers give a lot of false positives.

1 Answer

If you believe that the unlabelled data is mostly negative, then probably the best thing to do is to label all unlabelled data as "negative" and run your classifier of choice. Note that if an unlabelled test point is predicted to be positive, this does not mean the prediction is wrong: some of your unlabelled data could be positive. So it's hard to judge how well your classifier is doing in this setting. If you believe that your unlabelled data might be biased toward the positives, then you're probably better off training a so-called "one-class classifier" on the positive data only; a popular example is the one-class SVM.
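To make the first suggestion concrete, here is a minimal sketch in plain Python rather than Weka. It assumes a small multinomial naive Bayes classifier written from scratch as the "classifier of choice"; the function names (`tokenize`, `train_unlabeled_as_negative`, `predict`) and the toy documents are hypothetical and only there for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_unlabeled_as_negative(positive_docs, unlabeled_docs):
    """Fit a multinomial naive Bayes model, treating every unlabeled doc as negative."""
    counts = {"pos": Counter(), "neg": Counter()}
    for doc in positive_docs:
        counts["pos"].update(tokenize(doc))
    for doc in unlabeled_docs:  # assumption: unlabeled data is mostly negative
        counts["neg"].update(tokenize(doc))
    vocab = set(counts["pos"]) | set(counts["neg"])
    n_docs = len(positive_docs) + len(unlabeled_docs)
    priors = {
        "pos": len(positive_docs) / n_docs,
        "neg": len(unlabeled_docs) / n_docs,
    }
    return {"counts": counts, "vocab": vocab, "priors": priors}


def predict(model, text):
    """Return "pos" or "neg" using log-probabilities with add-one smoothing."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label in ("pos", "neg"):
        total = sum(model["counts"][label].values())
        score = math.log(model["priors"][label])
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                count = model["counts"][label][token]
                score += math.log((count + 1) / (total + vocab_size))
        scores[label] = score
    return "pos" if scores["pos"] > scores["neg"] else "neg"


# Toy data: the last unlabeled document is really a positive one.
positive = ["great battery life", "great screen and battery", "love the battery"]
unlabeled = [
    "shipping was slow",
    "box arrived damaged",
    "slow delivery and damaged box",
    "great battery overall",
]

model = train_unlabeled_as_negative(positive, unlabeled)
print(predict(model, "battery is great"))
print(predict(model, "slow shipping"))
```

The hidden positive in the unlabeled set shows why evaluation is tricky here: a "pos" prediction on an unlabeled document may be a correct detection rather than a false positive, so false-positive counts measured against unlabeled data tend to overstate the error.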