Algorithm to classify instances from a dataset similar to another smaller dataset, where this smaller dataset represents a single class

I have a dataset that represents instances from a binary class problem. The twist here is that there are only instances from the positive class, and I have none from the negative one. Or rather, I want to extract those negatives which are closest to the positives.

To make it more concrete, let's say we have data on people who bought from our store and asked for a loyalty card, either at the time of purchase or later of their own volition. Privacy concerns aside (it's just an example), we have different attributes like age, postcode, etc.

The other set of clients, following our example, are clients who did not apply for the card.

What we want is to find the subset of those clients that are most similar to the ones in the first group who applied for the loyalty card, so that we can send them an offer to join the loyalty program.

It's not exactly a classification problem, because we are trying to pick out instances from within the group of "negatives".

It's not exactly clustering, which is typically unsupervised, because we already know one cluster (the loyalty-card clients).

I thought about using kNN, but I don't really know what my options are here.

I would also like to know how, if possible, this can be achieved with Weka or another Java library, and whether I should normalize all the attributes.

You could use anomaly detection algorithms. These algorithms tell you whether a new client belongs to the group of clients who got a loyalty card or not (in which case they would be an anomaly).

There are two basic ideas (taken from the article linked below):

  1. You transform the feature vectors of your positively labelled data (clients with a card) into a vector space of lower dimensionality (e.g. by using PCA). Then you can calculate the probability distribution of the resulting transformed data and check whether a new client belongs to the same statistical distribution or not. You can also compute the distance of a new client to the centroid of the transformed data and decide, using the standard deviation of the distribution, whether it is still close enough.

  2. The machine learning approach: you train an auto-encoder network on the clients-with-card data. An auto-encoder has a bottleneck in its architecture: it compresses the input data into a new feature vector of lower dimensionality and afterwards tries to reconstruct the input data from that compressed vector. If the training is done correctly, the reconstruction error for input data similar to the clients-with-card dataset should be smaller than for input data that is not similar to it (hopefully, these are the clients who do not want a card).
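Idea 1 could be sketched roughly as follows. This is a minimal illustration in Python with scikit-learn rather than Weka/Java (the same steps exist in Weka as PCA filters plus distance computations); all data and variable names here are hypothetical:

```python
# Idea 1: PCA projection of the positives, then a centroid-distance threshold
# derived from the standard deviation of the positives' own distances.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
positives = rng.normal(0.0, 1.0, size=(200, 5))   # clients with a card (toy data)
candidates = rng.normal(3.0, 1.0, size=(50, 5))   # clients without a card (toy data)

# Standardize using only the positives, then project to 2 dimensions.
scaler = StandardScaler().fit(positives)
pca = PCA(n_components=2).fit(scaler.transform(positives))

proj_pos = pca.transform(scaler.transform(positives))
centroid = proj_pos.mean(axis=0)

# Distances of the positives to their own centroid define "close enough".
pos_dist = np.linalg.norm(proj_pos - centroid, axis=1)
threshold = pos_dist.mean() + 2 * pos_dist.std()

proj_cand = pca.transform(scaler.transform(candidates))
cand_dist = np.linalg.norm(proj_cand - centroid, axis=1)
similar = cand_dist <= threshold   # candidates that look like card holders
```

Instead of a hard threshold you could also sort the candidates by `cand_dist` and take the top k, which matches the "send an offer to the most similar clients" goal more directly.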
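Idea 2 could be sketched as below. As a stand-in for a dedicated deep-learning framework, this uses scikit-learn's `MLPRegressor` trained to reconstruct its own input through a small bottleneck layer, which is the same principle on a toy scale; all data and names are hypothetical:

```python
# Idea 2: an auto-encoder trained only on card holders; candidates with a
# reconstruction error comparable to the positives' are the similar ones.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
positives = rng.normal(0.0, 1.0, size=(300, 5))   # clients with a card (toy data)
candidates = rng.normal(4.0, 1.0, size=(60, 5))   # clients without a card (toy data)

scaler = StandardScaler().fit(positives)
X = scaler.transform(positives)

# hidden_layer_sizes=(2,) is the bottleneck: 5 features -> 2 -> 5.
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000,
                           random_state=0).fit(X, X)

def reconstruction_error(data):
    """Per-sample mean squared reconstruction error."""
    Z = scaler.transform(data)
    return np.mean((autoencoder.predict(Z) - Z) ** 2, axis=1)

pos_err = reconstruction_error(positives)
cand_err = reconstruction_error(candidates)

# Accept candidates whose error falls within the bulk of the positives' errors.
threshold = np.percentile(pos_err, 95)
similar = cand_err <= threshold
```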

Have a look at this tutorial for a start: https://towardsdatascience.com/how-to-use-machine-learning-for-anomaly-detection-and-condition-monitoring-6742f82900d7

Both methods would require standardizing the attributes first.

Or try a one-class support vector machine.

This approach tries to model the class boundary and will give you a binary decision on whether a point should be in the class or not. It can be seen as a simple density estimate. The main benefit is that the set of support vectors will be much smaller than the training data.
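A one-class SVM sketch might look like the following (again Python/scikit-learn for brevity; Weka has comparable one-class classifiers available as packages). All data and names are hypothetical:

```python
# One-class SVM fitted on card holders only; predict() gives the binary
# in-class / out-of-class decision the answer describes.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
positives = rng.normal(0.0, 1.0, size=(200, 4))   # clients with a card (toy data)
candidates = rng.normal(5.0, 1.0, size=(40, 4))   # clients without a card (toy data)

scaler = StandardScaler().fit(positives)

# nu bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(scaler.transform(positives))

# +1 = inside the learned boundary (similar to card holders), -1 = outside.
labels = ocsvm.predict(scaler.transform(candidates))

# The decision function can also rank candidates by closeness to the class.
scores = ocsvm.decision_function(scaler.transform(candidates))
```

Note that `ocsvm.support_vectors_` holds only the boundary points, which is the compactness benefit mentioned above.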

Or simply use nearest-neighbor distances to rank users.
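The nearest-neighbor ranking is the simplest of the three to sketch: score each candidate by its distance to the closest card holders and rank ascending. Data and names below are hypothetical:

```python
# Rank candidates by mean distance to their k nearest card holders;
# the smallest scores are the clients most similar to existing card holders.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
positives = rng.normal(0.0, 1.0, size=(150, 3))   # clients with a card (toy data)
candidates = rng.normal(0.0, 2.0, size=(30, 3))   # clients without a card (toy data)

scaler = StandardScaler().fit(positives)
nn = NearestNeighbors(n_neighbors=3).fit(scaler.transform(positives))

# Mean distance to the 3 nearest card holders; smaller = more similar.
dist, _ = nn.kneighbors(scaler.transform(candidates))
scores = dist.mean(axis=1)

ranking = np.argsort(scores)   # candidate indices, most similar first
top_10 = ranking[:10]          # e.g. send the offer to these clients
```

This also answers the normalization question implicitly: without standardization, an attribute like postcode would dominate the distance over an attribute like age.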
