
Combining probabilistic estimates with support vector machine in sklearn

I'm currently using a support vector machine to predict which item a user will buy, given demographic data. The data set also includes how many users of a certain age group bought each item. It looks something like this:

age      item a   item b   item c
15-20        10        3       10
20-25         1        5        6
25-30         2        5        6

I am unsure how to incorporate this into the training data. The only way I can think of is to include a set of probability values of the user buying each item, but this is very unwieldy. Another idea I had was to use an ensemble learning method and combine the SVM with, say, a Naive Bayes classifier. I am using sklearn to build my model.

When you want to introduce weightings for data points, SVM is no longer so attractive. The underlying algebra doesn't work as well when identical or very close data points have differing classifications. From the data you give above, I'd expect Naive Bayes to give faster computation and cleaner results.

That said, which SVM algorithm are you using? If it's one that weights the vectors in some fashion (for example, it uses each point exactly once, or picks a random point for each iteration of a gradient descent approach), then you can certainly handle this by adding each point to your training set the given number of times. For instance, you'd have 10 rows stating that teens bought item a.
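As a rough sketch of the row-duplication idea, assuming the count table from the question is available as a NumPy array: each (age group, item) cell is repeated as many times as its count, the age group is one-hot encoded, and a linear `SVC` is fitted. The variable names and the choice of a linear kernel are illustrative assumptions, not something taken from the question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Count table from the question: rows are age groups, columns are items.
age_groups = np.array(["15-20", "20-25", "25-30"])
items = np.array(["a", "b", "c"])
counts = np.array([[10, 3, 10],
                   [1, 5, 6],
                   [2, 5, 6]])

# One (age group, item) pair per table cell, flattened row by row.
age_idx, item_idx = np.indices(counts.shape)
repeats = counts.ravel()

# Repeat each pair as many times as it was observed.
X_rows = np.repeat(age_groups[age_idx.ravel()], repeats).reshape(-1, 1)
y_rows = np.repeat(items[item_idx.ravel()], repeats)

# One-hot encode the categorical age group before fitting the SVM.
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_rows)

clf = SVC(kernel="linear")  # kernel choice is an assumption for this sketch
clf.fit(X_encoded, y_rows)

# Hard prediction for one age group; no purchase shares are returned.
prediction = clf.predict(encoder.transform([["20-25"]]))
print(prediction)
```

`SVC.fit` also accepts a `sample_weight` argument, so passing the counts as weights on nine unique rows may be a more compact alternative to duplicating rows.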

On the other hand, Naive Bayes would give you weightings for a statistically accurate model. Instead of predicting almost unilaterally that 20-somethings will buy item c (which is actually a large minority of the purchases), you'd have a model that could tell you that 48% of people in their 20s will buy item c (12 of the 25 purchases in the 20-25 and 25-30 rows), and almost as many will buy item b.
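A minimal sketch of the Naive Bayes route, assuming age group is the only feature and using sklearn's `CategoricalNB` with the table counts passed as `sample_weight`. The near-zero `alpha` is an assumption made here so that the predicted probabilities stay close to the raw purchase shares; the default smoothing would shift them slightly.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Count table from the question: rows are age groups, columns are items.
age_groups = ["15-20", "20-25", "25-30"]
items = np.array(["a", "b", "c"])
counts = np.array([[10, 3, 10],
                   [1, 5, 6],
                   [2, 5, 6]])

# One training row per table cell; the count becomes the row's weight.
age_idx, item_idx = np.indices(counts.shape)
X = age_idx.reshape(-1, 1)       # age group encoded as 0, 1, 2
y = items[item_idx.ravel()]      # purchased item
weights = counts.ravel()

# alpha close to zero keeps smoothing negligible (an assumption for this sketch).
nb = CategoricalNB(alpha=1e-9)
nb.fit(X, y, sample_weight=weights)

# Per-age-group probability of buying each item.
proba = nb.predict_proba(np.arange(len(age_groups)).reshape(-1, 1))
for group, row in zip(age_groups, proba):
    shares = ", ".join(f"{item}: {p:.0%}" for item, p in zip(nb.classes_, row))
    print(f"{group} -> {shares}")
```

With a single feature, the posterior reduces to each item's share of purchases within the age group, so the output is a probability per item rather than one hard label.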

Does this discussion help?


 