
xgboost binary logistic regression

I am having problems running logistic regression with xgboost, which can be summarized in the following example.

Let's assume I have a very simple dataframe with two predictors and one target variable:

import pandas as pd

df = pd.DataFrame({'X1': pd.Series([1,0,0,1]),
                   'X2': pd.Series([0,1,1,0]),
                   'Y':  pd.Series([0,1,1,0])})

I can't post images because I'm new here, but we can clearly see that when X1=1 and X2=0, Y is 0, and when X1=0 and X2=1, Y is 1.

My idea is to build a model that outputs the probability that an observation belongs to each of the classes, so I run xgboost to predict two new observations, (1,0) and (0,1), like so:

import xgboost as xgb

X = df[['X1','X2']].values
y = df['Y'].values

# the two new observations, (1,0) and (0,1)
test = pd.DataFrame({'X1': pd.Series([1,0]), 'X2': pd.Series([0,1])})

params = {'objective': 'binary:logistic',
          'num_class': 2}

clf1 = xgb.train(params=params, dtrain=xgb.DMatrix(X, y), num_boost_round=100)
clf1.predict(xgb.DMatrix(test.values))

the output is:

array([[ 0.5,  0.5],
       [ 0.5,  0.5]], dtype=float32)

which, I imagine, means that there is a 50% chance of the first observation belonging to each of the classes.

I'd like to know why the algorithm won't output a proper (1,0), or something closer to that, if the relationship between the variables is clear.

FYI, I did try with more data (I'm only using 4 rows for simplicity) and the behavior is almost the same; what I do notice is that not only do the probabilities not sum to 1, they are often very small, like so: (this result is on a different dataset, nothing to do with the example above)

array([[ 0.00356463,  0.00277259],
       [ 0.00315137,  0.00268578],
       [ 0.00453343,  0.00157113],

OK - here's what is happening.

The clue as to why it isn't working is the fact that it cannot train properly on the smaller dataset. I trained this exact model, and if you observe the dump of all the trees you will see that they cannot split.

(tree dump below)

NO SPLITS - they have been pruned!

[1] "booster[0]" "0:leaf=-0" "booster[1]" "0:leaf=-0" "booster[2]" "0:leaf=-0" [7] "booster[3]" "0:leaf=-0" "booster[4]" "0:leaf=-0" "booster[5]" "0:leaf=-0" [13] "booster[6]" "0:leaf=-0" "booster[7]" "0:leaf=-0" "booster[8]" "0:leaf=-0" [19] "booster[9]" "0:leaf=-0"
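Those 0:leaf=-0 entries mean every booster is a single root leaf with a raw score of 0, so the summed margin over all 100 rounds is still 0, and the logistic link maps a 0 margin to exactly the 0.5 in the question's output. A minimal sketch of that mapping (plain Python, not xgboost's internals):

```python
import math

def sigmoid(margin):
    # logistic link used by binary:logistic to turn a raw margin into a probability
    return 1.0 / (1.0 + math.exp(-margin))

# 100 boosters, each contributing a leaf value of 0, sum to a raw margin of 0
raw_margin = sum(0.0 for _ in range(100))
print(sigmoid(raw_margin))  # 0.5 - exactly the prediction in the question
```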

There isn't enough weight in each of the leaves to overpower xgboost's internal regularization (which penalizes the tree for growing).

This parameter may or may not be accessible from the Python version, but you can grab it from R if you do a GitHub install.

http://xgboost.readthedocs.org/en/latest/parameter.html

lambda [default=1] L2 regularization term on weights

alpha [default=0] L1 regularization term on weights

Basically, this is why your example trains better as you add more data, but cannot train at all with only 4 examples and default settings.

