xgboost binary logistic regression
I am having problems running logistic regression with xgboost, which can be summarized by the following example.
Let's assume I have a very simple dataframe with two predictors and one target variable:
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': pd.Series([1, 0, 0, 1]), 'X2': pd.Series([0, 1, 1, 0]), 'Y': pd.Series([0, 1, 1, 0])})
I can't post images because I'm new here, but we can clearly see that when X1 = 1 and X2 = 0, Y is 0, and when X1 = 0 and X2 = 1, Y is 1.
My idea is to build a model that outputs the probability that an observation belongs to each of the classes, so if I run xgboost to predict two new observations, (1,0) and (0,1), like so:
X = df[['X1', 'X2']].values
y = df['Y'].values

params = {'objective': 'binary:logistic',
          'num_class': 2}

clf1 = xgb.train(params=params, dtrain=xgb.DMatrix(X, y), num_boost_round=100)

# the two new observations, (1, 0) and (0, 1)
test = pd.DataFrame({'X1': pd.Series([1, 0]), 'X2': pd.Series([0, 1])})
clf1.predict(xgb.DMatrix(test.values))
the output is:
array([[ 0.5,  0.5],
       [ 0.5,  0.5]], dtype=float32)
which, I imagine, means that for the first observation there is a 50% chance of it belonging to each of the classes.
I'd like to know why the algorithm won't output a proper (1,0), or something closer to that, if the relationship between the variables is so clear.
FYI, I did try with more data (I'm only using 4 rows for simplicity) and the behavior is almost the same; what I do notice is that not only do the probabilities not sum to 1, they are often very small, like so (this result is from a different dataset, nothing to do with the example above):
array([[ 0.00356463,  0.00277259],
       [ 0.00315137,  0.00268578],
       [ 0.00453343,  0.00157113],
       ...
OK, here's what is happening.
The clue as to why it isn't working is the fact that on the smaller datasets it cannot train properly. I trained this exact model, and observing the dump of all the trees, you will see that they cannot split.
(tree dump below)
NO SPLITS, they have been pruned!
[1]  "booster[0]" "0:leaf=-0" "booster[1]" "0:leaf=-0" "booster[2]" "0:leaf=-0"
[7]  "booster[3]" "0:leaf=-0" "booster[4]" "0:leaf=-0" "booster[5]" "0:leaf=-0"
[13] "booster[6]" "0:leaf=-0" "booster[7]" "0:leaf=-0" "booster[8]" "0:leaf=-0"
[19] "booster[9]" "0:leaf=-0"
There isn't enough weight in each of the leaves to overpower xgboost's internal regularization (which penalizes the tree for growing).
These parameters may or may not be accessible from the Python version, but you can grab them from R if you do a GitHub install.
http://xgboost.readthedocs.org/en/latest/parameter.html
lambda [default=1] L2 regularization term on weights
alpha [default=0] L1 regularization term on weights
Basically, this is why your example trains better as you add more data, but cannot train at all with only 4 examples and default settings.