How to handle a large number of missing values in sklearn's GradientBoostingClassifier

Question

All features are of float data type, but some features consist mostly of NaN. I tried to train a model with GradientBoostingClassifier as below.

import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features[feature_headers], features[target_header], test_size=0.33, random_state=int(time.time()))
clf = GradientBoostingClassifier(random_state=int(time.time()), learning_rate=0.1, max_leaf_nodes=None, min_samples_leaf=1, n_estimators=300, min_samples_split=2, max_features=None)
clf.fit(train_x, train_y)

But an error is thrown:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I can't use imputation methods that fill in the NaN with the mean, median, or most_frequent value, since that doesn't make sense from the data's perspective. Is there a better way to make the classifier recognize NaN and treat it as an indicative feature as well? Thanks a lot.

Answer 1

You will have to perform data cleaning. For that, you need to decide which columns you are going to include in the training dataset. For floats, you may replace all null values with zero:

df.col1 = df.col1.fillna(0)

and for strings, you may replace them with a default value:

df.col2 = df.col2.fillna('')
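Since the question asks for NaN to act as an indicative feature, one possible variation on the zero fill is to keep a record of which values were missing. The following is a minimal sketch, not the answerer's own method: it assumes scikit-learn's SimpleImputer with add_indicator=True, and the toy data, column names f1 and f2, and the rule that the label depends on missingness are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two float features, about 40% of f1 is missing.
rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
missing_mask = rng.random(200) < 0.4
X.loc[missing_mask, "f1"] = np.nan

# Assumption for the demo: the label is determined by missingness itself.
y = missing_mask.astype(int)

# Fill NaN with 0 and append one 0/1 indicator column per feature
# that contained missing values during fit.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
model.fit(X, y)

# Inspect what the classifier actually sees: f1, f2, and the f1 indicator.
transformed = model.named_steps["simpleimputer"].transform(X)
train_accuracy = model.score(X, y)
```

The indicator column lets the trees split on "was this value missing" directly, so the placeholder value 0 no longer has to double as a missingness signal.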

Now, if you want to fill in an average or some trend value instead, you may use the same learning algorithm to predict the missing values and fill them in. To run the algorithm, first replace the null values with placeholders; these can later be replaced with more accurate predicted values.
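The idea of starting from a placeholder fill and then refining it with predicted values can be sketched with scikit-learn's IterativeImputer. This is a swapped-in technique rather than the exact procedure the answer describes: it is still marked experimental in scikit-learn, it uses its own regressor (BayesianRidge by default) instead of the classifier being trained, and the correlated toy data below is an assumption made for the demo.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: x2 is almost a linear function of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=300)
X = np.column_stack([x1, x2])

# Knock out about 30% of the second column.
X_missing = X.copy()
hole = rng.random(300) < 0.3
X_missing[hole, 1] = np.nan

# Starts from a simple initial fill, then repeatedly predicts each
# incomplete feature from the other features.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Compare against a plain zero fill on the entries that were removed.
model_error = np.abs(X_filled[hole, 1] - X[hole, 1]).mean()
zero_error = np.abs(0.0 - X[hole, 1]).mean()
```

This only helps when the missing feature is predictable from the others; for features that are mostly NaN, as in the question, there may be too few observed values for the predictions to be meaningful.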

Note: Many learning algorithms, including the GradientBoostingClassifier in the question, refuse to fit on data that contains null values.

Answer 2

xgboost.XGBClassifier handles np.nan without imputation; see here.

xgboost has an easy-to-use sklearn API; look at the documentation.

xgboost.XGBClassifier is fundamentally very close to GradientBoostingClassifier; both are gradient boosting methods for classification. See for example here.


2 answers

Answer 1
3 Accepted 2017-11-25 09:39:01

Answer 2
1 2019-03-07 10:01:31
