All features are floats, and some features contain a dominant amount of NaN values. I tried to train a model with GradientBoostingClassifier as below.
import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(features[feature_headers], features[target_header], test_size=0.33, random_state=int(time.time()))
clf = GradientBoostingClassifier(random_state=int(time.time()), learning_rate=0.1, max_leaf_nodes=None, min_samples_leaf=1, n_estimators=300, min_samples_split=2, max_features=None)
clf.fit(train_x, train_y)
But error will be thrown:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I can't use imputation methods to fill in the NaNs with the mean, median, or most frequent value, since those replacements don't make sense from the data's perspective. Is there a better way to make the classifier recognize NaN and treat it as an indicative feature as well? Thanks a lot.
You will have to perform data cleaning. For that, decide which columns you are going to include in the training dataset. For floats, you may replace all null values with zero:

df.col1 = df.col1.fillna(0)

and for strings, you may replace them with a default value:

df.col2 = df.col2.fillna('')

Now, if you want to use an average or some trend value instead, you can train the same learning algorithm to predict the missing values and fill them in. To get the algorithm running, first replace the null values with placeholders; these can later be replaced with more accurate predicted values.

Note: most learning algorithms can't run with null values.
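Since the question asks for NaN to remain visible as an indicative feature, one option (a sketch, using scikit-learn's SimpleImputer with its add_indicator option; the column names here are made up) is to fill the nulls and simultaneously append binary missing-value indicator columns, so the classifier still sees which entries were originally NaN:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with NaNs (hypothetical columns)
df = pd.DataFrame({"col1": [1.0, np.nan, 3.0],
                   "col2": [np.nan, 0.5, 0.25]})

# Fill with a constant (0 here) and append one 0/1 indicator
# column per feature that contained missing values
imp = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
filled = imp.fit_transform(df)
print(filled.shape)  # (3, 4): 2 filled columns + 2 missing indicators
```

The resulting array can be fed straight into GradientBoostingClassifier, since it no longer contains NaN, while the indicator columns preserve the missingness signal.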
xgboost.XGBClassifier handles np.nan without imputation; see here. xgboost has a scikit-learn API that is easy to use; look at the documentation. xgboost.XGBClassifier is fundamentally very close to GradientBoostingClassifier: both are gradient boosting methods for classification. See for example here.