All features are floats, and some features contain a dominant amount of NaN values. I tried to train a model with GradientBoostingClassifier as below.
import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(features[feature_headers], features[target_header], test_size=0.33, random_state=int(time.time()))
clf = GradientBoostingClassifier(random_state=int(time.time()), learning_rate=0.1, max_leaf_nodes=None, min_samples_leaf=1, n_estimators=300, min_samples_split=2, max_features=None)
clf.fit(train_x, train_y)
But error will be thrown:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I can't use imputation methods to fill in the NaNs with the mean, median, or most frequent value, since those replacements don't make sense from the data's perspective. Is there a better way to make the classifier recognize NaN and treat it as an indicative feature as well? Thanks a lot.
You will have to perform data cleaning. For that, decide which columns you are going to include in the training dataset. For floats, you may replace all null values with zero:

df.col1 = df.col1.fillna(0)

and for strings, you may replace them with a default value:

df.col2 = df.col2.fillna('')

Now, if you want to use an average or some trend value instead, you can train the same learning algorithm to predict the missing values and fill them in. To get the algorithm running, first replace the null values with placeholders; these can later be replaced with more accurate predicted values.

Note: most learning algorithms can't run with null values.
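Since the question asks for NaN to remain visible as an indicative feature, one option (a sketch, using scikit-learn's SimpleImputer with its add_indicator option; the column names here are made up) is to fill the nulls and simultaneously append binary missing-value indicator columns, so the classifier still sees which entries were originally NaN:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with NaNs (hypothetical columns)
df = pd.DataFrame({"col1": [1.0, np.nan, 3.0],
                   "col2": [np.nan, 0.5, 0.25]})

# Fill with a constant (0 here) and append one 0/1 indicator
# column per feature that contained missing values
imp = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
filled = imp.fit_transform(df)
print(filled.shape)  # (3, 4): 2 filled columns + 2 missing indicators
```

The resulting array can be fed straight into GradientBoostingClassifier, since it no longer contains NaN, while the indicator columns preserve the missingness signal.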
xgboost.XGBClassifier handles np.nan without imputation; see here. xgboost has a scikit-learn API that is easy to use; look at the documentation. xgboost.XGBClassifier is fundamentally very close to GradientBoostingClassifier: both are gradient boosting methods for classification. See for example here.