How to handle lots of missing values in GradientBoostingClassifier in sklearn
All features are of float data type, but some features consist mostly of NaN. I tried to train a model with GradientBoostingClassifier as below.
import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features[feature_headers], features[target_header], test_size=0.33, random_state=int(time.time()))
clf = GradientBoostingClassifier(random_state=int(time.time()), learning_rate=0.1, max_leaf_nodes=None, min_samples_leaf=1, n_estimators=300, min_samples_split=2, max_features=None)
clf.fit(train_x, train_y)
But an error is thrown:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I can't use imputation methods that fill in the NaN with the mean, median or most_frequent value, since that doesn't make any sense from the data's perspective. Is there a better way to make the classifier recognize NaN and treat it as an indicative feature as well?
Thanks a lot.
You will have to perform data cleaning. For that, you need to decide which columns you are going to include in the training dataset. For float columns, you may replace all null values with zero:
df.col1 = df.col1.fillna(0)
and for strings, you may replace them with a default value:
df.col2 = df.col2.fillna('')
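As a minimal sketch of the two replacements above, the toy DataFrame below fills a float column with zero and a string column with an empty string in one call. The column names `col1` and `col2` and the sample values are hypothetical; `DataFrame.fillna` accepts a dict that maps column names to fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one float column and one string column,
# each with a single missing entry.
df = pd.DataFrame({
    "col1": [1.5, np.nan, 3.0],
    "col2": ["a", None, "c"],
})

# Per-column fill values: zero for the float column,
# an empty string for the string column.
filled = df.fillna({"col1": 0.0, "col2": ""})
```

Note that after this step a filled zero can no longer be told apart from a genuine zero, so this only suits data where that distinction does not matter.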
Now, if you want to fill in an average or some trend value instead, you may use the same learning algorithm to predict the missing values and fill them in. To run the algorithm, first replace the null values with placeholders; these can later be replaced with more accurate predicted values.
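The idea of predicting missing values can be sketched roughly as follows. This is one possible reading of the answer, not its exact method: the helper `fill_with_predictions`, the choice of `GradientBoostingRegressor`, the placeholder of `0.0`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def fill_with_predictions(df, target_col, placeholder=0.0):
    """Fill NaN in target_col with values predicted from the other columns."""
    missing = df[target_col].isna()
    if not missing.any() or missing.all():
        # Nothing to predict, or no known values to learn from.
        return df.copy()
    # The other columns get a placeholder first so the regressor can run.
    predictors = df.drop(columns=[target_col]).fillna(placeholder)
    model = GradientBoostingRegressor(random_state=0)
    model.fit(predictors[~missing], df.loc[~missing, target_col])
    filled = df.copy()
    filled.loc[missing, target_col] = model.predict(predictors[missing])
    return filled

# Synthetic example: column "b" depends on "a"; 20 values of "b" are removed.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200)})
toy["b"] = 2.0 * toy["a"] + rng.normal(scale=0.1, size=200)
toy.loc[toy.index[:20], "b"] = np.nan

filled = fill_with_predictions(toy, "b")
```

Rows that already had a value in `b` are left unchanged; only the missing entries receive predictions.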
Note: many learning algorithms, including GradientBoostingClassifier as the error above shows, cannot run with null values.
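If the missingness itself should stay visible to the classifier, as the question asks, one option inside scikit-learn is to fill with a constant and append a was-missing flag per feature. The sketch below is not part of the original answer; it assumes synthetic data and uses `SimpleImputer` with `add_indicator=True` in front of `GradientBoostingClassifier`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration): 3 float features,
# label derived from the first feature before values are removed.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.3] = np.nan  # remove roughly 30% of entries

model = make_pipeline(
    # Fill NaN with a constant and append one 0/1 "was missing" column
    # for every feature that had missing values during fit.
    SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
model.fit(X, y)
pred = model.predict(X)
```

The classifier then sees both the filled values and the indicator columns, so a filled zero can be distinguished from a real one.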
xgboost.XGBClassifier handles np.nan without imputation; see here.
xgboost has an easy-to-use sklearn API; look at the documentation.
xgboost.XGBClassifier is fundamentally very close to GradientBoostingClassifier; both are gradient boosting methods for classification. See for example here.