Predict one column based on the other columns with XGBoost in Python
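I have a large dataframe and I want to use XGBoost to predict the last column from the other columns. My code is below, but the predictions are wrong: I get constant values. The data is not a time series, and I also cannot plot my trees.
In general, with 20 columns, is it possible to predict the 20th column from the other 19 columns with this approach?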
#XGBoost
import numpy as np  # needed for np.sqrt in the RMSE step
import xgboost as xgb
from sklearn.metrics import mean_squared_error
#Separate the target variable
X, y = f.iloc[:,:-1],f.iloc[:,-1]
data_dmatrix = xgb.DMatrix(data=X,label=y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
#Regressor
# 'reg:linear' is a deprecated alias of 'reg:squarederror'
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)
#Fit the regressor to the training set and make predictions on the test set
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
#RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
#k-fold Cross Validation
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
'max_depth': 10, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
print((cv_results["test-rmse-mean"]).tail(1))
#Visualizing
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
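#plot the trees
# plot_tree needs the graphviz package to be installed
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [50, 10]  # set the size before plotting so it takes effect
xgb.plot_tree(xg_reg, num_trees=5)
plt.show()
#Examine the importance of each feature column in the original dataset within the model
plt.rcParams['figure.figsize'] = [5, 5]  # set the size before plotting so it takes effect
xgb.plot_importance(xg_reg)
plt.show()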
First of all, yes: predicting the last column from the first 19 columns this way is fine.
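If the model only produces constant values, I would change the model's parameters (for example `alpha`, `n_estimators` or `learning_rate`).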
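To illustrate why a parameter such as `alpha` can matter, here is a minimal sketch of the regularized leaf-weight formula used in gradient-boosted trees with L1 (`alpha`) and L2 (`lambda`) penalties. It is plain Python, not XGBoost's API: the helper names `leaf_weight` and `squared_error_sums` and the five sample values are hypothetical, and this is only one possible cause of constant predictions, not a diagnosis of the data in the question.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight for one leaf under L1 (alpha) and L2 (lambda) penalties."""
    # Soft-threshold the gradient sum: anything within [-alpha, alpha]
    # is shrunk to zero, so the leaf contributes nothing.
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        return 0.0
    return -shrunk / (hess_sum + reg_lambda)


def squared_error_sums(y_true, y_pred):
    """Gradient and Hessian sums for squared error: g = pred - y, h = 1."""
    grad_sum = sum(p - t for t, p in zip(y_true, y_pred))
    hess_sum = float(len(y_true))
    return grad_sum, hess_sum


# Hypothetical leaf with five samples whose targets are on a small scale,
# all currently predicted at an assumed starting score of 0.5.
y_leaf = [0.2, 0.4, 0.6, 0.9, 1.1]
start = [0.5] * len(y_leaf)
g, h = squared_error_sums(y_leaf, start)

w_alpha10 = leaf_weight(g, h, reg_alpha=10.0)  # heavy L1 penalty
w_alpha0 = leaf_weight(g, h, reg_alpha=0.0)    # no L1 penalty
print("gradient sum:", g)
print("leaf weight with alpha=10:", w_alpha10)
print("leaf weight with alpha=0: ", w_alpha0)
```

In this toy leaf the gradient sum is smaller in magnitude than `alpha=10`, so the leaf weight is zero and the prediction does not move from its starting value. If something similar happens in every leaf, all predictions stay constant; lowering `alpha` or adding more boosting rounds would be the first things to try.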
Alternatively, train a linear model first as a baseline.