I have a dataframe with several quantitative columns and one column that contains text. I used Doc2Vec to create document vectors for the documents in the text column; each row contains one document. I then appended these vectors to my dataframe as a new column. The column looks like this:
0 [0.47076994, -0.09282584, -0.09208749, 0.30252...
1 [-0.15832177, 0.38922963, -0.31112054, 0.22017...
2 [0.5254741, -0.34781212, -0.53806645, 0.081143...
3 [0.17344594, 0.013028251, 0.20382093, 0.060029...
4 [0.08430116, 0.032912098, -0.065583326, -0.071...
...
36428 [0.09610936, 0.041300587, 0.059615657, -0.2326...
36429 [0.06046782, 0.024129117, 0.4055158, -0.180141...
36430 [0.022796798, -0.28664422, 0.48804978, 0.12942...
36431 [0.02612863, 0.028734691, 0.14668714, 0.335453...
36432 [-0.1277831, 0.2591519, 0.22287264, -0.1575013...
Name: DocVecs, Length: 36433, dtype: object
Furthermore, I created an xgboost model that works fine for the quantitative columns in my dataframe, but whenever I pass my vectors along with my other variables to the model, I get the following error message:
Train Index: [ 3644 3645 3646 ... 36430 36431 36432]
Test Index: [ 0 1 2 ... 3641 3642 3643]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-d67aadd312b1> in <module>
----> 1 boosting(dataframe, ["var1", "var2", "var3", "var4", "var5", "var6","var7", "var8", "var9", "DocVecs"])
<ipython-input-53-a08737678d5a> in boosting(df, features)
15 print("Test Index: ", test_index)
16 X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
---> 17 model.fit(X_train,y_train)
18 #Prediction
19 y_pred = model.predict(X_test)
~\Anaconda3\lib\site-packages\xgboost\sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, callbacks)
358 missing=self.missing, nthread=self.n_jobs)
359 else:
--> 360 trainDmatrix = DMatrix(X, label=y, missing=self.missing, nthread=self.n_jobs)
361
362 evals_result = {}
~\Anaconda3\lib\site-packages\xgboost\core.py in __init__(self, data, label, missing, weight, silent, feature_names, feature_types, nthread)
402 self._init_from_csc(data)
403 elif isinstance(data, np.ndarray):
--> 404 self._init_from_npy2d(data, missing, nthread)
405 elif isinstance(data, DataTable):
406 self._init_from_dt(data, nthread)
~\Anaconda3\lib\site-packages\xgboost\core.py in _init_from_npy2d(self, mat, missing, nthread)
476 # we try to avoid data copies if possible (reshape returns a view when possible
477 # and we explicitly tell np.array to try and avoid copying)
--> 478 data = np.array(mat.reshape(mat.size), copy=False, dtype=np.float32)
479 handle = ctypes.c_void_p()
480 missing = missing if missing is not None else np.nan
ValueError: setting an array element with a sequence
Any ideas how to fix this?
Here is the code that produces the error. I defined a function to combine xgboost with kfold cv.
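import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def boost(df, features):
    # Shuffle the rows
    df = df.reindex(np.random.permutation(df.index))
    # Define y and X
    X = df[features].values
    y = df['output'].values
    # K-fold cross-validation and XGBoost
    RMSE = []
    # n_estimators is the XGBRegressor parameter for the number of trees
    model = xgb.XGBRegressor(max_depth=15, n_estimators=1000, learning_rate=0.1)
    # random_state only matters when shuffle=True, so it is omitted here
    cv = KFold(n_splits=10, shuffle=False)
    for train_index, test_index in cv.split(X):
        print("Train Index: ", train_index)
        print("Test Index: ", test_index)
        X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
        # This is the call that raises the ValueError when "DocVecs" is in features
        model.fit(X_train, y_train)
        # Prediction
        y_pred = model.predict(X_test)
        pred = pd.DataFrame()
        pred["Prediction"] = y_pred
        # RMSE for this fold
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        print(rmse)
        RMSE.append(rmse)
    # Mean RMSE over all folds
    print(np.mean(RMSE))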
When I include my vectors in the features passed to the boost function, the error occurs.
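This error happens under the hood, when xgboost converts X to a float32 NumPy array (the last frame of the traceback). It means you are trying to set a single NumPy array element to something that is not a number, here a whole sequence.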
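To see where the sequence comes from, here is a minimal sketch that reproduces the failure without xgboost. It assumes, as the printed column suggests, that each DocVecs cell holds a whole vector; the toy dataframe and its values are made up for illustration, and the conversion is only an approximation of what the traceback shows xgboost doing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe: one numeric column plus one
# column whose cells each hold a whole vector (values are illustrative).
df = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0],
    "DocVecs": [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])],
})

X = df[["var1", "DocVecs"]].values
print(X.dtype)  # object, not a float dtype
print(X.shape)  # one column for DocVecs, each cell still a vector

error_message = None
try:
    # Roughly the float32 conversion seen in the last traceback frame.
    np.array(X.reshape(X.size), dtype=np.float32)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The object dtype is the warning sign: NumPy cannot fit a multi-element vector into one float32 slot, so the conversion raises.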
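One thing that may fix it is forcing all the X values to be floats, e.g. changing
X = df[features].values
to
X = np.array([np.array(x).astype(np.float32) for x in df[features].values])
However, this cast only succeeds if every cell in a row is a single number. If X has a row whose cell holds a vector (as each DocVecs entry does) or a value that isn't a number, the cast fails with the same error. In that case the vector has to be expanded so that each of its components becomes its own numeric column.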
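If the DocVecs cells do hold vectors, one possible approach (not part of the suggestion above, just a sketch) is to stack the vectors into their own 2-D array and join it to the numeric columns, so every vector component becomes a separate feature. The helper name build_feature_matrix and the toy data are hypothetical, and the sketch assumes all document vectors have the same length.

```python
import numpy as np
import pandas as pd

def build_feature_matrix(df, numeric_cols, vector_col):
    """Return a 2-D float32 array: numeric columns, then one column per vector component."""
    numeric_part = df[numeric_cols].to_numpy(dtype=np.float32)
    # Stack the per-row vectors into an (n_rows, vector_size) array.
    # Assumes every vector in the column has the same length.
    vector_part = np.vstack(df[vector_col].tolist()).astype(np.float32)
    return np.hstack([numeric_part, vector_part])

# Toy data for illustration only.
df = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0],
    "var2": [10.0, 20.0, 30.0],
    "DocVecs": [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])],
})

X = build_feature_matrix(df, ["var1", "var2"], "DocVecs")
print(X.shape)
print(X.dtype)
```

In the boost function, this would replace X = df[features].values, with "DocVecs" passed separately from the list of numeric feature names. The resulting X is a plain float array, so indexing with train_index and test_index works as before.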