How to plug doc2vec vectors into xgboost?

Question

I have a dataframe with several quantitative columns and a column that contains text. I used Doc2Vec to create Document Vectors for the documents in the text column. Each row contains one document. I then appended these Vectors to my Dataframe in a new column. The column looks like this:

0        [0.47076994, -0.09282584, -0.09208749, 0.30252...
1        [-0.15832177, 0.38922963, -0.31112054, 0.22017...
2        [0.5254741, -0.34781212, -0.53806645, 0.081143...
3        [0.17344594, 0.013028251, 0.20382093, 0.060029...
4        [0.08430116, 0.032912098, -0.065583326, -0.071...
                               ...                        
36428    [0.09610936, 0.041300587, 0.059615657, -0.2326...
36429    [0.06046782, 0.024129117, 0.4055158, -0.180141...
36430    [0.022796798, -0.28664422, 0.48804978, 0.12942...
36431    [0.02612863, 0.028734691, 0.14668714, 0.335453...
36432    [-0.1277831, 0.2591519, 0.22287264, -0.1575013...
Name: DocVecs, Length: 36433, dtype: object

I also built an xgboost model that works fine for the quantitative columns in my dataframe, but whenever I pass my vectors along with my other variables to the model, I get the following error message:

Train Index:  [ 3644  3645  3646 ... 36430 36431 36432]
Test Index:  [   0    1    2 ... 3641 3642 3643]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-d67aadd312b1> in <module>
----> 1 boosting(dataframe, ["var1", "var2", "var3", "var4", "var5", "var6","var7", "var8", "var9", "DocVecs"])

<ipython-input-53-a08737678d5a> in boosting(df, features)
     15         print("Test Index: ", test_index)
     16         X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
---> 17         model.fit(X_train,y_train)
     18         #Prediction
     19         y_pred = model.predict(X_test)

~\Anaconda3\lib\site-packages\xgboost\sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, callbacks)
    358                                    missing=self.missing, nthread=self.n_jobs)
    359         else:
--> 360             trainDmatrix = DMatrix(X, label=y, missing=self.missing, nthread=self.n_jobs)
    361 
    362         evals_result = {}

~\Anaconda3\lib\site-packages\xgboost\core.py in __init__(self, data, label, missing, weight, silent, feature_names, feature_types, nthread)
    402             self._init_from_csc(data)
    403         elif isinstance(data, np.ndarray):
--> 404             self._init_from_npy2d(data, missing, nthread)
    405         elif isinstance(data, DataTable):
    406             self._init_from_dt(data, nthread)

~\Anaconda3\lib\site-packages\xgboost\core.py in _init_from_npy2d(self, mat, missing, nthread)
    476         # we try to avoid data copies if possible (reshape returns a view when possible
    477         # and we explicitly tell np.array to try and avoid copying)
--> 478         data = np.array(mat.reshape(mat.size), copy=False, dtype=np.float32)
    479         handle = ctypes.c_void_p()
    480         missing = missing if missing is not None else np.nan

ValueError: setting an array element with a sequence

Any ideas on how to fix that?

Here is the code that produces the error. I defined a function to combine xgboost with kfold cv.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def boost(df, features):
    # Shuffle the rows once up front
    df = df.reindex(np.random.permutation(df.index))
    # Define y and X
    X = df[features].values
    y = df['output'].values

    # K-fold cross-validation and XGBoost
    RMSE = []
    # n_estimators (not "estimators") sets the number of boosting rounds
    model = xgb.XGBRegressor(max_depth=15, n_estimators=1000, learning_rate=0.1)
    # Rows are already shuffled above; random_state only applies when shuffle=True
    cv = KFold(n_splits=10, shuffle=False)

    for train_index, test_index in cv.split(X):
        print("Train Index: ", train_index)
        print("Test Index: ", test_index)
        X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
        model.fit(X_train, y_train)
        # Prediction
        y_pred = model.predict(X_test)
        pred = pd.DataFrame()
        pred["Prediction"] = y_pred
        # RMSE
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        print(rmse)
        RMSE.append(rmse)
    print(np.mean(RMSE))

When I pass my vectors into the boost function, the error occurs.

Answer 1

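This error is raised under the hood, inside NumPy, when xgboost converts X to a float32 array. You are trying to set a numpy array element to something that is not a number: each cell of DocVecs holds a whole sequence, so X ends up as an object array.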
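To see the mechanism in isolation, here is a minimal sketch that reproduces the same ValueError without xgboost. The toy dataframe and its values are made up for illustration; only the column name DocVecs and the float32 conversion come from the question's traceback.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one numeric column plus a column of vectors.
df = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0],
    "DocVecs": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

X = df[["var1", "DocVecs"]].values
# X has dtype object: every DocVecs cell is still a Python list.
print(X.dtype, X.shape)

try:
    # Mirrors the float32 conversion shown in the traceback.
    np.array(X.reshape(X.size), dtype=np.float32)
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```

The conversion fails because NumPy cannot place a list into a single float32 slot.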

One thing that may fix it is forcing all the X values to be floats, e.g., changing

X = df[features].values

to

X = [np.array(x).astype(np.float32) for x in df[features].values]

However, if the underlying issue is that some row of X isn't a flat vector, or a column value isn't a single number (as with DocVecs here, where each cell holds a whole vector), this cast alone will not fix it. In that case each vector would need to be spread across its own numeric columns before the cast.
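A sketch of one way to flatten the vectors, assuming every document vector has the same length. The helper name build_feature_matrix and the toy data are hypothetical; the idea is to stack the DocVecs cells into an (n_rows, dim) array and place it next to the numeric columns, so that X is a plain 2-D float array.

```python
import numpy as np
import pandas as pd

def build_feature_matrix(df, numeric_cols, vec_col="DocVecs"):
    """Return a 2-D float32 array: numeric columns, then one column per vector dimension."""
    numeric = df[numeric_cols].to_numpy(dtype=np.float32)
    # Each cell holds one equal-length vector; stacking them gives shape (n_rows, dim).
    vectors = np.array(df[vec_col].tolist(), dtype=np.float32)
    return np.hstack([numeric, vectors])

# Toy data (hypothetical): two numeric columns and 3-dimensional document vectors.
df = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0],
    "var2": [10, 20, 30],
    "DocVecs": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
})

X = build_feature_matrix(df, ["var1", "var2"])
print(X.shape, X.dtype)
```

In the question's function, this would replace the line X = df[features].values, with the numeric column names passed separately from DocVecs; the cross-validation loop should then work unchanged, since X is an ordinary numeric array.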
