Pandas: ValueError (any way to convert Sparse[float64, 0.0] dtypes to float64 datatype)

Question

I have a dataframe X_train to which I am concatenating two other dataframes. The second and third dataframes are obtained from sparse matrices generated by a TF-IDF vectorizer.

q1_train_df = pd.DataFrame.sparse.from_spmatrix(q1_tdidf_train,index=X_train.index,columns=q1_features)
q2_train_df = pd.DataFrame.sparse.from_spmatrix(q2_tdidf_train,index=X_train.index,columns=q2_features)
X_train_final  = pd.concat([X_train,q1_train_df,q2_train_df],axis=1)

The dtypes of X_train_final look like this:


X_train_final.dtypes

cwc_min                       float64
cwc_max                       float64
csc_min                       float64
csc_max                       float64
ctc_min                       float64
                         ...         
q2_zealand       Sparse[float64, 0.0]
q2_zero          Sparse[float64, 0.0]
q2_zinc          Sparse[float64, 0.0]
q2_zone          Sparse[float64, 0.0]
q2_zuckerberg    Sparse[float64, 0.0]
Length: 10015, dtype: object

I am using XGBoost to train on this final dataframe, and it throws an error when I try to fit the data:

model.fit( X_train_final,y_train)


ValueError: DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields q1_04, q1_10, q1_100, q

I think the error is due to the Sparse[float64, 0.0] dtypes present in it. Can you please help me out? I am not able to figure out how to get rid of this error.
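One way to confirm this diagnosis and convert the offending columns is sketched below. It is a minimal example on made-up data, not the asker's actual frame: the column names and values are illustrative, and it assumes a pandas version that provides pd.SparseDtype and the DataFrame.sparse accessor. Keep in mind that densifying a wide TF-IDF frame can use a lot of memory.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Illustrative stand-ins for X_train and one TF-IDF matrix (hypothetical data)
base = pd.DataFrame({"cwc_min": [0.5, 0.0, 1.0]})
tfidf = sparse.csr_matrix(np.array([[0.0, 0.3], [0.7, 0.0], [0.0, 0.0]]))
tfidf_df = pd.DataFrame.sparse.from_spmatrix(
    tfidf, index=base.index, columns=["q1_a", "q1_b"]
)
combined = pd.concat([base, tfidf_df], axis=1)

# Find the columns whose dtype is a SparseDtype
sparse_cols = [
    c for c in combined.columns if isinstance(combined[c].dtype, pd.SparseDtype)
]

# Densify only those columns, then restore the original column order
dense = pd.concat(
    [combined.drop(columns=sparse_cols), combined[sparse_cols].sparse.to_dense()],
    axis=1,
)
dense = dense[combined.columns]

print(dense.dtypes)
```

After this step every column has an ordinary float64 dtype, which is the kind of input the error message asks for.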

Answer 1

I actually just came across the exact same issue. I have a list of columns that were generated using a TF-IDF vectorizer, and I was attempting to use XGBoost on the dataset.

This ended up working for me:

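import pandas as pd
import xgboost as xgb

# Coerce every column to a numeric dtype; values that cannot be parsed become NaN
train_df = train_df.apply(pd.to_numeric, errors='coerce')

# Convert the sparse TF-IDF columns to ordinary dense columns
train_df[tf_idf_column_names] = train_df[tf_idf_column_names].sparse.to_dense()

# The first column is the label; the remaining columns are the features
train_x = train_df.iloc[:, 1:]
train_y = train_df.iloc[:, :1]

dtrain = xgb.DMatrix(data=train_x, label=train_y)

# 'verbosity': 0 replaces the deprecated 'silent': 1
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}
num_round = 2

bst = xgb.train(param, dtrain, num_round)

# dtest is not defined in this answer; it is assumed to be an xgb.DMatrix
# built from the test set in the same way as dtrain
preds = bst.predict(dtest)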
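Before calling to_dense on a wide TF-IDF frame, it may be worth estimating how much memory the dense copy would need. The sketch below uses a randomly generated matrix with an arbitrary shape and density (both are assumptions for illustration, not the sizes from this question) and compares the sparse footprint with the size of a dense float64 array of the same shape.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical TF-IDF-like matrix: 1000 rows, 500 columns, 1% non-zero entries
mat = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
sparse_df = pd.DataFrame.sparse.from_spmatrix(mat)

# Bytes actually held by the sparse columns (stored values plus their indices)
sparse_bytes = int(sparse_df.memory_usage(index=False).sum())

# Bytes a dense float64 copy would need: rows * columns * 8
dense_bytes_estimate = mat.shape[0] * mat.shape[1] * np.dtype("float64").itemsize

print(f"sparse: {sparse_bytes} bytes, dense estimate: {dense_bytes_estimate} bytes")
```

If the dense estimate is close to or above the available RAM, keeping the data sparse is likely the safer route.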

Answer 2

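from scipy.sparse import hstack

# Each x_tr_* block is one X_train column converted to an ndarray
# (hstack needs 2-D blocks, so presumably each has shape (n_rows, 1)).
# The two TF-IDF matrices are stacked as-is, so the result stays sparse.
X_train_final = hstack(blocks=(x_tr_cwc_min,
                               x_tr_cwc_max,
                               x_tr_csc_min,
                               x_tr_csc_max,
                               x_tr_ctc_min,
                               x_tr_ctc_max,
                               x_tr_last_word_eq,
                               x_tr_first_word_eq,
                               x_tr_abs_len_diff,
                               x_tr_mean_len,
                               x_tr_token_set_ratio,
                               x_tr_token_sort_ratio,
                               x_tr_fuzz_ratio,
                               x_tr_fuzz_partial_ratio,
                               x_tr_longest_substr_ratio,
                               q1_tdidf_train, q2_tdidf_train
                               )
                       ).tocsr()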

Here, instead of using the X_train dataframe directly, I took the individual columns of X_train and converted each of them to an ndarray. to_dense worked, but for the dataframe I used it consumed almost 3 GB of space, so I had to go with this approach.
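A more compact variant of the same idea is sketched below, under the assumption that every X_train column is numeric: the whole dense frame is converted to one sparse block instead of column by column, then stacked with the TF-IDF matrices. The data and feature names are invented for illustration. XGBoost accepts SciPy CSR matrices as input, so the stacked result can be passed to it without densifying; column names are lost in the process, so the sketch keeps them in a separate list.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Illustrative stand-ins for X_train and the two TF-IDF matrices (hypothetical data)
X_dense = pd.DataFrame({"cwc_min": [0.5, 0.0, 1.0], "cwc_max": [0.2, 0.4, 0.0]})
q1 = sparse.csr_matrix(np.array([[0.0, 0.3], [0.7, 0.0], [0.0, 0.0]]))
q2 = sparse.csr_matrix(np.array([[0.1, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.9, 0.0]]))

# Convert the dense feature frame to a sparse block in one step
dense_block = sparse.csr_matrix(X_dense.to_numpy(dtype=np.float64))

# Stack all blocks side by side; the result is a single CSR matrix
X_final = sparse.hstack([dense_block, q1, q2], format="csr")

# Keep the column names separately, since a sparse matrix has none
feature_names = list(X_dense.columns) + ["q1_a", "q1_b"] + ["q2_a", "q2_b", "q2_c"]

print(X_final.shape, X_final.nnz)
```

The row order of every block must match, because hstack aligns rows by position rather than by index.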

Answer 1: 0, 2020-02-07 23:51:00

Answer 2: 0, accepted, 2020-02-10 06:37:42
