Undersampling/Oversampling issues with one-hot-encoded categorical data
I am trying to fit a classifier on a dataset with a heavily imbalanced target: roughly 40,000 samples of class 0 versus 400 of class 1. I am experimenting with oversampling and undersampling (the latter is not preferred), but I keep running into issues.
Error: Shape of passed values is (34372, 1), indices imply (34372, 36)
258 print("Before undersampling X_train:",X_train.shape[0])
259
--> 260 X_train,y_train=ros(X_train,y_train) #change this to ro_smote for oversampling
261 print("After undersampling/oversampling X_train:",X_train.shape[0])
262 X_train[label_fg] = y_train
/tmp/tmpta5bmz69.py in ros(X_train, y_train)
131 def ros(X_train,y_train):
132 ros = RandomOverSampler(random_state=1,sampling_strategy = 0.25) #sampling-stragey- 0.25,0.5,1,0.75
--> 133 X_train_on, y_train_on = ros.fit_resample(X_train, y_train)
134
135 return X_train_on,y_train_on
/databricks/python/lib/python3.8/site-packages/imblearn/base.py in fit_resample(self, X, y)
87 )
88
---> 89 X_, y_ = arrays_transformer.transform(output[0], y_)
90 return (X_, y_) if len(output) == 2 else (X_, y_, output[2])
91
/databricks/python/lib/python3.8/site-packages/imblearn/utils/_validation.py in transform(self, X, y)
38
39 def transform(self, X, y):
---> 40 X = self._transfrom_one(X, self.x_props)
41 y = self._transfrom_one(y, self.y_props)
42 return X, y
/databricks/python/lib/python3.8/site-packages/imblearn/utils/_validation.py in _transfrom_one(self, array, props)
57 import pandas as pd
58
---> 59 ret = pd.DataFrame(array, columns=props["columns"])
60 ret = ret.astype(props["dtypes"])
61 elif type_ == "series":
/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
582 mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
583 else:
--> 584 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
585 else:
586 mgr = init_dict({}, index, columns, dtype=dtype)
/databricks/python/lib/python3.8/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
236 block_values = [values]
237
--> 238 return create_block_manager_from_blocks(block_values, [columns, index])
239
240
/databricks/python/lib/python3.8/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
1685 blocks = [getattr(b, "values", b) for b in blocks]
1686 tot_items = sum(b.shape[0] for b in blocks)
-> 1687 raise construction_error(tot_items, blocks[0].shape[1:], axes, e)
1688
1689
ValueError: Shape of passed values is (34372, 1), indices imply (34372, 36)
Thu Aug 25 14:52:24 2022 Python shell started with PID 4674 and guid b28118c68bbf497ea6029cc003bff481
Please note that I have one-hot encoded my categorical dataset, which resulted in 36 features, and I have cast them to int.
Am I missing something here?
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# feature_engg is my own preprocessing step (it does the one-hot encoding)
preped_data = feature_engg(preped_data)
preped_data = preped_data.astype(int)

def ros(X_train, y_train):
    # Oversample the minority class until minority/majority = 0.25
    ros = RandomOverSampler(random_state=1, sampling_strategy=0.25)
    X_train_on, y_train_on = ros.fit_resample(X_train, y_train)
    return X_train_on, y_train_on

label_fg = 'churn_fg'
X_train, X_test, y_train, y_test = train_test_split(
    preped_data.drop(label_fg, axis=1), preped_data[label_fg], stratify=preped_data[label_fg],
    shuffle=True, test_size=0.3, random_state=42)
print("Before undersampling X_train columns:", X_train.columns)
print("Before undersampling X_train:", X_train.shape[0])
X_train, y_train = ros(X_train, y_train)
I ran into the same issue after one-hot encoding. It usually happens because the encoder returns sparse output, so the encoded columns end up with a sparse dtype (run df.info() to check this). To work around it, I did the following after one-hot encoding:
# oh_cols is the list of one-hot-encoded column names
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')
X_train[oh_cols] = X_train[oh_cols].sparse.to_dense()
X_test[oh_cols] = X_test[oh_cols].sparse.to_dense()
where oh_cols is the list of columns that were one-hot encoded.
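For anyone who wants to try the dense conversion without listing the encoded columns by hand, here is a minimal sketch. It assumes the features live in a pandas DataFrame with unique column names and that the sparse columns came from something like pd.get_dummies(..., sparse=True). The helper name densify_sparse_columns and the toy data are hypothetical and only for illustration. Whether sparse columns are actually the cause of the shape error in the question is not confirmed, so check the dtypes first.

```python
import pandas as pd


def densify_sparse_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every sparse-dtype column converted to dense."""
    out = df.copy()
    for col in out.columns:
        # Only touch columns whose dtype is a pandas SparseDtype.
        if isinstance(out[col].dtype, pd.SparseDtype):
            out[col] = out[col].sparse.to_dense()
    return out


# Toy frame standing in for the real features (hypothetical data).
raw = pd.DataFrame({"plan": ["a", "b", "a", "c"], "tenure": [1, 5, 3, 2]})
encoded = pd.get_dummies(raw, columns=["plan"], sparse=True)

# Densify first, then cast to int as in the question.
dense = densify_sparse_columns(encoded).astype(int)
print(dense.dtypes)
```

The resulting frame could then be passed to fit_resample in place of the original X_train. If df.info() shows no sparse columns to begin with, this step changes nothing and the cause lies elsewhere.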