
How do I standardize only int64 columns after train-test split?

I have a dataframe ready for modelling. It contains continuous variables and one-hot-encoded variables:

ID   Limit   Bill_Sep  Bill_Aug  Payment_Sep   Payment_Aug   Gender_M   Gender_F  Edu_Uni DEFAULT_PAYMT
1    10000   2000      350       1000          350           1          0         1          1
2    30000   3000      5000      500           500           0          1         0          0
3    20000   8000      10000     8000          5000          1          0         1          1
4    45000   450       250       450           250           0          1         0          1
5    60000   700       1000      700           1000          1          0         1          1
6    8000    300       5000      300           2000          1          0         1          0
7    30000   3000      10000     1000          5000          0          1         1          1
8    15000   1000      1250      500           1750          0          1         1          1

All the continuous variables are 'int64', while the one-hot-encoded variables are 'uint8'. The binary outcome variable is DEFAULT_PAYMT.

I followed the usual train-test split approach below, but is it possible to apply StandardScaler only to the int64 variables (i.e., the variables that were not one-hot-encoded)?

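from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: every column except the ID and the target
featurelist = df.drop(['ID', 'DEFAULT_PAYMT'], axis=1)
X = featurelist
y = df['DEFAULT_PAYMT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Fit on the training set only; note that this scales every column,
# including the one-hot-encoded ones
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)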

I am trying the following code, which seems to work. However, I am not sure how to merge the categorical variables (which were not scaled) back into the X_scaled_tr and X_scaled_t arrays. I appreciate any help, thank you!

featurelist = df.drop(['ID','DEFAULT_PAYMT'],axis = 1)
X = featurelist
y = df['DEFAULT_PAYMT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

sc = StandardScaler()
X_scaled_tr = X_train.select_dtypes(include=['int64'])
X_scaled_t = X_test.select_dtypes(include=['int64'])

X_scaled_tr = sc.fit_transform(X_scaled_tr)
X_scaled_t = sc.transform(X_scaled_t)
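One possible way to get the unscaled one-hot columns back alongside the scaled ones is to keep everything in a DataFrame and overwrite only the int64 columns with their scaled values. The sketch below is not from the original post: the toy frame is rebuilt from the table in the question, and the names `num_cols`, `onehot_cols`, `X_train_scaled` and `X_test_scaled` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame rebuilt from the table in the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Limit": [10000, 30000, 20000, 45000, 60000, 8000, 30000, 15000],
    "Bill_Sep": [2000, 3000, 8000, 450, 700, 300, 3000, 1000],
    "Bill_Aug": [350, 5000, 10000, 250, 1000, 5000, 10000, 1250],
    "Payment_Sep": [1000, 500, 8000, 450, 700, 300, 1000, 500],
    "Payment_Aug": [350, 500, 5000, 250, 1000, 2000, 5000, 1750],
    "Gender_M": [1, 0, 1, 0, 1, 1, 0, 0],
    "Gender_F": [0, 1, 0, 1, 0, 0, 1, 1],
    "Edu_Uni": [1, 0, 1, 0, 1, 1, 1, 1],
    "DEFAULT_PAYMT": [1, 0, 1, 1, 1, 0, 1, 1],
}).astype("int64")

# Mimic the dtypes described in the question: one-hot columns are uint8.
onehot_cols = ["Gender_M", "Gender_F", "Edu_Uni"]
df[onehot_cols] = df[onehot_cols].astype("uint8")

X = df.drop(["ID", "DEFAULT_PAYMT"], axis=1)
y = df["DEFAULT_PAYMT"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Names of the continuous (int64) columns only.
num_cols = list(X_train.select_dtypes(include=["int64"]).columns)

# Work on float copies so the scaled values fit the column dtype.
X_train_scaled = X_train.astype({c: "float64" for c in num_cols})
X_test_scaled = X_test.astype({c: "float64" for c in num_cols})

# Fit on the training rows only, then overwrite just the int64 columns.
sc = StandardScaler()
X_train_scaled[num_cols] = sc.fit_transform(X_train[num_cols])
X_test_scaled[num_cols] = sc.transform(X_test[num_cols])
```

Because the one-hot columns are never touched, the result keeps the original column order and index, so no separate merge step is needed.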

I managed to solve this with the following code, where StandardScaler is applied only to the continuous variables and NOT to the one-hot-encoded variables:

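from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Scale only the listed columns; remainder='passthrough' keeps the rest unscaled.
# Column names are case-sensitive; they are spelled here as in the table above.
ct = ColumnTransformer(
    [('X_train', StandardScaler(), ['Limit', 'Bill_Sep', 'Bill_Aug', 'Payment_Sep', 'Payment_Aug'])],
    remainder='passthrough')

X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)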
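As a possible variation on the ColumnTransformer approach, the columns can be picked by dtype instead of by name using `make_column_selector`, which matches the question's framing of "only the int64 columns". This is a minimal sketch on a small hypothetical frame (the values and the names `out_cols` and `train_out` are assumptions, not part of the original answer). It also shows one thing worth knowing about the result: the output is a NumPy array in which the transformed columns come first and the passthrough columns follow, so the column order can differ from the input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler

# Hypothetical frames: two int64 continuous columns around one uint8 one-hot column.
X_train = pd.DataFrame({
    "Limit": np.array([10000, 30000, 20000, 45000], dtype="int64"),
    "Gender_M": np.array([1, 0, 1, 0], dtype="uint8"),
    "Bill_Sep": np.array([2000, 3000, 8000, 450], dtype="int64"),
})
X_test = pd.DataFrame({
    "Limit": np.array([60000, 8000], dtype="int64"),
    "Gender_M": np.array([1, 1], dtype="uint8"),
    "Bill_Sep": np.array([700, 300], dtype="int64"),
})

# Select the columns to scale by dtype rather than by name.
ct = ColumnTransformer(
    [("scale", StandardScaler(), make_column_selector(dtype_include="int64"))],
    remainder="passthrough",
)

X_train_scaled = ct.fit_transform(X_train)  # NumPy array
X_test_scaled = ct.transform(X_test)

# Output order: scaled int64 columns first, then the passthrough columns.
out_cols = (
    list(X_train.select_dtypes(include="int64").columns)
    + list(X_train.select_dtypes(exclude="int64").columns)
)

# Optional: wrap the array back into a labelled DataFrame.
train_out = pd.DataFrame(X_train_scaled, columns=out_cols, index=X_train.index)
```

Selecting by dtype avoids hard-coding column names, at the cost of relying on the one-hot columns really having a different dtype (here 'uint8') from the continuous ones.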
