
How to not standardize target data in scikit-learn regression

I am trying to predict future profit from a dataset of copper mining company data in CSV format.

I read the data:

data = pd.read_csv('data.csv')

I split the data:

data_target = data[target].astype(float)
data_used = data.drop(['Periodo', 'utilidad_operativa_dolar'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(data_used, data_target, test_size=0.4,random_state=33)

Create an SVR predictor:

clf_svr= svm.SVR(kernel='rbf')

Standardize the data:

from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(x_train)
scalery = StandardScaler().fit(y_train)

x_train = scalerX.transform(x_train)
y_train = scalery.transform(y_train)
x_test = scalerX.transform(x_test)
y_test = scalery.transform(y_test)

print(np.max(x_train), np.min(x_train), np.mean(x_train), np.max(y_train), np.min(y_train), np.mean(y_train))

Then predict:

clf_svr.fit(x_train, y_train)
y_pred = clf_svr.predict(x_test)

The predicted data is standardized as well. I want the predictions to be in the original units. How can I do that?

You would want to use the inverse_transform method of your y-scaler. Note that you can do all this more concisely using a pipeline, as follows:

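import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# The pipeline scales the features and fits the SVR in one step
pipeline = Pipeline([('scaler', StandardScaler()), ('estimator', SVR(kernel="rbf"))])

# StandardScaler expects 2-D input, so reshape the 1-D target to a single column
y_scaler = StandardScaler()
y_train = y_scaler.fit_transform(np.asarray(y_train).reshape(-1, 1)).ravel()
pipeline.fit(x_train, y_train)
# Map the predictions back to the original units of the target
y_pred = y_scaler.inverse_transform(pipeline.predict(x_test).reshape(-1, 1)).ravel()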
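
To see what inverse_transform does in isolation, here is a minimal sketch of the round trip on a made-up target vector. The values are invented for illustration and are not from the asker's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values standing in for the profit column (assumption).
y = np.array([120.0, 95.5, 143.2, 110.8, 130.1])

y_scaler = StandardScaler()
# StandardScaler works on 2-D arrays, so reshape to a single column first.
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

# inverse_transform undoes the scaling (multiplies by the fitted std, adds the fitted mean).
y_restored = y_scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(np.allclose(y_restored, y))
```

The same y_scaler object must be used for both directions, since it stores the mean and standard deviation learned from the training target.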

Many would just scale the target globally and get away without too much overfitting. But you are doing well not to fall for this. AFAIK, using a separate scaler for the y data, as shown in the code, is the only way to go.

I know this question is old and the answer was correct at the time, but scikit-learn now has a built-in way of doing this:

http://scikit-learn.org/dev/modules/compose.html#transforming-target-in-regression
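
The linked section describes TransformedTargetRegressor from sklearn.compose. A minimal sketch of how it might be applied here, assuming synthetic data in place of the mine dataset (the feature matrix X and target y below are made up):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data, for illustration only (assumption).
rng = np.random.RandomState(33)
X = rng.uniform(0, 10, size=(200, 3))
y = 1000.0 + 50.0 * X[:, 0] + rng.normal(scale=5.0, size=200)

# The inner pipeline scales the features; the wrapper scales the target
# before fitting and inverts that scaling inside predict().
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)
model.fit(X[:150], y[:150])

# Predictions come back in the original units of y.
y_pred = model.predict(X[150:])
```

With this wrapper there is no separate y-scaler to keep track of, so the inverse transform cannot be forgotten at prediction time.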

As others have already mentioned, you should use the inverse_transform() method to recover the original data from the transformation applied earlier. Another point to ponder: why transform the target (y_train, y_test) at all if the intention is to predict the real target values? The target could just as well be left in its original state during prediction.

Also (in Python 3.7.3, sklearn 0.20.3), when you standardize single-column data such as y_train and y_test as you have done above, the output comes back as a NumPy array, which does not help with DataFrame operations.


When you specify that the output should be a single-column DataFrame, you may run into further issues.


SOLUTION: Explicitly pass the target column name/index in a list, using the proper subset selection operators (.loc/.iloc).
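
A minimal sketch of that selection, assuming a hypothetical column named profit with made-up values. Passing the column name inside a list keeps the result two-dimensional, which is what StandardScaler expects:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column target (name and values are assumptions).
df = pd.DataFrame({"profit": [120.0, 95.5, 143.2, 110.8]})

y_series = df["profit"]           # 1-D Series, shape (4,)
y_frame = df.loc[:, ["profit"]]   # 2-D DataFrame, shape (4, 1)

scaler = StandardScaler()
# By default the scaler returns a NumPy array, not a DataFrame.
scaled = scaler.fit_transform(y_frame)

# Wrap the array back into a DataFrame to keep the index and column name.
y_scaled = pd.DataFrame(scaled, index=y_frame.index, columns=y_frame.columns)
print(y_scaled.shape)
```

The same selection works positionally with df.iloc[:, [0]].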


NOTE: In live ML projects, test data is something that arrives in the future or is collected live once your model is ready for tuning during the productionizing stage.

A standardized set of train/test features such as X_train and X_test makes it easy to compare how each feature varies from its mean. It is also useful for regularization and principal component analysis, which require standardized feature variables.
