Meaning of preProc = c("center", "scale") in the caret package (R), and min-max normalization

Question

I am wondering how preProc can be used within the train() function of caret. I am fitting a neural network through train() with method = "neuralnet". The code comes from this question.

This is the code:

nn <- train(medv ~ ., 
            data = df, 
            method = "neuralnet", 
            tuneGrid = grid,
            metric = "RMSE",
            preProc = c("center", "scale", "nzv"), #good idea to do this with neural nets - your error is due to non scaled data
            trControl = trainControl(
              method = "cv",
              number = 5,
              verboseIter = TRUE)
            )

The original data is not scaled, so it is recommended to scale the data before running the neural network.

However, the preProc argument contains three elements: center, scale, and nzv. I am having trouble interpreting these values, as I do not know why they are there. Furthermore, I would like to scale/normalize my data using min-max. This would be the function:

maxs = apply(pk_dc_only$C, 2, max)
mins = apply(pk_dc_only$C, 2, min)
scaled = as.data.frame(scale(df, center = mins, scale = maxs - mins))

Is it possible to normalize my data using min-max scaling within preProc?

And if so, how could I undo the scaling when predicting?

Answer 1

Of the three options in c("center", "scale", "nzv"), the first two center and scale the predictors. From the documentation:

method = "center" subtracts the mean of the predictor's data (again from the data in x) from the predictor values while method = "scale" divides by the standard deviation.
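To make the quoted description concrete, here is a minimal base R sketch of what centering and scaling do to a single predictor. The vector x is a made-up toy example, not data from the post, and the comparison against base R's scale() is only a sanity check; it is not meant to reproduce caret's internals.

```r
# Toy predictor (hypothetical values, not from the original post)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# "center": subtract the mean; "scale": divide by the standard deviation
centered     <- x - mean(x)
standardized <- centered / sd(x)

# Base R's scale() applies the same two steps by default
reference <- as.numeric(scale(x))

# The two results agree, and the result has mean 0 and standard deviation 1
stopifnot(isTRUE(all.equal(standardized, reference)))
print(round(c(mean = mean(standardized), sd = sd(standardized)), 10))
```

The mean and standard deviation are taken from the training data, so the same two numbers are reused when new data is transformed.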

"nzv" removes predictors that have near-zero variance, meaning they are almost constant and most likely not useful for prediction. For min-max scaling, use the "range" method, which the documentation describes as follows:

The "range" transformation scales the data to be within 'rangeBounds'. If new samples have values larger or smaller than those in the training set, values will be outside of this range.

We try it below:
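The second sentence of that description is easy to miss, so here is a small base R sketch of min-max scaling with bounds of 0 and 1. The vectors train_x and new_x and the helper rescale() are hypothetical names introduced for illustration; the sketch shows the arithmetic only and does not call caret.

```r
# Hypothetical training and new values for one predictor
train_x <- c(10, 12, 15, 20)
new_x   <- c(8, 15, 25)

# The min-max parameters come from the training data only
lo <- min(train_x)
hi <- max(train_x)

# Illustrative helper: map lo to 0 and hi to 1
rescale <- function(v, lo, hi) (v - lo) / (hi - lo)

train_scaled <- rescale(train_x, lo, hi)  # lies within [0, 1]
new_scaled   <- rescale(new_x, lo, hi)    # 8 and 25 fall outside [0, 1]

print(range(train_scaled))
print(new_scaled)
```

Values in new_x below the training minimum map to negative numbers and values above the training maximum map to numbers greater than 1, which is the behaviour the quoted text warns about.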

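library(mlbench)
library(caret)
data(BostonHousing)

# Use 400 random rows for training; the rest are held out for testing
idx = sample(nrow(BostonHousing), 400)
df = BostonHousing[idx, ]
df$chas = as.numeric(df$chas)  # factor to numeric so it can be rescaled

# Learn the min and max of every column (including medv) from the training rows
pre_mdl = preProcess(df, method = "range")

# Assumed tuning grid: G was not defined in the original post
G = expand.grid(layer1 = 5, layer2 = 0, layer3 = 0)

nn <- train(medv ~ ., data = predict(pre_mdl, df),
            method = "neuralnet", tuneGrid = G,
            metric = "RMSE",
            trControl = trainControl(
              method = "cv", number = 5, verboseIter = TRUE))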

nn$preProcess
Created from 400 samples and 13 variables

Pre-processing:
  - ignored (0)
  - re-scaling to [0, 1] (13)

summary(nn$finalModel$data)


          crim                zn             indus             chas       
 Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.000821   1st Qu.:0.0000   1st Qu.:0.1646   1st Qu.:0.0000  
 Median :0.002454   Median :0.0000   Median :0.2969   Median :0.0000  
 Mean   :0.042130   Mean   :0.1309   Mean   :0.3804   Mean   :0.0625  
 3rd Qu.:0.039150   3rd Qu.:0.2000   3rd Qu.:0.6466   3rd Qu.:0.0000  
 Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
      nox               rm              age              dis         
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.1276   1st Qu.:0.4470   1st Qu.:0.4032   1st Qu.:0.08522  
 Median :0.2819   Median :0.5076   Median :0.7503   Median :0.20133  
 Mean   :0.3363   Mean   :0.5232   Mean   :0.6647   Mean   :0.25146  
 3rd Qu.:0.4918   3rd Qu.:0.5880   3rd Qu.:0.9361   3rd Qu.:0.38622  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
      rad              tax            ptratio             b         
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.1304   1st Qu.:0.1770   1st Qu.:0.5106   1st Qu.:0.9475  
 Median :0.1739   Median :0.2729   Median :0.6862   Median :0.9861  
 Mean   :0.3676   Mean   :0.4171   Mean   :0.6243   Mean   :0.8987  
 3rd Qu.:1.0000   3rd Qu.:0.9141   3rd Qu.:0.8085   3rd Qu.:0.9983  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     lstat           .outcome     
 Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.1492   1st Qu.:0.2683  
 Median :0.2705   Median :0.3644  
 Mean   :0.3069   Mean   :0.3902  
 3rd Qu.:0.4220   3rd Qu.:0.4450  
 Max.   :1.0000   Max.   :1.0000

I am not sure what you mean by "undo the scaling when predicting". Perhaps you mean translating the predictions back to the original scale. First, apply the same transformation to the test set:

test = BostonHousing[-idx,]
test$chas = as.numeric(test$chas)
test_medv = test$medv
test = predict(pre_mdl,test)

The ranges are stored in the preProcess object, under ranges:

pre_mdl$ranges
         crim  zn indus chas   nox    rm   age     dis rad tax ptratio      b
[1,]  0.00632   0  0.46    1 0.385 3.561   2.9  1.1691   1 187    12.6   0.32
[2,] 88.97620 100 27.74    2 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
     lstat medv
[1,]  1.73    5
[2,] 36.98   50

So we write a wrapper:

# Map values scaled to [0, 1] back to the original units of `column`
convert_response = function(value, mdl, method, column) {
  bounds = mdl[[method]][, column]  # c(min, max) learned from the training data
  value * diff(bounds) + min(bounds)
}

plot(test_medv,
     convert_response(predict(nn, test), pre_mdl, "ranges", "medv"),
     ylab = "predicted")
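As a quick check that the wrapper really inverts the [0, 1] scaling, the sketch below runs a round trip without fitting any model. The object toy_mdl is a hypothetical stand-in for pre_mdl that holds only the medv bounds (5 and 50, as in the ranges output above), and the medv values are made up.

```r
# Stand-in for pre_mdl$ranges: row 1 holds the minimum, row 2 the maximum.
# toy_mdl is a hypothetical object; only the medv bounds come from the post.
toy_mdl <- list(ranges = matrix(c(5, 50), nrow = 2,
                                dimnames = list(NULL, "medv")))

# Same wrapper as above, repeated so this block runs on its own
convert_response <- function(value, mdl, method, column) {
  bounds <- mdl[[method]][, column]
  value * diff(bounds) + min(bounds)
}

# Scale some made-up responses to [0, 1], then map them back
medv     <- c(5, 21.2, 36.5, 50)
scaled   <- (medv - 5) / (50 - 5)
restored <- convert_response(scaled, toy_mdl, "ranges", "medv")

stopifnot(isTRUE(all.equal(restored, medv)))
print(restored)
```

The wrapper assumes the scaling bounds were 0 and 1; with different rangeBounds the inverse would need to account for them as well.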
