
R: subtract the mean and divide by the standard deviation on a few variables

I am trying to standardize certain columns within a data frame, not all columns. By standardizing I mean subtracting the mean and dividing by the standard deviation. How can I do this for the values in columns 1, 2, 4 and 6 only, assuming I am working with the mtcars dataset (data(mtcars))?

I can do this manually, but I am curious to know if there is an efficient way of doing this.

scale() does this for you:

df<-mtcars
df[,c(1,2,4,6)]<-scale(df[,c(1,2,4,6)])

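This keeps the other variables unchanged. scale() returns a matrix that carries the column means and standard deviations as the attributes "scaled:center" and "scaled:scale"; save that result if you want to use them to reverse the process.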
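
As a rough sketch of reversing the scaling: the attributes live on the matrix returned by scale(), not on the data frame it is assigned into, so the result has to be kept in its own variable. The names cols, scaled, centers, scales and restored are illustrative only.

```r
df <- mtcars
cols <- c(1, 2, 4, 6)

# Keep the scale() result: its attributes are not carried over into df
scaled <- scale(df[, cols])
df[, cols] <- scaled

centers <- attr(scaled, "scaled:center")  # column means
scales  <- attr(scaled, "scaled:scale")   # column standard deviations

# Reverse column by column: multiply by the sd, then add the mean back
restored <- sweep(sweep(as.matrix(df[, cols]), 2, scales, `*`), 2, centers, `+`)

# Compare against the original values, ignoring attributes
all.equal(restored, as.matrix(mtcars[, cols]), check.attributes = FALSE)
```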

mt <- mtcars
str(mt)
# 'data.frame': 32 obs. of  11 variables:
#  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The trick is to subset it both in the *apply call and in the reassignment (left of the <- or =).

mysd <- 3 # something important

mt[c(1,2,4,6)] <- lapply(mt[c(1,2,4,6)], `+`, mysd)
str(mt)
# 'data.frame': 32 obs. of  11 variables:
#  $ mpg : num  24 24 25.8 24.4 21.7 21.1 17.3 27.4 25.8 22.2 ...
#  $ cyl : num  9 9 7 9 11 9 11 7 7 9 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  113 113 96 113 178 108 248 65 98 126 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  5.62 5.88 5.32 6.21 6.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
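
The `+` above is only a stand-in. A minimal sketch of the same pattern applied to the standardization asked about in the question could look like this; standardize is a hypothetical helper, not a base R function.

```r
mt <- mtcars
cols <- c(1, 2, 4, 6)

# Hypothetical helper: subtract the mean, divide by the standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

# Subset on both sides so that only the chosen columns are touched
mt[cols] <- lapply(mt[cols], standardize)

sapply(mt[cols], mean)  # approximately 0 for each selected column
sapply(mt[cols], sd)    # 1 for each selected column
```

If the columns may contain missing values, mean() and sd() would need na.rm = TRUE, which this sketch leaves out.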

Note that the return value of lapply is a list, not a data.frame. It often behaves closely enough to one, but you can wrap it as as.data.frame(lapply(...)) to get back the original class.
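
A short check of the classes involved, as a sketch (res is an illustrative name):

```r
res <- lapply(mtcars[c(1, 2, 4, 6)], `+`, 3)
class(res)                 # "list"
class(as.data.frame(res))  # "data.frame"

# Assigning the list into a column subset leaves the data frame a data frame
mt <- mtcars
mt[c(1, 2, 4, 6)] <- res
class(mt)                  # "data.frame"
```

In the reassignment form used above, the conversion is therefore not needed; it matters only when the lapply result is used on its own.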

A popular method of doing a single modification to multiple columns is to form a logical vector (can be safer than integers), such as this over-simplified example. The use of the vector makes the subsequent reassignment arguably easier to read.

vec <- sapply(mt, function(x) min(x)>10)
mt[vec] <- lapply(mt[vec], `+`, mysd)

(Using integers becomes less predictable/robust if the vector of integers includes anything below 1 or above the number of columns. It works fine with integer(0), so feel free to use ints if desired.)
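
To make the integer caveat concrete, here is a small sketch of the edge cases on mtcars, plus a logical vector built from column names as one possible alternative. The column names are those of the question's columns 1, 2, 4 and 6; res and vec are illustrative names.

```r
mt <- mtcars

ncol(mt[c(0, 1)])     # the 0 is silently dropped, leaving one column
ncol(mt[integer(0)])  # zero columns, no error

# An index past the last column is an error rather than an empty result
res <- tryCatch(mt[12], error = function(e) "error")

# A logical vector always has exactly one entry per column
vec <- names(mt) %in% c("mpg", "cyl", "hp", "wt")
which(vec)            # positions of the selected columns
```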

One nice side-effect of this is that if the function is "expensive" (time or resources), then it only operates on the relevant columns. If nothing is selected, nothing is done.

vec <- sapply(mt, function(x) min(x) > 300)
any(vec)
# [1] FALSE
system.time( mt[vec] <- lapply(mt[vec], function(x) { Sys.sleep(100); x+1; }) )
#    user  system elapsed 
#       0       0       0 
