
R: subtract mean and divide by standard deviation on a few variables

I am trying to standardize certain columns within a data frame, not all columns. By standardizing I mean subtracting the mean and dividing by the standard deviation. How can I standardize the values in only columns 1, 2, 4 and 6, assuming I am working with the mtcars dataset (data(mtcars))?

I can do this manually, but I am curious to know if there is an efficient way of doing this.

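scale does this for you. Assigning its result back to only the selected columns keeps the other variables unchanged. scale returns the column means and standard deviations as attributes ("scaled:center" and "scaled:scale") that you can use to reverse the process.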
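The code for this answer did not survive on this page, so the following is only a sketch of what such a call could look like, not necessarily the answerer's exact code. The names dat, cols, scaled, centers, sds and restored are assumptions made for illustration. The result of scale is kept in its own variable because the attributes live on the returned matrix, not on the data frame columns after assignment.

```r
# Sketch: standardize columns 1, 2, 4 and 6 of a copy of mtcars with scale().
dat  <- mtcars
cols <- c(1, 2, 4, 6)

# scale() returns a matrix; keep it so its attributes remain available.
scaled    <- scale(dat[cols])
dat[cols] <- scaled              # the other columns are left unchanged

centers <- attr(scaled, "scaled:center")  # column means used for centering
sds     <- attr(scaled, "scaled:scale")   # column standard deviations

# Reverse the process: multiply each column by its sd, then add its mean.
restored <- sweep(sweep(as.matrix(dat[cols]), 2, sds, `*`), 2, centers, `+`)
```

Comparing restored with as.matrix(mtcars[cols]) using all.equal is one way to check the round trip.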

mt <- mtcars
str(mt)
# 'data.frame': 32 obs. of  11 variables:
#  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The trick is to subset the data frame both inside the *apply call and in the reassignment (the left-hand side of <- or =).

mysd <- 3 # something important

mt[c(1,2,4,6)] <- lapply(mt[c(1,2,4,6)], `+`, mysd)
str(mt)
# 'data.frame': 32 obs. of  11 variables:
#  $ mpg : num  24 24 25.8 24.4 21.7 21.1 17.3 27.4 25.8 22.2 ...
#  $ cyl : num  9 9 7 9 11 9 11 7 7 9 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  113 113 96 113 178 108 248 65 98 126 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  5.62 5.88 5.32 6.21 6.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Note that lapply returns a list, not a data.frame. Assigning it into a subset of mt as above keeps mt a data.frame, and a list often behaves similarly enough anyway, but if you need the result on its own you can wrap it with as.data.frame(lapply(...)) to return it to the original class.
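Applied to the original question, the same subset-and-reassign pattern could be used with a standardizing function in place of the addition. This is a minimal sketch; the helper name standardize and the variable cols are assumptions introduced here, not part of the answer above.

```r
# Sketch: standardize only columns 1, 2, 4 and 6 using lapply().
mt   <- mtcars
cols <- c(1, 2, 4, 6)

# Hypothetical helper: subtract the mean, divide by the standard deviation.
standardize <- function(x) (x - mean(x)) / sd(x)

# Subset on both sides so only the chosen columns are replaced.
mt[cols] <- lapply(mt[cols], standardize)
```

Because the assignment targets a subset of mt, the object remains a data.frame and the untouched columns keep their original values.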

A popular way to apply a single modification to multiple columns is to build a logical vector (which can be safer than integers), as in this over-simplified example. Using the vector makes the subsequent reassignment arguably easier to read.

vec <- sapply(mt, function(x) min(x)>10)
mt[vec] <- lapply(mt[vec], `+`, mysd)

(Using integers becomes less predictable/robust if the vector of integers includes anything below 1 or above the number of columns. It works fine with integer(0), so feel free to use integers if desired.)
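For the columns in the question, one way to get such a logical vector is to match on column names. The sketch below assumes the wanted columns are mpg, cyl, hp and wt (columns 1, 2, 4 and 6 of mtcars); the name wanted is introduced here for illustration only.

```r
# Sketch: select columns by name via a logical vector, then standardize them.
mt     <- mtcars
wanted <- c("mpg", "cyl", "hp", "wt")  # columns 1, 2, 4 and 6 of mtcars

# TRUE for each column of mt whose name is in 'wanted', FALSE otherwise.
vec <- names(mt) %in% wanted

mt[vec] <- lapply(mt[vec], function(x) (x - mean(x)) / sd(x))
```

Matching by name keeps the selection stable if the column order changes, which positional integers would not.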

One nice side-effect of this is that if the function is "expensive" (time or resources), then it only operates on the relevant columns. If nothing is selected, nothing is done.

vec <- sapply(mt, function(x) min(x) > 300)
any(vec)
# [1] FALSE
system.time( mt[vec] <- lapply(mt[vec], function(x) { Sys.sleep(100); x+1; }) )
#    user  system elapsed 
#       0       0       0 

