
How to standardize a data frame which contains both numeric and factor variables

My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.

> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

Could the standardising work by doing this? I want to standardise the columns 8,9,10,11 and 12 but I think I have the wrong code.

mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))

Thanks in advance

Here is one option to standardize

 mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
                     scale(x, center=TRUE, scale=TRUE)
                      } else x)
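If only certain columns should be standardised (the question mentions columns 8 to 12), the same idea can be restricted to a set of column positions. The following is a minimal sketch on a made-up data frame; `toy`, its column names, and the positions in `cols` are illustrative assumptions, not the asker's data.

```r
# Toy data frame standing in for the asker's data (all names are hypothetical)
toy <- data.frame(
  id    = 1:5,
  group = factor(c("a", "b", "a", "b", "a")),
  x     = c(2, 4, 6, 8, 10),
  y     = c(1.5, 2.5, 2.0, 4.0, 5.0)
)

cols <- c(3, 4)  # positions to standardise; would be 8:12 in the question

# Guard against accidentally selecting a factor or character column
stopifnot(all(sapply(toy[cols], is.numeric)))

scaled <- toy
# as.numeric() turns the one-column matrix returned by scale() into a plain vector
scaled[cols] <- lapply(toy[cols], function(v) as.numeric(scale(v)))

str(scaled)
```

The untouched columns (`id` and `group` here) are carried over unchanged, so the result keeps the original column order.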

You can use the dplyr package to do this:

mydata2%>%mutate_if(is.numeric,scale)
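In dplyr 1.0.0 and later, `mutate_if()` is superseded by `across()`. A rough equivalent, assuming a reasonably recent dplyr is installed, might look like the sketch below; the `toy` data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(
  name   = factor(c("a", "b", "c", "d")),
  height = c(150, 160, 170, 180),
  weight = c(50, 60, 65, 85)
)

# Standardise every numeric column; as.numeric() keeps the result a plain vector
toy_scaled <- toy %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

str(toy_scaled)
```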

Here are some options to consider, although it is answered late:

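# Working environment and memory management (optional housekeeping)
rm(list = ls(all.names = TRUE))
gc()
# memory.limit() was Windows-only and no longer has any effect as of R 4.2.0
# memory.limit(size = 64935)

# Set working directory (replace "path" with a real directory; not needed for this example)
# setwd("path")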

# Example data frame
# stringsAsFactors = TRUE is needed in R >= 4.0.0 to get the factor columns shown below
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
                 "Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
                 "Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
                 "Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
                 "Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
                 "Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
                 stringsAsFactors = TRUE)

Let us check the structure of df:

str(df)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  21 19 25 34 45 63 39 28 50 39
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  2138 1516 2213 2500 2660 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num  60 70 88 48 71 51 65 44 53 91

We see that Age, Salary, Height and Weight are numeric and Name and Gender are categorical (factor variables).

Let us scale just the numeric variables using only base R:

1) Option: (slight modification of what akrun has proposed here)

start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  (x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1

Time difference of 0.02717805 secs
str(df1)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...

2) Option: (akrun's approach)

start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2

Time difference of 0.02599907 secs
str(df2)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...

3) Option:

start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3

str(df3)
'data.frame':   10 obs. of  6 variables:
$ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2

4) Option (using tidyverse and invoking dplyr):

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4

Time difference of 0.012043 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
$ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
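A single `Sys.time()` difference on a 10-row data frame is mostly noise, so the timings reported for the options should be read with caution. One rough way to get a steadier comparison in base R is to repeat each call many times inside `system.time()`. The sketch below does this for the two per-column computations used in options 1 and 2; the vector `x`, the helper names `manual` and `via_scale`, and the repetition count are arbitrary choices for illustration.

```r
# Arbitrary example vector
set.seed(1)
x <- rnorm(1000)

# The two per-column computations being compared (helper names are hypothetical)
manual    <- function(v) (v - mean(v)) / sd(v)
via_scale <- function(v) as.numeric(scale(v))

# Repeat each call so that the elapsed time is large enough to measure
n_rep    <- 2000
t_manual <- system.time(for (i in seq_len(n_rep)) manual(x))
t_scale  <- system.time(for (i in seq_len(n_rep)) via_scale(x))

# Elapsed seconds for each approach (values will vary by machine)
c(manual = t_manual[["elapsed"]], scale = t_scale[["elapsed"]])

# Both approaches produce the same standardised values
all.equal(manual(x), via_scale(x))
```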

Which option to choose depends on the output structure you need and on speed. If your data is unbalanced and you want to balance it before classification, the scaled numeric variables (Age, Salary, Height and Weight) from options 3 and 4 will cause problems, because each one is stored as a one-column matrix rather than a plain numeric vector. For example:

str(df4$Age)
 num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
 - attr(*, "scaled:center")= num 36.3
 - attr(*, "scaled:scale")= num 13.8

The ROSE package (which balances data), for example, accepts only int, factor and num columns, so it will throw an error on these matrix columns.

To avoid this issue, the numeric variables after scaling can be saved as vectors instead of a column matrix by:

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, ~ scale(.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4

Time difference of 0.01400399 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...
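The same fix can be sketched in base R without the tidyverse, together with a quick check for matrix columns. The data frame `small` below is a cut-down, made-up example, and the variable names are chosen only for illustration.

```r
# Hypothetical cut-down example data
small <- data.frame(
  Age    = c(21, 19, 25, 34, 45),
  Gender = factor(c("Female", "Female", "Male", "Female", "Male")),
  Weight = c(60, 70, 88, 48, 71)
)

num_cols <- sapply(small, is.numeric)

# Plain scale() leaves one-column matrices in the data frame
df_matrix <- small
df_matrix[num_cols] <- lapply(df_matrix[num_cols], scale)
sapply(df_matrix, is.matrix)   # flags the scaled columns

# Wrapping in as.numeric() drops the dim and "scaled:*" attributes
df_vector <- small
df_vector[num_cols] <- lapply(df_vector[num_cols],
                              function(x) as.numeric(scale(x)))
sapply(df_vector, is.matrix)   # no matrix columns remain
```

Running `sapply(your_data, is.matrix)` before passing data to a package that expects plain columns is a cheap way to catch this problem early.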
