How to standardize a data frame which contains both numeric and factor variables

My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.

> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

Could the standardising work by doing this? I want to standardise columns 8, 9, 10, 11 and 12, but I think my code is wrong.

mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))

Thanks in advance

Here is one option to standardize:

 mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
                     scale(x, center=TRUE, scale=TRUE)
                      } else x)
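
If only specific columns should be standardised, as with columns 8 to 12 in the question, a minimal sketch along the following lines may help. The data frame `flow` and its column names are hypothetical stand-ins for the asker's data, and the positions 3:5 play the role of 8:12.

```r
# Hypothetical stand-in for the asker's data: two factor columns, three numeric ones
flow <- data.frame(
  site  = factor(c("a", "b", "c", "d")),
  group = factor(c("x", "x", "y", "y")),
  depth = c(1.2, 3.4, 2.2, 5.0),
  speed = c(10, 12, 9, 15),
  temp  = c(4.1, 3.9, 5.2, 4.4)
)

cols <- 3:5  # positions of the columns to standardise (8:12 in the question)

# scale() accepts an all-numeric data frame and returns a matrix with the same
# columns, which can be assigned back into the selected columns
flow[cols] <- scale(flow[cols])

colMeans(flow[cols])    # each close to 0
sapply(flow[cols], sd)  # each close to 1
```

The factor columns are never passed to scale(), so the "'x' must be numeric" error does not arise.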

You can use the dplyr package to do this:

library(dplyr)
mydata2 %>% mutate_if(is.numeric, scale)
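
In dplyr 1.0.0 and later, the same idea can be written with across() and where(). The following is a sketch under that assumption, using a small hypothetical data frame `d`. Wrapping scale() in as.vector() is an addition here, so that each column stays a plain numeric vector.

```r
library(dplyr)

# Hypothetical example data: one factor column and two numeric columns
d <- data.frame(
  id = factor(c("a", "b", "c")),
  x  = c(1, 2, 3),
  y  = c(10, 20, 60)
)

# Standardise every numeric column; as.vector() drops the matrix structure
# and the "scaled:*" attributes that scale() attaches
d_scaled <- d %>%
  mutate(across(where(is.numeric), ~ as.vector(scale(.x))))

str(d_scaled)
```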

Here are some options to consider, although this is a late answer:

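# Working environment and memory management (optional; not needed for the scaling itself)
rm(list = ls(all.names = TRUE))  # clear the workspace
gc()                             # trigger garbage collection
memory.limit(size = 64935)       # Windows only; a stub with no effect from R 4.2.0 onwards

# Set working directory ("path" is a placeholder)
setwd("path")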

# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
                 "Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
                 "Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
                 "Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
                 "Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
                 "Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
                 stringsAsFactors = TRUE)  # needed in R >= 4.0 so that Name and Gender become factors

Let us check the structure of df:

str(df)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  21 19 25 34 45 63 39 28 50 39
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  2138 1516 2213 2500 2660 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num  60 70 88 48 71 51 65 44 53 91

We see that Age, Salary, Height and Weight are numeric, while Name and Gender are categorical (factor variables).

Let us scale just the numeric variables using only base R:

1) Option: (a slight modification of what akrun has proposed here)

start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  (x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1

Time difference of 0.02717805 secs
str(df1)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...

2) Option: (akrun's approach)

start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2

Time difference of 0.02599907 secs
str(df2)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...
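
Options 1 and 2 print identical values because scale() with center=TRUE and scale=TRUE subtracts the column mean and divides by the sample standard deviation. A quick check on the Age column from the example (the variable names here are only for illustration):

```r
x <- c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39)  # the Age column from the example

manual <- (x - mean(x)) / sd(x)                             # option 1
scaled <- as.vector(scale(x, center = TRUE, scale = TRUE))  # option 2, flattened

all.equal(manual, scaled)  # TRUE
```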

3) Option:

start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3

str(df3)
'data.frame':   10 obs. of  6 variables:
$ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
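
The "scaled:center" and "scaled:scale" attributes shown in this output are the mean and standard deviation that scale() used, so they can be read back to undo the transformation. A small sketch using the Height values from the example (the variable names are illustrative):

```r
x <- c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183)  # Height in cm

z <- scale(x)                        # 10 x 1 matrix with attributes
centre <- attr(z, "scaled:center")   # mean used for centring
spread <- attr(z, "scaled:scale")    # standard deviation used for scaling

# Undo the standardisation
restored <- as.vector(z) * spread + centre
all.equal(restored, x)  # TRUE
```

The same two numbers can be reused to put new observations on the same scale, for example `(new_height - centre) / spread`, where `new_height` is a hypothetical vector of new values.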

4) Option (using tidyverse and invoking dplyr):

library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4

Time difference of 0.012043 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
$ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
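
The timings above come from single runs on ten rows, so differences of a few milliseconds are mostly noise. A rough way to get a steadier comparison is to repeat each operation many times on a larger input with system.time(). The vector length and repetition count below are arbitrary assumptions:

```r
x <- rnorm(1e5)  # hypothetical larger input

# Repeat each approach 50 times and time the whole loop
manual_time <- system.time(for (i in 1:50) (x - mean(x)) / sd(x))
scale_time  <- system.time(for (i in 1:50) scale(x))

manual_time["elapsed"]
scale_time["elapsed"]
```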

Which option to use depends on the output structure you need and on speed. Suppose your data is unbalanced, you want to balance it, and you then want to run a classification after scaling the numeric variables. In that case the matrix structure of the scaled numeric variables (Age, Salary, Height and Weight) will cause problems. I mean:

str(df4$Age)
 num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
 - attr(*, "scaled:center")= num 36.3
 - attr(*, "scaled:scale")= num 13.8

Since, for example, the ROSE package (which balances data) does not accept column types other than int, factor and num, it will throw an error.

To avoid this issue, the scaled numeric variables can be saved as vectors instead of one-column matrices:

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, ~ scale(.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4

with

Time difference of 0.01400399 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...
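
The same flattening can be done without dplyr by adapting option 3. The following is a base R sketch on a shortened, hypothetical version of the example data. The name `df5` is introduced here only for illustration.

```r
# Shortened, hypothetical version of the example data
df <- data.frame(
  Age    = c(21, 19, 25),
  Gender = factor(c("Female", "Female", "Male")),
  Weight = c(60, 70, 88)
)

indices <- sapply(df, is.numeric)  # TRUE for the numeric columns

df5 <- df
# as.vector() turns the one-column matrix returned by scale() into a plain vector
df5[indices] <- lapply(df5[indices], function(x) as.vector(scale(x)))

str(df5)
is.matrix(df5$Age)  # FALSE
```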
