
How do I convert a categorical variable into multiple dummy variables in R?

I have a dataset with a column named Age that contains these values: 24 or under, 25 to 34, 35 to 44, 45 to 54, 25 to 34, 24 or under, 35 to 44, 25 to 34, 45 to 54.

I need to recode the categorical variable "Age" as follows: 24 or under = 1, 25 to 34 = 2, 35 to 44 = 3, 45 to 54 = 4.

Can anyone help me here?

Many thanks in advance.

You can use nested ifelse statements:

set.seed(12)
df <- data.frame(Age = c(sample(c("24 or under", "25 to 34", "35 to 44", "45 to 54"), 20, replace = TRUE)))
df$Age_new <- ifelse(df$Age == "24 or under", 1,
                     ifelse(df$Age == "25 to 34", 2,
                            ifelse(df$Age == "35 to 44", 3, 4)))

Result:

df
           Age Age_new
1     25 to 34       2
2     35 to 44       3
3  24 or under       1
4     45 to 54       4
5  24 or under       1
6     35 to 44       3
7     45 to 54       4
8     25 to 34       2
9     45 to 54       4
10    35 to 44       3
11 24 or under       1
12    35 to 44       3
13    25 to 34       2
14 24 or under       1
15    25 to 34       2
16    35 to 44       3
17    25 to 34       2
18    25 to 34       2
19    35 to 44       3
20    25 to 34       2
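
A named lookup vector is one alternative to nesting ifelse calls. The sketch below assumes the four labels from the question; the name age_codes is introduced here purely for illustration. Unlike the nested version, a label that is missing from the lookup comes back as NA instead of silently falling into category 4.

```r
# Hypothetical lookup table (name chosen for this example): label -> numeric code
age_codes <- c("24 or under" = 1, "25 to 34" = 2, "35 to 44" = 3, "45 to 54" = 4)

# Small sample using the labels from the question
df <- data.frame(Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54", "25 to 34"))

# Index the lookup by label; as.character() guards against Age being a factor
df$Age_new <- unname(age_codes[as.character(df$Age)])
df$Age_new
```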

As pieterbons described, your Age field is practically a factor already. Once Age is stored as a factor, as.numeric() returns its numeric category codes. Since R 4.0.0, data.frame() keeps strings as character by default, so convert the column to a factor first; calling as.numeric() directly on the strings returns NA.

df <- data.frame(Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54"))
df$Age <- factor(df$Age)
as.numeric(df$Age)

You can also create a new field with numeric codes for your Age field as you described (this is particularly helpful if you have a string variable with a distinct order that you want to convert to a factor). There are multiple ways to do this:

# 1) Base R: the integer code of each factor level
df$age_new <- as.numeric(df$Age)

# 2) dplyr: map each label to a code explicitly
library(dplyr)
df <- df %>%
  mutate(age_new = case_when(Age == "24 or under" ~ 1,
                             Age == "25 to 34"    ~ 2,
                             Age == "35 to 44"    ~ 3,
                             TRUE                 ~ 4))

#> df
#          Age age_new
#1 24 or under       1
#2    25 to 34       2
#3    35 to 44       3
#4    45 to 54       4

If your column Age is a factor, this happens automatically behind the scenes (each value is stored as an integer with a corresponding text label). To get these integers explicitly, you can use as.numeric().

# stringsAsFactors = TRUE is needed since R 4.0.0, where strings stay character by default
df <- data.frame(Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54"),
                 stringsAsFactors = TRUE)

df$Age_cat <- as.numeric(df$Age)

You might run into sorting issues if the levels should have a different order than the original one. In that case you can explicitly set the levels of the factor.
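
A minimal sketch of that sorting issue, using labels made up for this example ("Under 25" is not in the question's data): by default the levels are sorted, so a label that starts with a letter lands after the ones that start with digits.

```r
# Illustrative labels only; "Under 25" sorts after the labels that start with digits
age <- c("Under 25", "25 to 34", "35 to 44", "25 to 34")

# Default levels are sorted, so "Under 25" becomes the last level
default_codes <- as.numeric(factor(age))

# Explicit levels give the intended order
age_levels <- c("Under 25", "25 to 34", "35 to 44")
ordered_codes <- as.numeric(factor(age, levels = age_levels))

default_codes  # 3 1 2 1
ordered_codes  # 1 2 3 2
```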

If you want a dummy variable (i.e. 0 or 1), you can use dplyr::if_else to create a new variable for each category:

library(dplyr)

Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54")
data.frame(age = Age) %>%
    mutate("24 or under" = if_else(age == Age[1], 1, 0),
           "25 to 34"    = if_else(age == Age[2], 1, 0),
           "35 to 44"    = if_else(age == Age[3], 1, 0),
           "45 to 54"    = if_else(age == Age[4], 1, 0))

If you want a numeric value instead, code your variable as a factor, set the levels in the order you want, and then use as.numeric:

Age <- factor(c("24 or under", "25 to 34", "35 to 44", "45 to 54"),
              levels = c("24 or under", "25 to 34", "35 to 44", "45 to 54"))

as.numeric(Age)
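
For the 0/1 dummy columns mentioned above, base R's model.matrix() is another option that avoids writing one if_else per category. This is only a sketch on a small sample built from the question's labels; the name dummies is introduced here for illustration.

```r
# Small sample using the labels from the question, stored as a factor
df <- data.frame(age = factor(c("24 or under", "25 to 34", "35 to 44",
                                "45 to 54", "25 to 34")))

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ age - 1, data = df)

# model.matrix() prefixes column names with the variable name; use the plain labels
colnames(dummies) <- levels(df$age)

cbind(df, dummies)
```

Each row has exactly one 1 across the four new columns, in the column that matches its age group.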


 