
How can I create an indicator (dummy) variable from a factor variable?

I have a column that holds the income of each household, and I want to turn it into an indicator variable for use in my analysis. The indicator should be 1 if income is larger than $35,000 and 0 otherwise.

  Household          INCOM
      1         (5) $50,000 - $74,999
      2         (3) $25,000 - $34,99
      3         (4) $35,000 - $49,999

So the indicator variable should be:

     IND
      1
      0
      1

I used the following, but of course it didn't work because INCOM is not numeric:

     df %>% mutate(`income` = 1* (INCOM >= 35000), )       
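To illustrate why this comparison fails, here is a minimal sketch that assumes INCOM is stored as a factor, as the title suggests. The vector incom is a hypothetical stand-in for the column. R does not define ordering comparisons for unordered factors, so the result is NA (with a warning) rather than TRUE/FALSE.

```r
# Stand-in for df$INCOM, assumed here to be a factor
incom <- factor(c("(5) $50,000 - $74,999",
                  "(3) $25,000 - $34,99",
                  "(4) $35,000 - $49,999"))

# Ordering comparisons are not meaningful for unordered factors:
# R issues a warning and returns NA for every element
res <- suppressWarnings(incom >= 35000)
res
#[1] NA NA NA
```

Multiplying these NA values by 1 still gives NA, which is why the numbers have to be pulled out of the labels first.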

One base R approach could be

df$Ind <- as.integer(sapply(strsplit(sub(".*\\$(\\d+).*\\$(\\d+).*", "\\1-\\2", 
           gsub(",", "", df$INCOM)), "-"), function(x) any(as.numeric(x) > 35000)))

df
#  Household                 INCOM Ind
#1         1 (5) $50,000 - $74,999   1
#2         2  (3) $25,000 - $34,99   0
#3         3 (4) $35,000 - $49,999   1

I did everything in a one-liner above, so let me explain the commands one by one.

Using gsub, we remove all the commas present in INCOM:

gsub(",", "", df$INCOM)
#[1] "(5) $50000 - $74999" "(3) $25000 - $3499"  "(4) $35000 - $49999"

Then we use sub to extract both of the numbers that follow a $:

sub(".*\\$(\\d+).*\\$(\\d+).*", "\\1-\\2", gsub(",", "", df$INCOM))
#[1] "50000-74999" "25000-3499"  "35000-49999"

We then split each string on the hyphen:

strsplit(sub(".*\\$(\\d+).*\\$(\\d+).*", "\\1-\\2", gsub(",", "", df$INCOM)), "-")

#[[1]]
#[1] "50000" "74999"

#[[2]]
#[1] "25000" "3499" 

#[[3]]
#[1] "35000" "49999"

Finally, sapply converts these numbers to numeric and checks whether either of them is greater than 35000, and as.integer turns the result into 1/0 values.
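This last step can be sketched on its own as follows. The list parts is typed in by hand to mirror the strsplit output above, so the snippet runs without df; the names parts and flags are only for illustration.

```r
# Hand-typed copy of the strsplit() result shown above
parts <- list(c("50000", "74999"), c("25000", "3499"), c("35000", "49999"))

# For each pair, convert to numeric and test whether either bound exceeds 35000
flags <- sapply(parts, function(x) any(as.numeric(x) > 35000))
flags
#[1]  TRUE FALSE  TRUE

# as.integer() turns TRUE/FALSE into 1/0
as.integer(flags)
#[1] 1 0 1
```

Note that the third household is flagged only because its upper bound (49999) exceeds 35000; its lower bound of exactly 35000 does not pass the strict > test.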

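We can use gsubfn to build the indicator. We first remove the $ and , characters from INCOM with gsub, then capture the two numbers with gsubfn, convert them to numeric, compare them with 35000, and replace the matched range with the resulting 0/1 value. The outer sub then drops the leading bracket code so that only the binary digit remains.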

library(gsubfn)
df1$ind <- as.integer(sub(".* ", "", gsubfn("(\\d+) - (\\d+)",
    ~ +(any(as.numeric(c(x, y))  > 35000)), gsub("[$,]", "", df1$INCOM))))
df1$ind
#[1] 1 0 1
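To see the pieces around the gsubfn call, the base R steps can be run separately. This is only a sketch: incom is a stand-in copy of df1$INCOM, and the strings passed to the second call are what the gsubfn replacement is expected to leave behind, typed in by hand rather than produced by gsubfn.

```r
# Stand-in copy of df1$INCOM
incom <- c("(5) $50,000 - $74,999", "(3) $25,000 - $34,99", "(4) $35,000 - $49,999")

# Inner step: strip the dollar signs and commas
cleaned <- gsub("[$,]", "", incom)
cleaned
#[1] "(5) 50000 - 74999" "(3) 25000 - 3499"  "(4) 35000 - 49999"

# Outer step: once the range has been replaced by 0/1, drop everything
# up to the last space, which removes the bracket code such as "(5)"
digits <- sub(".* ", "", c("(5) 1", "(3) 0", "(4) 1"))
digits
#[1] "1" "0" "1"
```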

Or an option with tidyverse

library(tidyverse)
library(readr)
df1 %>% 
  extract(INCOM, into = c("col1", "col2"), remove = FALSE, 
    ".*\\$(\\d+,\\d+) - \\$(\\d+,\\d+)") %>% 
  mutate_at(vars(starts_with('col')), parse_number) %>%
  mutate(Ind = as.integer(col1 > 35000 | col2 > 35000)) %>% 
  select(-col1, -col2)
#   Household                 INCOM Ind
#1         1 (5) $50,000 - $74,999   1
#2         2  (3) $25,000 - $34,99   0
#3         3 (4) $35,000 - $49,999   1

Or another option is

str_remove_all(df1$INCOM, ",") %>%
      str_extract_all("(?<=[$])([0-9]+)") %>%
      map_int(~ +(any(as.numeric(.x) > 35000)))
#[1] 1 0 1
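The same lookbehind idea can also be sketched in base R, without stringr or purrr. This is an illustrative variant rather than part of the answer above; incom is a stand-in copy of df1$INCOM, and perl = TRUE is needed because the default regex engine does not support lookbehind.

```r
# Stand-in copy of df1$INCOM
incom <- c("(5) $50,000 - $74,999", "(3) $25,000 - $34,99", "(4) $35,000 - $49,999")

# Remove the commas so each amount is one run of digits
no_commas <- gsub(",", "", incom)

# Extract every run of digits that directly follows a dollar sign
nums <- regmatches(no_commas,
                   gregexpr("(?<=[$])[0-9]+", no_commas, perl = TRUE))

# 1 if either bound exceeds 35000, otherwise 0
ind <- vapply(nums, function(x) as.integer(any(as.numeric(x) > 35000)),
              integer(1))
ind
#[1] 1 0 1
```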

data

df1 <- structure(list(Household = 1:3, INCOM = c("(5) $50,000 - $74,999", 
"(3) $25,000 - $34,99", "(4) $35,000 - $49,999")), class = "data.frame",
row.names = c(NA, 
-3L))

