简体   繁体   中英

How to implement a generalized “switching equation” in r?

I'm trying to implement a generalized "switching equation" (Gerber and Green 2012, chapter 2) in r.

I have a categorical variable Z, that has k > 2 values.

I have k columns names Y_1, Y_2,... Y_k.

I want to make a variable Y that picks our the "right" values from each column. That is, if Z is 1, put the Y_1 values into Y.

I've got a solution with a loop, but it's annoying. Is there a super sweet way to do this with a one liner? No nested ifelse, pls.

N <- 100

df <- data.frame(
  Z = sample(1:3, N, replace = TRUE),
  Y_1 = rnorm(N),
  Y_2 = rnorm(N),
  Y_3 = rnorm(N)
)

# an annoying solution
df <- within(df,{
  Y <- rep(NA, nrow(df))
  Y[Z == 1] <- Y_1[Z == 1]
  Y[Z == 2] <- Y_2[Z == 2]
  Y[Z == 3] <- Y_3[Z == 3]
})

head(df) 

which yields:

  Z         Y_1        Y_2         Y_3           Y
1 3  0.89124772  1.4377700  0.05226285  0.05226285
2 1  0.89186873 -0.6984839 -0.86141525  0.89186873
3 1 -0.01315678  1.5193461  0.18290065 -0.01315678
4 3 -0.57857274 -1.4445197  2.03764943  2.03764943
5 3 -0.19793692 -0.1818225  1.10270877  1.10270877
6 2  1.48291431  2.7264541  0.70129357  2.72645413

EDIT: I like Weihuang Wong's approach df$Y <- sapply(split(df, 1:nrow(df)), function(x) x[, paste0("Y_", x$Z)]) in part because it doesn't rely on position but rather the column names. All of the offered answers so far use column position.... I'm a tiny bit worried that sapply(split()) is slow, but maybe I'm crazy?

df$Y <- apply(df, 1, function(x) x[x[1]+1] )
head(df)
#  Z        Y_1        Y_2        Y_3          Y
#1 1 -0.8598997 -0.3180947  1.9374462 -0.8598997
#2 2 -0.2392902  0.2266245  0.2364991  0.2266245
#3 1 -0.8733609 -1.3892361  0.3351359 -0.8733609
#4 3 -0.6533548 -1.1042993 -0.2906852 -0.2906852
#5 1 -1.7424126 -0.2101860  0.1198945 -1.7424126
#6 2 -1.9746651 -0.4308746 -0.7849773 -0.4308746

Not exactly 1 line, but

get_result <- function(dfrow){
  x <- unlist(dfrow[,1:4])
  Y <- x[x[1] + 1] 
}

library(purrr)
newdf <- by_row(df, get_result)

This can be done in a vectorized way with row/column indexing

df$Y <- df[-1][cbind(1:nrow(df), df$Z)]
df
#  Z         Y_1        Y_2         Y_3           Y
#1 3  0.89124772  1.4377700  0.05226285  0.05226285
#2 1  0.89186873 -0.6984839 -0.86141525  0.89186873
#3 1 -0.01315678  1.5193461  0.18290065 -0.01315678
#4 3 -0.57857274 -1.4445197  2.03764943  2.03764943
#5 3 -0.19793692 -0.1818225  1.10270877  1.10270877
#6 2  1.48291431  2.7264541  0.70129357  2.72645410

data

df <- structure(list(Z = c(3L, 1L, 1L, 3L, 3L, 2L), Y_1 = c(0.89124772, 
0.89186873, -0.01315678, -0.57857274, -0.19793692, 1.48291431
), Y_2 = c(1.43777, -0.6984839, 1.5193461, -1.4445197, -0.1818225, 
2.7264541), Y_3 = c(0.05226285, -0.86141525, 0.18290065, 2.03764943, 
1.10270877, 0.70129357)), .Names = c("Z", "Y_1", "Y_2", "Y_3"
), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

OP here.

I timed my "annoying" by hand solution against the "ind_split" solution proposed by Weihuang Wong. I also did it by "groups":

N <- 100000

df <- data.frame(
  Z = sample(1:3, N, replace = TRUE),
  Y_1 = rnorm(N),
  Y_2 = rnorm(N),
  Y_3 = rnorm(N)
)


ind_split <- 
system.time({
  df$Y <- sapply(split(df, 1:nrow(df)), function(x) x[, paste0("Y_", x$Z)])
  head(df)
})

revealer <- 
  function(list_element){
    col_name <- paste0("Y_", list_element[1, "Z"])
    list_element$Y <- list_element[,col_name]
    return(list_element)
  }

group_split <- 
system.time({
  split_list <- split(df, df$Z)
  df <- do.call(what = rbind, lapply(split_list, revealer))
  head(df)
})


by_hand <- 
system.time({
  # an annoying solution
  df <- within(df,{
    Y <- rep(NA, nrow(df))
    Y[Z == 1] <- Y_1[Z == 1]
    Y[Z == 2] <- Y_2[Z == 2]
    Y[Z == 3] <- Y_3[Z == 3]
  })
  head(df)
})


ind_split
group_split
by_hand

the timings came in at

> ind_split
   user  system elapsed 
  1.023   0.083   1.136 
> group_split
   user  system elapsed 
  0.011   0.002   0.013 
> by_hand
   user  system elapsed 
  0.001   0.000   0.001 

The annoying by hand method is way faster, which is crazy to me! Splitting by groups is faster than splitting by individuals.

此处的后期添加是通过使用matchnames (基于先前由Akrun提出并由我自己修改的另一个解决方案)从Akrun的答案构建的,但不使用位置编号,而是:

df$Y <- df[cbind(1:nrow(df), match(paste0('Y_', df$Z), names(df)))]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM