简体   繁体   中英

Optimizing replacement in a data frame

This is an extension of Update pairs of columns based on pattern in their names . Thus, this is partially motivated by curiosity and partially for entertainment.

While developing an answer to that question, it occurred to me that this may be one of those cases where a for loop is more efficient than an *apply function (and I've been looking for a good illustration of the fact that *apply is not necessarily "more efficient" than a well constructed for loop). So I'd like to pose the question again, and ask if anyone is able to write a solution using an *apply function (or purr if that's your thing) that performs better than the for loop I've written below. Performance will be judged on execution time as evaluated via microbenchmark on my laptop (A cheap Windows box running R 3.3.2).

data.table and dplyr suggestions are welcome as well. (I'm already making plans for what I'll do with all the microseconds I save).

The Challenge

Consider the data frame:

col_1 <- c(1,2,NA,4,5)
temp_col_1 <-c(12,2,2,3,4)
col_2 <- c(1,23,423,NA,23)
temp_col_2 <-c(1,2,23,4,5)

df_test <- data.frame(col_1, temp_col_1, col_2, temp_col_2) 
set.seed(pi)
df_test <- df_test[sample(1:nrow(df_test), 1000, replace = TRUE), ]

For each col_x , replace the missing values with the corresponding value in temp_col_x . So, for example:

  col_1 temp_col_1 col_2 temp_col_2
1     1         12     1          1
2     2          2    23          2
3    NA          2   423         23
4     4          3    NA          4
5     5          4    23          5

becomes

  col_1 temp_col_1 col_2 temp_col_2
1     1         12     1          1
2     2          2    23          2
3     2          2   423         23
4     4          3     4          4
5     5          4    23          5

Existing Solutions

The for loop I've already written

temp_cols <- names(df_test)[grepl("^temp", names(df_test))]
cols <- sub("^temp_", "", temp_cols)

for (i in seq_along(temp_cols)){
  row_to_replace <- which(is.na(df_test[[cols[i]]]))
  df_test[[cols[i]]][row_to_replace] <- df_test[[temp_cols[i]]][row_to_replace]
 }

My best apply function so far is:

lapply(names(df_test)[grepl("^temp_", names(df_test))],
       function(tc){
         col <- sub("^temp_", "", tc)
         row_to_replace <- which(is.na(df_test[[col]]))
         df_test[[col]][row_to_replace] <<- df_test[[tc]][row_to_replace]
       })

Benchmarking

As (if) suggestions come in, I will begin showing benchmarks in edits to this question. (edit: code is now a copy of Frank's answer, but run 100 times on my machine, as promised)

library(magrittr)
library(data.table)
library(microbenchmark)
set.seed(pi)

nc = 1e3
nr = 1e2
df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
df_r  = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame


microbenchmark(times = 100,
               for_vec = {
                 df_m <- df_m0
                 for (col in 1:nc){
                   w <- which(is.na(df_m[[col]]))
                   df_m[[col]][w] <- df_r[[col]][w]
                 }
               }, lapply_vec = {
                 df_m <- df_m0
                 lapply(seq_along(df_m),
                        function(i){
                          w <- which(is.na(df_m[[i]]))
                          df_m[[i]][w] <<- df_r[[i]][w]
                        })

               }, for_df = {
                 df_m <- df_m0
                 for (col in 1:nc){
                   w <- which(is.na(df_m[[col]]))
                   df_m[w, col] <- df_r[w, col]
                 }
               }, lapply_df = {
                 df_m <- df_m0
                 lapply(seq_along(df_m),
                        function(i){
                          w <- which(is.na(df_m[[i]]))
                          df_m[w, i] <<- df_r[w, i]
                        })
               }, mat = { # in lmo's answer
                 df_m <- df_m0
                 bah = is.na(df_m)
                 df_m[bah] = df_r[bah]
               }, set = {
                 df_m <- copy(df_m0)
                 for (col in 1:nc){
                   w = which(is.na(df_m[[col]]))
                   set(df_m, i = w, j = col, v = df_r[w, col])
                 }
               }
)

Results:

Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval cld
    for_vec 135.83875 157.84548 175.23005 166.60090 176.81839 502.0616   100  b 
 lapply_vec 135.67322 158.99496 179.53474 165.11883 178.06968 551.7709   100  b 
     for_df 173.95971 204.16368 222.30677 212.76608 224.78188 446.6050   100   c
  lapply_df 181.46248 205.57069 220.38911 215.08505 223.98406 381.1006   100   c
        mat 129.27835 154.01248 173.11378 159.83070 169.67439 453.0888   100  b 
        set  66.86402  81.08138  86.32626  85.51029  89.58331 123.1926   100 a  

Data.table provides the set function to modify data.tables or data.frames by reference.

Here's a benchmark that is more flexible with respect to numbers of cols and rows and that sidesteps the awkward column-name stuff in the OP:

library(magrittr)
nc = 1e3
nr = 1e2
df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
df_r  = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame

library(data.table)
library(microbenchmark)
microbenchmark(times = 10,
  for_vec = {
    df_m <- df_m0
    for (col in 1:nc){
      w <- which(is.na(df_m[[col]]))
      df_m[[col]][w] <- df_r[[col]][w]
    }
    }, lapply_vec = {
    df_m <- df_m0
    lapply(seq_along(df_m), function(i){
          w <- which(is.na(df_m[[i]]))
          df_m[[i]][w] <<- df_r[[i]][w]
    })
  }, for_df = {
    df_m <- df_m0
    for (col in 1:nc){
      w <- which(is.na(df_m[[col]]))
      df_m[w, col] <- df_r[w, col]
    }
    }, lapply_df = {
    df_m <- df_m0
    lapply(seq_along(df_m), function(i){
          w <- which(is.na(df_m[[i]]))
          df_m[w, i] <<- df_r[w, i]
    })
  }, mat = { # in lmo's answer
    df_m <- df_m0
    bah = is.na(df_m)
    df_m[bah] = df_r[bah]
  }, set = {
    df_m <- copy(df_m0)
    for (col in 1:nc){
      w = which(is.na(df_m[[col]]))
      set(df_m, i = w, j = col, v = df_r[w, col])
    }
  }
)

Which gives...

Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
    for_vec  77.06501  89.53430 100.10051  96.33764 106.13486 142.1329    10
 lapply_vec  77.67366  89.04438  98.81510  99.08863 108.86491 117.2956    10
     for_df 103.79097 130.33134 140.95398 144.46526 157.11335 161.4507    10
  lapply_df  97.04616 114.17825 126.10633 131.20382 137.64375 149.7765    10
        mat  73.47691  84.51473 100.16745 103.44476 112.58006 128.6166    10
        set  44.32578  49.58586  62.52712  56.30460  71.63432 101.3517    10

Comments:

  • If we adjust nc and nr or the frequency of NA s, the ranking of these four options might change. I guess the more cols there are, the better the mat way (from @lmo's answer) and set way look.

  • The copy in the set test takes some extra time beyond what we'd see in practice, since the set function just modifies the table by reference (unlike the other options, I think).

Here is a readable solution. Probably slower than some.

df_test[c(TRUE, FALSE)][is.na(df_test[c(TRUE, FALSE)])] <- 
df_test[c(FALSE, TRUE)][is.na(df_test[c(TRUE, FALSE)])]

This could be sped up a bit with pre-allocating the replacement so it is only performed once.

filler <- is.na(df_test[c(TRUE, FALSE)])
df_test[c(TRUE, FALSE)][filler] <- df_test[c(FALSE, TRUE)][filler]

In a two data.frame scenario, df1 and df2, this logic would be

filler <- is.na(df1)
df1[filler] <- df2[filler]

Maybe this is naive, but how about neither? I think it's still in the spirit of things if you're just looking for the fastest method. I suspect this won't be it though.

col_1 <- c(1,2,NA,4,5)
temp_col_1 <-c(12,2,2,3,4)
col_2 <- c(1,23,423,NA,23)
temp_col_2 <-c(1,2,23,4,5)

df_test <- data.frame(col_1, temp_col_1, col_2, temp_col_2) 
set.seed(pi)
df_test <- df_test[sample(1:nrow(df_test), 1000, replace = TRUE), ]

df_test$col_1 <- ifelse(is.na(df_test$col_1), df_test$temp_col_1,df_test$col_1)
df_test$col_2 <- ifelse(is.na(df_test$col_2), df_test$temp_col_2,df_test$col_2)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM