
Efficient way to rowwise mutate with sample

For each 0 in x, I want to insert a random number from 1:10. I'm looking for an efficient way to do this in dplyr and/or data.table, as I have a very large dataset (10m rows).

library(tidyverse)
df <- data.frame(x = 1:10)
df[4, 1] = 0
df[6, 1] = 0
df
#     x
# 1   1
# 2   2
# 3   3
# 4   0
# 5   5
# 6   0
# 7   7
# 8   8
# 9   9
# 10 10

This doesn't work, because it replaces every zero with the same value:

set.seed(1)
df %>% 
  mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
#     x x2
# 1   1  1
# 2   2  2
# 3   3  3
# 4   0  9
# 5   5  5
# 6   0  9
# 7   7  7
# 8   8  8
# 9   9  9
# 10 10 10
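
The repeated 9 comes from how the expression is evaluated: sample(1:10, 1) runs once for the whole column, and ifelse() recycles that single value into every position where x == 0. A minimal base R sketch of the difference (the vector x and the names one_draw, recycled and per_element are made up for illustration):

```r
set.seed(1)
x <- c(1, 0, 3, 0, 0)

# One draw for the whole vector: ifelse() recycles it to every zero
one_draw <- sample(1:10, 1)
recycled <- ifelse(x == 0, one_draw, x)

# One draw per element: supply a vector as long as x
per_element <- ifelse(x == 0, sample(1:10, length(x), replace = TRUE), x)
```

The second form draws one value per row rather than one per zero, so it does more work than strictly needed; sampling only sum(x == 0) values avoids that.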

It can be achieved with rowwise(), but that is slow on a large dataset:

set.seed(1)
#use rowwise
df %>% 
  rowwise() %>% 
  mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
#        x    x2
#    <dbl> <dbl>
#  1     1     1
#  2     2     2
#  3     3     3
#  4     0     9
#  5     5     5
#  6     0     4
#  7     7     7
#  8     8     8
#  9     9     9
# 10    10    10

Any suggestions to speed this up?

Thanks

Not in tidyverse, but you could just do something like this:

is_zero <- (df$x == 0)
# replace = TRUE is needed once there are more than 10 zeros to fill
replacements <- sample(1:10, sum(is_zero), replace = TRUE)

df$x[is_zero] <- replacements

Of course, you can collapse that down if you'd like.

df$x[df$x == 0] <- sample(1:10, sum(df$x == 0), replace = TRUE)
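
For reuse on a large column, the same idea can be wrapped in a small function. This is only a sketch: fill_zeros and its pool argument are hypothetical names, not from the answer above. It assumes draws with replacement are acceptable, since sample() raises an error when asked for more values than the pool holds without replacement, which would happen as soon as there are more than 10 zeros.

```r
# Hypothetical helper: replace every 0 in x with a random draw from pool
fill_zeros <- function(x, pool = 1:10) {
  idx <- which(x == 0)  # which() drops NA, so missing values are left alone
  # Index into pool with sample.int() so a length-1 pool behaves as expected
  x[idx] <- pool[sample.int(length(pool), length(idx), replace = TRUE)]
  x
}

set.seed(1)
x <- c(1:5, rep(0, 20))  # 20 zeros, more than the 10 values in the pool
filled <- fill_zeros(x)
```

The function returns a modified copy, so it can be used as df$x <- fill_zeros(df$x) or inside mutate().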

Benchmarking the solutions above with microbenchmark, on a slightly modified, larger dataset:

library(data.table)
library(tidyverse)
df <- data.frame(x = 1:100000, y = rbinom(100000, size = 1, 0.5)) %>% 
  mutate(x = ifelse(y == 0, 0, x)) %>% 
  dplyr::select(-y)
dt <- setDT(df)


test <- microbenchmark::microbenchmark(
  base1 = {
    df$x[df$x == 0] <- sample(1:10, sum(df$x == 0), replace = T)
  },
  dplyr1 = {
     df %>% 
      mutate(x2 = replace(x, which(x == 0), sample(1:10, sum(x == 0), replace = T)))
  },
  dplyr2 = {
    df %>% group_by(id=row_number()) %>%
      mutate(across(c(x),.fns = list(x2 = ~ ifelse(.==0, sample(1:10, 1, replace = T), .)) )) %>%
      ungroup() %>% select(-id)
  },
  data.table = {
    dt[x == 0, x := sample(1:10, .N, replace = T)]
  },
  times = 500L
)
test
# Unit: microseconds
#        expr        min         lq          mean      median         uq        max neval cld
#       base1      733.7      785.9      979.0938      897.25     1137.0     1839.4   500  a 
#      dplyr1     5207.1     5542.1     6129.2276     5967.85     6476.0    21790.7   500  a 
#      dplyr2 15963406.4 16156889.2 16367969.8704 16395715.00 16518252.9 19276215.5   500  b
#  data.table     1547.4     2229.3     2422.1278     2455.60     2573.7    15076.0   500  a 

I expected data.table to be the quickest, but the base solution seems best (assuming I've set up the microbenchmark correctly).

EDIT, based on @chinsoon12's comment:

1e5 rows:

Unit: microseconds
       expr    min      lq     mean  median      uq     max neval cld
      base1  730.4  839.30 1380.465 1238.00 1322.85 28977.3   500  a 
 data.table 1394.8 1831.85 2030.215 1946.95 2060.40 29821.9   500  b

1e6 rows:

Unit: milliseconds
       expr    min      lq      mean   median       uq      max neval cld
      base1 9.8703 11.6596 16.030715 11.76195 12.04145 326.0118   500  b
 data.table 2.3772  2.7939  3.855672  3.04700  3.25900  61.4083   500  a 

At 1e6 rows data.table is the quickest; at 1e5 rows the base solution is still ahead.
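
One caveat about benchmarks like the one above, offered as a possible explanation rather than a verified diagnosis: both the base assignment and data.table's := overwrite the zeros in place, so once an expression has run, later repetitions may find no zeros left and time an almost empty update. A minimal base R sketch of the effect and one way around it (make_x and time_fill are hypothetical helpers, not part of the original benchmark):

```r
set.seed(1)

# Hypothetical helper: n values with roughly half of them set to 0
make_x <- function(n) ifelse(rbinom(n, size = 1, prob = 0.5) == 0, 0, seq_len(n))

x <- make_x(1e5)
zeros_before <- sum(x == 0)
x[x == 0] <- sample(1:10, sum(x == 0), replace = TRUE)
zeros_after <- sum(x == 0)  # nothing left for a second timed run to replace

# Hypothetical helper: modifying the argument inside a function leaves the
# caller's vector untouched, so every timing sees the zeros again
time_fill <- function(x) {
  system.time(x[x == 0] <- sample(1:10, sum(x == 0), replace = TRUE))[["elapsed"]]
}

fresh <- make_x(1e5)
timings <- replicate(5, time_fill(fresh))
zeros_in_fresh <- sum(fresh == 0)  # unchanged by the timed calls
```

With data.table, calling copy() on the table inside each timed expression serves the same purpose, at the cost of including the copy in the timing.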

Here is a data.table option using similar logic to Adam's answer. It filters for the rows that meet your criterion, x == 0, and then samples from 1:10 .N times (.N, without a grouping variable, is the number of rows in the filtered data.table).

library(data.table)

set.seed(1)

# replace = TRUE so this also works when more than 10 rows match
setDT(df)[x == 0, x := sample(1:10, .N, replace = TRUE)]
df
     x
 1:  1
 2:  2
 3:  3
 4:  9
 5:  5
 6:  4
 7:  7
 8:  8
 9:  9
10: 10

Maybe try with across() from dplyr in this way:

library(tidyverse)
#Data
df <- data.frame(x = 1:10)
df[4, 1] = 0
df[6, 1] = 0
#Code
df %>% group_by(id=row_number()) %>%
  mutate(across(c(x),.fns = list(x2 = ~ ifelse(.==0, sample(1:10, 1), .)) )) %>%
  ungroup() %>% select(-id)

Output:

# A tibble: 10 x 2
       x  x_x2
   <dbl> <dbl>
 1     1     1
 2     2     2
 3     3     3
 4     0     5
 5     5     5
 6     0     6
 7     7     7
 8     8     8
 9     9     9
10    10    10

I am adding a separate answer because there are already votes on the base option I provided. Here is a dplyr way using replace():

library(dplyr)

df %>% 
  mutate(x2 = replace(x, which(x == 0), sample(1:10, sum(x == 0), replace = TRUE)))
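
replace() is a base R function, so its behaviour can be checked on a plain vector outside mutate(). A minimal sketch, assuming draws with replacement are acceptable (replace = TRUE is used here so the call also works with more than ten zeros; the names zero_pos and x2 are illustrative):

```r
set.seed(1)
x <- c(1, 2, 3, 0, 5, 0, 7, 8, 9, 10)

zero_pos <- which(x == 0)  # positions 4 and 6

# replace() returns a modified copy; x itself is left unchanged
x2 <- replace(x, zero_pos, sample(1:10, length(zero_pos), replace = TRUE))
```

Because replace() only touches the positions it is given, exactly one value is drawn per zero, which is what keeps this form fast inside mutate().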
