Efficient way to mutate row-wise using sample

Question

For each 0 in x, I want to insert a random number from 1:10, but I'm looking for an efficient way to do this in dplyr and/or data.table, as I have a very large dataset (10m rows).

library(tidyverse)
df <- data.frame(x = 1:10)
df[4, 1] = 0
df[6, 1] = 0
df
#     x
# 1   1
# 2   2
# 3   3
# 4   0
# 5   5
# 6   0
# 7   7
# 8   8
# 9   9
# 10 10

This doesn't work, as it replaces every zero with the same value:

set.seed(1)
df %>% 
  mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
#     x x2
# 1   1  1
# 2   2  2
# 3   3  3
# 4   0  9
# 5   5  5
# 6   0  9
# 7   7  7
# 8   8  8
# 9   9  9
# 10 10 10
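
The likely reason is that `sample(1:10, 1)` is evaluated once and returns a single value, which `ifelse()` then recycles across every row. A minimal base R sketch of that behaviour, using a plain vector in place of the data frame column (the names `one_draw` and `many_draws` are only illustrative):

```r
x <- c(1, 2, 3, 0, 5, 0, 7, 8, 9, 10)

set.seed(1)
# sample(1:10, 1) yields one value, which ifelse() recycles for every zero
one_draw <- ifelse(x == 0, sample(1:10, 1), x)

set.seed(1)
# Supplying length(x) draws gives ifelse() a separate candidate per position
many_draws <- ifelse(x == 0, sample(1:10, length(x), replace = TRUE), x)

# Both zeros share the same draw in the first version
length(unique(one_draw[x == 0]))
```

The second version draws more values than it needs (one per row rather than one per zero), so it illustrates the recycling issue rather than the fastest fix.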

It can be achieved with rowwise(), but that is slow on a large dataset:

set.seed(1)
#use rowwise
df %>% 
  rowwise() %>% 
  mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
#        x    x2
#    <dbl> <dbl>
#  1     1     1
#  2     2     2
#  3     3     3
#  4     0     9
#  5     5     5
#  6     0     4
#  7     7     7
#  8     8     8
#  9     9     9
# 10    10    10

Any suggestions to speed this up?

Thanks

Answer 1

Not in tidyverse, but you could just do something like this:

is_zero <- (df$x == 0)
# replace = TRUE is needed as soon as there are more than 10 zeros to fill
replacements <- sample(1:10, sum(is_zero), replace = TRUE)

df$x[is_zero] <- replacements

Of course, you can collapse that into a single line if you'd like.

df$x[df$x == 0] <- sample(1:10, sum(df$x == 0), replace = TRUE)
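
The same logic can be wrapped in a small function. This is only a sketch: `fill_zeros()` and its `pool` argument are hypothetical names, and it assumes `x` contains no `NA` values. It also shows why sampling with replacement matters once there are more zeros than candidate values:

```r
# fill_zeros() is a hypothetical helper, not part of any package
fill_zeros <- function(x, pool = 1:10) {
  is_zero <- x == 0
  # replace = TRUE lets the number of draws exceed length(pool)
  x[is_zero] <- sample(pool, sum(is_zero), replace = TRUE)
  x
}

set.seed(1)
x <- c(1:5, rep(0, 20))   # 20 zeros, more than the 10 available values
filled <- fill_zeros(x)

# Without replacement, asking for 20 draws from 1:10 is an error
no_replace <- tryCatch(sample(1:10, 20), error = function(e) "error")
```

On a data frame this would be used as `df$x <- fill_zeros(df$x)`.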

Answer 2

Benchmarking the solutions above with microbenchmark, using a slightly modified, larger dataset for setup:

library(data.table)
library(tidyverse)
df <- data.frame(x = 1:100000, y = rbinom(100000, size = 1, 0.5)) %>% 
  mutate(x = ifelse(y == 0, 0, x)) %>% 
  dplyr::select(-y)
dt <- setDT(df)


test <- microbenchmark::microbenchmark(
  base1 = {
    df$x[df$x == 0] <- sample(1:10, sum(df$x == 0), replace = T)
  },
  dplyr1 = {
     df %>% 
      mutate(x2 = replace(x, which(x == 0), sample(1:10, sum(x == 0), replace = T)))
  },
  dplyr2 = {
    df %>% group_by(id=row_number()) %>%
      mutate(across(c(x),.fns = list(x2 = ~ ifelse(.==0, sample(1:10, 1, replace = T), .)) )) %>%
      ungroup() %>% select(-id)
  },
  data.table = {
    dt[x == 0, x := sample(1:10, .N, replace = T)]
  },
  times = 500L
)
test
# Unit: microseconds
#        expr        min         lq          mean      median         uq        max neval cld
#       base1      733.7      785.9      979.0938      897.25     1137.0     1839.4   500  a 
#      dplyr1     5207.1     5542.1     6129.2276     5967.85     6476.0    21790.7   500  a 
#      dplyr2 15963406.4 16156889.2 16367969.8704 16395715.00 16518252.9 19276215.5   500  b
#  data.table     1547.4     2229.3     2422.1278     2455.60     2573.7    15076.0   500  a

I thought data.table would be quickest, but the base solution seems best (assuming I've set up the microbenchmark correctly?).

EDIT, based on @chinsoon12's comment:

1e5 rows:

Unit: microseconds
       expr    min      lq     mean  median      uq     max neval cld
      base1  730.4  839.30 1380.465 1238.00 1322.85 28977.3   500  a 
 data.table 1394.8 1831.85 2030.215 1946.95 2060.40 29821.9   500  b

1e6 rows:

Unit: milliseconds
       expr    min      lq      mean   median       uq      max neval cld
      base1 9.8703 11.6596 16.030715 11.76195 12.04145 326.0118   500  b
 data.table 2.3772  2.7939  3.855672  3.04700  3.25900  61.4083   500  a

At 1e6 rows, data.table is the quickest.
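
One caveat, offered as an assumption rather than a verified diagnosis of the timings above: the base and data.table expressions overwrite the zeros in the object they operate on, so repetitions after the first may find nothing left to replace. A base R sketch of the effect on a plain vector (`x_orig` and the `zeros_*` names are illustrative only):

```r
set.seed(1)
x_orig <- 1:100000
x_orig[rbinom(100000, size = 1, prob = 0.5) == 0] <- 0L

x <- x_orig
zeros_before <- sum(x == 0)

# First "iteration": every zero is overwritten in place
x[x == 0] <- sample(1:10, sum(x == 0), replace = TRUE)
zeros_after <- sum(x == 0)   # no zeros remain for a second run

# Resetting from the untouched copy restores the work for the next run
x <- x_orig
zeros_reset <- sum(x == 0)
```

If this applies, restoring the data from a saved copy before each timed run should give a fairer comparison, at the cost of the copy itself.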

Answer 3

Here is a data.table option using similar logic to Adam's answer. It filters for the rows that meet your criterion, x == 0, and then samples from 1:10 .N times (without a grouping variable, .N is the number of rows in the filtered data.table). Note that with more than 10 zeros, sample() needs replace = TRUE.

library(data.table)

set.seed(1)

setDT(df)[x == 0, x := sample(1:10, .N)]
df
     x
 1:  1
 2:  2
 3:  3
 4:  9
 5:  5
 6:  4
 7:  7
 8:  8
 9:  9
10: 10

Answer 4

You could also try across() from dplyr, grouping by row number:

library(tidyverse)
#Data
df <- data.frame(x = 1:10)
df[4, 1] = 0
df[6, 1] = 0
#Code
df %>% group_by(id=row_number()) %>%
  mutate(across(c(x),.fns = list(x2 = ~ ifelse(.==0, sample(1:10, 1), .)) )) %>%
  ungroup() %>% select(-id)

Output:

# A tibble: 10 x 2
       x  x_x2
   <dbl> <dbl>
 1     1     1
 2     2     2
 3     3     3
 4     0     5
 5     5     5
 6     0     6
 7     7     7
 8     8     8
 9     9     9
10    10    10

Answer 5

I am adding a separate answer because there are already votes on the base option I provided. Here is a dplyr approach using replace().

library(dplyr)

df %>% 
  # replace = TRUE is needed as soon as there are more than 10 zeros to fill
  mutate(x2 = replace(x, which(x == 0), sample(1:10, sum(x == 0), replace = TRUE)))
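
For reference, `replace()` is a base R function, so its behaviour can be checked on a plain vector without dplyr. A minimal sketch (the names `idx` and `x2` are illustrative):

```r
x <- c(1, 2, 3, 0, 5, 0, 7, 8, 9, 10)
idx <- which(x == 0)   # positions of the zeros: 4 and 6

set.seed(1)
# replace() returns a modified copy; x itself is left unchanged
x2 <- replace(x, idx, sample(1:10, length(idx), replace = TRUE))
```

Because `replace()` returns a copy, it fits naturally inside `mutate()`, which assigns the result to a new column.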

5 answers

Answer 1
2 (accepted) 2020-10-12 16:23:10

Answer 2
2 2020-10-12 19:54:39

Answer 3
1 2020-10-12 16:30:31

Answer 4
1 2020-10-12 16:30:55

Answer 5
1 2020-10-12 16:31:54
