dplyr mutate/replace several columns on a subset of rows

I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I commonly run into the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution:

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to have to type the condition multiple times. This is a simplified example, but there are sometimes many assignments based on a single condition.

Thanks in advance for the help!

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a) mutate_cond: Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
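To make the behaviour concrete, here is a minimal sketch that runs mutate_cond on a tiny data frame. It assumes dplyr is installed; `toy` is a hypothetical four-row stand-in for DF, not data from the question.

```r
library(dplyr)

# mutate_cond as defined above, repeated so the sketch is self-contained
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical toy data (an assumption for illustration only)
toy <- data.frame(measure = c("cfl", "exit", "led", "exit"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  delta.watts = c(20.5, 30.5, 40.5, 50.5),
                  cf = c(0.1, 0.2, 0.3, 0.4),
                  stringsAsFactors = FALSE)

# Only rows 2 and 4 (measure == "exit") are updated; toy itself is not modified
res <- toy %>%
  mutate_cond(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)
```

The non-exit rows keep their original qty.exit, cf and delta.watts values, and the condition is typed only once.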

1b) mutate_last: This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE, so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

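mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  # Note: relies on the 0-based "indices" attribute that grouped data frames had in dplyr < 0.8;
  # in dplyr >= 0.8, group_rows(.data)[[n]] returns the 1-based row indices directly.
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}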


DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)

2) factor out condition: Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

3) sqldf: We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them; this may represent a bug in dplyr, see dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update, but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

4) row_case_when: Also check out row_case_when, defined in "Returning a tibble: how to vectorize with case_when?". It uses a syntax similar to case_when but applies to rows.

library(dplyr)

DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF:

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573, with 631 being the main thread and 1573 being a review of the answers here.

You can do this with magrittr's two-way pipe %<>% (the assignment pipe):

library(dplyr)
library(magrittr)

dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                    cf = 0,  
                                    delta.watts = 13)

This reduces the amount of typing, but is still much slower than data.table.

Here's a solution I like:

mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

It lets you write things like, e.g.,

mtcars %>% mutate_when(
  mpg > 22,    list(cyl = 100),
  disp == 160, list(cyl = 200)
)

which is quite readable, although it may not be as performant as it could be.
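As a rough illustration of how mutate_when would handle the case from the question, the sketch below applies it to a hypothetical toy data frame (`toy` is an assumption, not data from the thread). It uses base R only and calls the function directly rather than through a pipe. The second condition/list pair shows that pairs are applied in order, so a later condition sees the values written by an earlier one.

```r
# mutate_when as defined above, repeated so the sketch is self-contained
mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

# Hypothetical toy data (an assumption for illustration only)
toy <- data.frame(measure = c("cfl", "exit", "led", "exit"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  delta.watts = c(20.5, 30.5, 40.5, 50.5),
                  cf = c(0.1, 0.2, 0.3, 0.4),
                  stringsAsFactors = FALSE)

res <- mutate_when(
  toy,
  # Pair 1: update three columns on the "exit" rows
  measure == "exit", list(qty.exit = qty, cf = 0, delta.watts = 13),
  # Pair 2: evaluated after pair 1, so it sees the updated qty.exit
  qty.exit > 10,     list(cf = 1)
)
```

Inside each list, column names such as qty refer to the already-subsetted rows, so there is no need to repeat the condition on the right-hand side.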

As eipi10 shows above, there's not a simple way to do a subset replacement in dplyr because DT uses pass-by-reference semantics vs dplyr using pass-by-value. dplyr requires the use of ifelse() on the whole vector, whereas DT will do the subset and update by reference (returning the whole DT). So, for this exercise, DT will be substantially faster.

You could alternatively subset first, then update, and finally recombine:

dt.sub <- dt[dt$measure == "exit",] %>%
  mutate(qty.exit= qty, cf= 0, delta.watts= 13)

dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])

But DT is gonna be substantially faster (edited to use eipi10's new answer):

library(data.table)
library(dplyr)
library(magrittr)  # provides %<>%, used in the eipi10 expression
library(microbenchmark)
microbenchmark(dt= {dt <- dt[measure == 'exit', 
                            `:=`(qty.exit = qty,
                                 cf = 0,
                                 delta.watts = 13)]},
               eipi10= {dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                cf = 0,  
                                delta.watts = 13)},
               alex= {dt.sub <- dt[dt$measure == "exit",] %>%
                 mutate(qty.exit= qty, cf= 0, delta.watts= 13)

               dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])})


Unit: microseconds
expr      min        lq      mean   median       uq      max neval cld
     dt  591.480  672.2565  747.0771  743.341  780.973 1837.539   100  a 
 eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509   100   b
   alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427   100   b

I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it might come in handy to also handle new variables. So, below has two additions:

Unrelated: the second-to-last line is made a bit more dplyr by using filter().

Three new lines at the beginning get the variable names for use in mutate() and initialize any new variables in the data frame before mutate() occurs. New variables are initialized for the remainder of the data.frame using new_init, which is set to missing (NA) as a default.

mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}

Here are some examples using the iris data:

Change Petal.Length to 88 where Species == "setosa". This will work in the original function as well as this new version.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)

Same as above, but also create a new variable x (NA in rows not included in the condition). Not possible before.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)

Same as above, but rows not included in the condition for x are set to FALSE.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)

This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created, with excluded rows being initialized using different values (x initialised as FALSE, y as NA).

iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
                  x = TRUE, y = Sepal.Length ^ 2,
                  new_init = list(FALSE, NA))

One concise solution would be to do the mutation on the filtered subset and then add back the non-exit rows of the table:

library(dplyr)

dt %>% 
    filter(measure == 'exit') %>%
    mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
    rbind(dt %>% filter(measure != 'exit'))

mutate_cond is a great function, but it gives an error if there is an NA in the column(s) used to create the condition. I feel that a conditional mutate should simply leave such rows alone. This matches the behavior of filter(), which returns rows when the condition is TRUE, but omits rows where it is FALSE or NA.

With this small change the function works like a charm:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
    condition <- eval(substitute(condition), .data, envir)
    condition[is.na(condition)] = FALSE
    .data[condition, ] <- .data[condition, ] %>% mutate(...)
    .data
}
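The effect of the NA guard can be checked with a small sketch. It assumes dplyr is installed; `toy` is a hypothetical data frame with a missing measure value, made up for illustration.

```r
library(dplyr)

# NA-tolerant mutate_cond as defined above, repeated so the sketch is self-contained
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  condition[is.na(condition)] <- FALSE  # treat NA like FALSE, as filter() does
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical toy data with a missing measure in row 3
toy <- data.frame(measure = c("cfl", "exit", NA, "exit"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  cf = c(0.1, 0.2, 0.3, 0.4),
                  stringsAsFactors = FALSE)

# Row 3 has an NA condition and is left untouched
res <- toy %>% mutate_cond(measure == "exit", qty.exit = qty, cf = 0)
```

Row 3 keeps its original qty.exit and cf, just as filter() would have skipped it.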

I don't actually see any changes to dplyr that would make this much easier. case_when is great for when there are multiple different conditions and outcomes for one column, but it doesn't help for this case where you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column but doesn't help with doing so in multiple columns at once. Finally, mutate_at etc. only apply conditions to the column names, not the rows in the data frame. You could potentially write a function for mutate_at that would do it, but I can't figure out how you would make it behave differently for different columns.

That said, here is how I would approach it using nest from tidyr and map from purrr.

library(data.table)
library(dplyr)
library(tidyr)
library(purrr)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

dt2 <- dt %>% 
  nest(-measure) %>% 
  mutate(data = if_else(
    measure == "exit", 
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()

With the creation of rlang, a slightly modified version of Grothendieck's 1a example is possible, eliminating the need for the envir argument, as enquo() captures the environment that .p is created in automatically.

mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}

dt %>% mutate_rows(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)

You could split the dataset and do a regular mutate call on the TRUE part.

dplyr 0.8 features the function group_split, which splits by groups (and groups can be defined directly in the call), so we'll use it here, but base::split works as well.

library(tidyverse)
df1 %>%
  group_split(measure == "exit", keep=FALSE) %>% # or `split(.$measure == "exit")`
  modify_at(2,~mutate(.,qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()

#    site space measure qty qty.exit delta.watts          cf
# 1     1     4     led   1        0        73.5 0.246240409
# 2     2     3     cfl  25        0        56.5 0.360315879
# 3     5     4     cfl   3        0        38.5 0.279966850
# 4     5     3  linear  19        0        40.5 0.281439486
# 5     2     3  linear  18        0        82.5 0.007898384
# 6     5     1  linear  29        0        33.5 0.392412729
# 7     5     3  linear   6        0        46.5 0.970848817
# 8     4     1     led  10        0        89.5 0.404447182
# 9     4     1     led  18        0        96.5 0.115594622
# 10    6     3  linear  18        0        15.5 0.017919745
# 11    4     3     led  22        0        54.5 0.901829577
# 12    3     3     led  17        0        79.5 0.063949974
# 13    1     3     led  16        0        86.5 0.551321441
# 14    6     4     cfl   5        0        65.5 0.256845013
# 15    4     2     led  12        0        29.5 0.340603733
# 16    5     3  linear  27        0        63.5 0.895166931
# 17    1     4     led   0        0        47.5 0.173088800
# 18    5     3  linear  20        0        89.5 0.438504370
# 19    2     4     cfl  18        0        45.5 0.031725246
# 20    2     3     led  24        0        94.5 0.456653397
# 21    3     3     cfl  24        0        73.5 0.161274319
# 22    5     3     led   9        0        62.5 0.252212124
# 23    5     1     led  15        0        40.5 0.115608182
# 24    3     3     cfl   3        0        89.5 0.066147321
# 25    6     4     cfl   2        0        35.5 0.007888337
# 26    5     1  linear   7        0        51.5 0.835458916
# 27    2     3  linear  28        0        36.5 0.691483644
# 28    5     4     led   6        0        43.5 0.604847889
# 29    6     1  linear  12        0        59.5 0.918838163
# 30    3     3  linear   7        0        73.5 0.471644760
# 31    4     2     led   5        0        34.5 0.972078100
# 32    1     3     cfl  17        0        80.5 0.457241602
# 33    5     4  linear   3        0        16.5 0.492500255
# 34    3     2     cfl  12        0        44.5 0.804236607
# 35    2     2     cfl  21        0        50.5 0.845094268
# 36    3     2  linear  10        0        23.5 0.637194873
# 37    4     3     led   6        0        69.5 0.161431896
# 38    3     2    exit  19       19        13.0 0.000000000
# 39    6     3    exit   7        7        13.0 0.000000000
# 40    6     2    exit  20       20        13.0 0.000000000
# 41    3     2    exit   1        1        13.0 0.000000000
# 42    2     4    exit  19       19        13.0 0.000000000
# 43    3     1    exit  24       24        13.0 0.000000000
# 44    3     3    exit  16       16        13.0 0.000000000
# 45    5     3    exit   9        9        13.0 0.000000000
# 46    2     3    exit   6        6        13.0 0.000000000
# 47    4     1    exit   1        1        13.0 0.000000000
# 48    1     1    exit  14       14        13.0 0.000000000
# 49    6     3    exit   7        7        13.0 0.000000000
# 50    2     4    exit   3        3        13.0 0.000000000

If row order matters, use tibble::rowid_to_column first, then dplyr::arrange on rowid and select it out in the end.

data
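The row-order variant could look roughly like the sketch below. It assumes dplyr and tibble are installed, uses base::split in place of group_split, and runs on a hypothetical toy data frame (`toy`, `with_id` and `parts` are names invented for illustration). It also assumes at least one row satisfies the condition; otherwise the "TRUE" element would not exist.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (an assumption for illustration only)
toy <- data.frame(measure = c("exit", "cfl", "exit", "led"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  cf = c(0.1, 0.2, 0.3, 0.4),
                  stringsAsFactors = FALSE)

# 1. Record the original row positions
with_id <- rowid_to_column(toy, "rowid")

# 2. Split on the condition; the list has elements named "FALSE" and "TRUE"
parts <- split(with_id, with_id$measure == "exit")

# 3. Mutate only the TRUE part
parts[["TRUE"]] <- mutate(parts[["TRUE"]], qty.exit = qty, cf = 0)

# 4. Recombine, restore the original order, and drop the helper column
res <- bind_rows(parts) %>%
  arrange(rowid) %>%
  select(-rowid)
```

After arrange(), the rows are back in their original positions, with only the exit rows changed.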

df1 <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50),
                 stringsAsFactors = F)

I think this answer has not been mentioned before. It runs almost as fast as the 'default' data.table solution.

Use base::replace()

df %>% mutate( qty.exit = replace( qty.exit, measure == 'exit', qty[ measure == 'exit'] ),
               cf = replace( cf, measure == 'exit', 0 ),
               delta.watts = replace( delta.watts, measure == 'exit', 13 ) )

replace fills the selected positions starting from the beginning of the replacement vector (recycling it if it is too short), so when you want the values of column qty entered into column qty.exit, you have to subset qty as well, hence the qty[ measure == 'exit'] in the first replacement.

Now, you will probably not want to retype the measure == 'exit' all the time, so you can create an index vector containing that selection and use it in the functions above.

#build an index-vector matching the condition
index.v <- which( df$measure == 'exit' )

df %>% mutate( qty.exit = replace( qty.exit, index.v, qty[ index.v] ),
               cf = replace( cf, index.v, 0 ),
               delta.watts = replace( delta.watts, index.v, 13 ) )
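The need to subset qty can be seen in a small base R sketch. The vectors here are hypothetical toy values, not the data from the question.

```r
# Hypothetical toy vectors (assumptions for illustration only)
qty      <- c(5, 7, 9, 11)
qty.exit <- c(0, 0, 0, 0)
is.exit  <- c(FALSE, TRUE, FALSE, TRUE)

# Without subsetting, the selected positions are filled from the START of qty
# (R also warns about the length mismatch), so rows 2 and 4 get 5 and 7
wrong <- suppressWarnings(replace(qty.exit, is.exit, qty))

# With subsetting, rows 2 and 4 get their own qty values, 7 and 11
right <- replace(qty.exit, is.exit, qty[is.exit])
```

Here `wrong` is c(0, 5, 0, 7) while `right` is c(0, 7, 0, 11), which is why the first replacement in the answer indexes qty with the same condition.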

benchmarks

# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
# data.table   1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
# wimpel       1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100

At the expense of breaking with the usual dplyr syntax, you can use within from base:

dt %>% within({  # braces needed: within() evaluates a single expression
  qty.exit[measure == 'exit'] <- qty[measure == 'exit']
  delta.watts[measure == 'exit'] <- 13
})

It seems to integrate well with the pipe, and you can do pretty much anything you want inside it.
