Conditional rowwise sum of subset of columns in specific row dplyr

My problem is a bit tricky: I'm working on data editing and I'm close to finding the right solution. I have a dataframe like this:

ID   name   var1  var2  var3 var_total
1     a       1     1    2       4
2     b       2     3    2       7
3     c       1    -1   -1       1

Where var_total is the sum, from var1 to var3, of each number that is higher than zero. Say that for ID == 2 I needed to change var2 to -1, doing this:

 df %>% mutate(var2 = if_else(ID == 2, -1, var2))

Which gives this:

ID   name   var1  var2  var3 var_total
1     a       1     1    2       4
2     b       2    -1    2       7
3     c       1    -1   -1       1

The problem is, I need to find a way to automatically re-calculate var_total for that row. I know how to do it for the whole dataframe, but that's a bit slow:

df %>%
  rowwise() %>%
  mutate(var_total = {
    x <- c_across(starts_with('var'))
    sum(x[x > 0])
    })

Is there any way to perform this operation only on the selected ID? In this case, my final dataframe would be:

ID   name   var1  var2  var3 var_total
1     a       1     1    2       4
2     b       2    -1    2       4
3     c       1    -1   -1       1

Thanks!

If you want to efficiently update a single row (or small subset of rows) I would use direct assignment, not dplyr.

# Columns var1, var2, ... (the anchored pattern excludes var_total)
var_cols = grep("^var[0-9]+$", names(df), value = TRUE)
recalc_id = 2
rows = df$ID %in% recalc_id
df[rows, "var_total"] = apply(df[rows, var_cols], 1, \(x) sum(x[x > 0]))

As akrun points out in comments, if it is just a single row, the apply can be skipped:

i = which(df$ID == recalc_id)
row = unlist(df[i, var_cols])
df$var_total[i] = sum(row[row > 0])
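
To check the direct-assignment approach end to end, here is a minimal sketch on the sample data from the question. The data frame construction (with all var columns assumed numeric) and the helper name recalc_var_total are assumptions for illustration, not part of the original code.

```r
# Sample data from the question (column types are assumed to be numeric)
df <- data.frame(
  ID = 1:3,
  name = c("a", "b", "c"),
  var1 = c(1, 2, 1),
  var2 = c(1, 3, -1),
  var3 = c(2, 2, -1),
  var_total = c(4, 7, 1)
)

# Hypothetical helper: recompute var_total only for the rows with the given IDs
recalc_var_total <- function(df, ids) {
  var_cols <- grep("^var[0-9]+$", names(df), value = TRUE)
  for (i in which(df$ID %in% ids)) {
    x <- unlist(df[i, var_cols])
    df$var_total[i] <- sum(x[x > 0])  # sum of the positive values only
  }
  df
}

df$var2[df$ID == 2] <- -1            # the edit from the question
df <- recalc_var_total(df, ids = 2)  # only the row with ID 2 is touched
print(df$var_total)                  # [1] 4 4 1
```

Only the rows listed in ids are read and rewritten, so the cost depends on the number of edited rows rather than on the size of the data frame.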

Here's the same thing with dplyr::case_when, for a dplyr solution:

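df = df %>%
  rowwise() %>%
  mutate(var_total = case_when(
      ID %in% 2 ~ {
        # matches() takes a regex; starts_with() only matches a literal prefix
        x <- c_across(matches('^var[0-9]+$'))
        sum(x[x > 0])
      },
      TRUE ~ var_total
    )
  ) %>%
  ungroup()  # drop the rowwise grouping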

(Note that in both cases we need to change the column name pattern so that var_total is not included in the sum.)

rowwise breaks some vectorization and slows things down, so if you are so concerned about efficiency that recalculating the sum is "too slow", I'd strongly recommend the base solution. You might even find a non-conditional base solution, one that recalculates every row, to be plenty fast enough for this row-wise operation.
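
As a sketch of what such a non-conditional base solution might look like, the positive values can be summed for every row at once with matrix arithmetic, avoiding both apply and rowwise. The sample data frame below is an assumption taken from the question's table (after var2 was edited for ID 2), with numeric var columns.

```r
# Sample data after the edit to var2; var_total for ID 2 is stale (7 instead of 4)
df <- data.frame(
  ID = 1:3,
  name = c("a", "b", "c"),
  var1 = c(1, 2, 1),
  var2 = c(1, -1, -1),
  var3 = c(2, 2, -1),
  var_total = c(4, 7, 1)
)

var_cols <- grep("^var[0-9]+$", names(df), value = TRUE)
m <- as.matrix(df[var_cols])

# (m > 0) is a logical matrix; multiplying by it zeroes out non-positive entries,
# so rowSums() adds up only the positive values in each row
df$var_total <- rowSums(m * (m > 0))
print(df$var_total)  # [1] 4 4 1
```

This recomputes every row, but it is fully vectorized, so it may well be fast enough to make the conditional update unnecessary. If the columns can contain NA, rowSums(..., na.rm = TRUE) would be one way to ignore them.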
