
Conditional rowwise sum of subset of columns in specific row dplyr

My problem is a bit tricky: I'm editing data by hand and I'm close to a working solution. I have a dataframe like this:

ID   name   var1  var2  var3 var_total
1     a       1     1    2       4
2     b       2     3    2       7
3     c       1    -1   -1       1

Here var_total is the sum, across var1 to var3, of the values that are greater than zero. Say that for ID == 2 I need to change var2 to -1, which I do like this:

 df %>% mutate(var2 = if_else(ID == 2, -1, var2))

Which brings this:

ID   name   var1  var2  var3 var_total
1     a       1     1    2       4
2     b       2    -1    2       7
3     c       1    -1   -1       1

The problem is, I need to find a way to automatically re-calculate var_total for that row. I know how to do it for the whole dataframe, but that's a bit slow:

df %>%
  rowwise() %>%
  mutate(var_total = {
    x <- c_across(starts_with('var'))
    sum(x[x > 0])
    })

Is there any way to perform this operation only on the selected ID? In this case, my final dataframe would be:

ID   name   var1  var2  var3 var_total
1     a       1     1    2       4
2     b       2    -1    2       4
3     c       1    -1   -1       1

Thanks!

If you want to efficiently update a single row (or a small subset of rows), I would use direct assignment rather than dplyr.

# select var1, var2, ... but not var_total
var_cols = grep("^var[0-9]+$", names(df), value = TRUE)
recalc_id = 2
rows = df$ID %in% recalc_id
# \(x) is the anonymous-function shorthand, available from R 4.1
df[rows, "var_total"] = apply(df[rows, var_cols], 1, \(x) sum(x[x > 0]))

As akrun points out in comments, if it is just a single row, the apply can be skipped:

i = which(df$ID == recalc_id)
row = unlist(df[i, var_cols])
df$var_total[i] = sum(row[row > 0])
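To check the direct-assignment idea end to end, here is a minimal base R sketch. It assumes the data frame from the question, with values copied from the tables there. `recalc_total()` is a hypothetical helper name introduced only for this illustration, not a function from base R or dplyr. It swaps `apply` for `rowSums` on the selected rows, which should give the same totals as long as the columns contain no `NA`s.

```r
# Data frame rebuilt from the question's first table.
df <- data.frame(
  ID = 1:3,
  name = c("a", "b", "c"),
  var1 = c(1, 2, 1),
  var2 = c(1, 3, -1),
  var3 = c(2, 2, -1),
  var_total = c(4, 7, 1)
)

# Hypothetical helper: recompute var_total only for rows whose ID is in `ids`.
recalc_total <- function(df, ids, pattern = "^var[0-9]+$") {
  var_cols <- grep(pattern, names(df), value = TRUE)
  rows <- which(df$ID %in% ids)
  m <- as.matrix(df[rows, var_cols, drop = FALSE])
  m[m < 0] <- 0                      # non-positive values contribute nothing
  df$var_total[rows] <- rowSums(m)   # other rows are left untouched
  df
}

# Make the edit from the question, then refresh only that row.
df$var2[df$ID == 2] <- -1
df <- recalc_total(df, ids = 2)
print(df)
```

Passing a vector such as `ids = c(1, 3)` refreshes several rows in one call.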

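Here's the same thing with dplyr::case_when, for a dplyr solution. Note that starts_with() matches a literal prefix, not a regular expression, so the columns are selected with matches() instead:

df = df %>%
  rowwise() %>%
  mutate(var_total = case_when(
      ID %in% 2 ~ {
        # matches() takes a regex: selects var1, var2, var3 but not var_total
        x <- c_across(matches('^var[0-9]+$'))
        sum(x[x > 0])
      },
      TRUE ~ var_total
    )
  ) %>%
  ungroup()  # drop the rowwise grouping afterwards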

(Note that in both cases we need to change the column name pattern to not include var_total in the sum.)

rowwise breaks some vectorization and slows things down, so if you are concerned enough about efficiency that recalculating the whole column is "too slow", I'd strongly recommend the base solution. You might even find that a non-conditional base solution, one that recalculates every row, is fast enough for this row-wise operation.
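As a rough illustration of what a non-conditional base solution could look like, the sketch below recomputes var_total for every row at once with `rowSums`, with no `rowwise()` and no per-row function calls. The data frame is an assumption rebuilt from the question's tables (after var2 was changed for ID 2), and the approach assumes the var columns are numeric with no `NA`s.

```r
# Data frame from the question after var2 was set to -1 for ID 2;
# var_total for that row is still the stale value 7.
df <- data.frame(
  ID = 1:3,
  name = c("a", "b", "c"),
  var1 = c(1, 2, 1),
  var2 = c(1, -1, -1),
  var3 = c(2, 2, -1),
  var_total = c(4, 7, 1)
)

var_cols <- grep("^var[0-9]+$", names(df), value = TRUE)
m <- as.matrix(df[var_cols])

# (m > 0) is a logical matrix; multiplying zeroes out the non-positive entries,
# so rowSums adds up only the positive values in each row.
df$var_total <- rowSums(m * (m > 0))
print(df)
```

Whether this is fast enough depends on the size of the real data, so it is worth timing against the single-row update before choosing.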
