
Subtracting cell in one row from cell in another row when summarizing grouped data with dplyr?

Background: I have data from a simulation where I have a few variables and thus many resulting combinations of parameters. Due to the internal design of the simulation there can be a little variation among the outcomes of identical sets of parameters, so I run a number of identical runs, then calculate their min, max, and mean score. Then, I want to compare the treatment and no-treatment conditions:

  • calculate the mean of treatment minus no-treatment
  • calculate the difference of the min score of treatment minus max score of no-treatment
  • calculate the difference of the max score of treatment minus min score of no-treatment

This gives me the mean difference but also the bounds of the best- and worst-case comparison.

Example data:

library(tibble)  # provides tribble()

my_data <- tribble(
  ~params, ~treatment, ~mean_score, ~min_score,  ~max_score,
  "combo a", 0, 91,  90, 92,
  "combo a", 1, 92,  92, 92,
  "combo b", 0, 89,  87, 91,
  "combo b", 1, 92,  89, 92,
  "combo c", 0, 90,  90, 90,
  "combo c", 1, 89,  85, 93,
)

Blowing the dust off my R skills, my initial attempt is the following, but I do not know how to tell summarize which row should be subtracted from which within the grouping.

Code attempt I know doesn't work:

my_summ_data <- my_data %>%
  dplyr::group_by(params = as.factor(params)) %>%
  dplyr::summarize(hier_diff=diff(mean_score), 
                   min_max_diff=diff(c(min_score, max_score)),
                   max_min_diff=diff(c(max_score, min_score)) )

I would like to get

params hier_diff min_max_diff max_min_diff
combo a 1 0 2
combo b 3 -2 5
combo c -1 -5 3

but instead I get (btw I don't yet understand why I get these extra rows)

params hier_diff min_max_diff max_min_diff
combo a 1 2 0
combo a 1 0 -2
combo a 1 0 2
combo b 1 2 0
combo b 1 2 -4
combo b 1 0 2
combo c 2 -2 6
combo c 2 2 -6
combo c 2 6 -2

I'm not convinced there is a sensible way to do what I want using summarize. But if there is, I would like to know it, and if not, what is the next best alternative?

Please find below one possible solution.

Reprex

  • Code
library(dplyr)
library(tibble)


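my_summ_data <- my_data %>%
  dplyr::group_by(params) %>%
  # sort so the no-treatment row (treatment == 0) comes first within each group
  dplyr::arrange(treatment) %>%
  dplyr::summarize(hier_diff = diff(mean_score),
                   # index 1 = no-treatment row, index 2 = treatment row
                   min_max_diff = diff(c(max_score[1], min_score[2])),
                   max_min_diff = diff(c(min_score[1], max_score[2])))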

  • Output
my_summ_data
#> # A tibble: 3 x 4
#>   params  hier_diff min_max_diff max_min_diff
#>   <chr>       <dbl>        <dbl>        <dbl>
#> 1 combo a         1            0            2
#> 2 combo b         3           -2            5
#> 3 combo c        -1           -5            3

Created on 2022-02-14 by the reprex package (v2.0.1)
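
As for the extra rows in the question's attempt: diff() returns one value fewer than the length of its input, and c(min_score, max_score) has four elements per group (two rows times two columns). The attempt therefore yields three values per group, and the dplyr version that produced the output above evidently returned one row for each of them. A minimal base-R sketch using the "combo a" scores from the example data (the variable names combined, extra and single are only for illustration):

```r
# Scores for "combo a" from the example data (treatment 0 first, then 1)
min_score <- c(90, 92)
max_score <- c(92, 92)

# c() concatenates both columns into a single vector of length 4
combined <- c(min_score, max_score)   # 90 92 92 92

# diff() gives the lagged differences, so 4 inputs produce 3 outputs
extra <- diff(combined)
print(extra)    # 2 0 0, matching the three "combo a" rows in the question

# Selecting one element from each column gives a single value per group
single <- min_score[2] - max_score[1]
print(single)   # 0, the desired min_max_diff for "combo a"
```

This is why the solution above indexes each column with [1] or [2] before calling diff().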

my_data %>%
  dplyr::group_by(params = as.factor(params)) %>%
  dplyr::summarize(
    hier_diff    = mean_score[treatment == 1] - mean_score[treatment == 0],
    min_max_diff = min_score[treatment == 1]  - max_score[treatment == 0],  # lower bound of the difference
    max_min_diff = max_score[treatment == 1]  - min_score[treatment == 0]   # upper bound of the difference
  )

Result

# A tibble: 3 x 4
  params  hier_diff min_max_diff max_min_diff
  <fct>       <dbl>        <dbl>        <dbl>
1 combo a         1            0            2
2 combo b         3           -2            5
3 combo c        -1           -5            3
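
Since the question also asks for an alternative to summarize, one further option, not proposed in either answer here, is to reshape the data so that each parameter combination becomes a single row and then subtract columns directly. The sketch below assumes tidyr is available and that every combination has exactly one row per treatment value; the name wide_summ is only for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

my_data <- tribble(
  ~params, ~treatment, ~mean_score, ~min_score, ~max_score,
  "combo a", 0, 91, 90, 92,
  "combo a", 1, 92, 92, 92,
  "combo b", 0, 89, 87, 91,
  "combo b", 1, 92, 89, 92,
  "combo c", 0, 90, 90, 90,
  "combo c", 1, 89, 85, 93
)

wide_summ <- my_data %>%
  # one row per params; columns become mean_score_0, mean_score_1, etc.
  pivot_wider(
    names_from  = treatment,
    values_from = c(mean_score, min_score, max_score)
  ) %>%
  # keep params and compute treatment (suffix _1) minus no-treatment (suffix _0)
  transmute(
    params,
    hier_diff    = mean_score_1 - mean_score_0,
    min_max_diff = min_score_1 - max_score_0,
    max_min_diff = max_score_1 - min_score_0
  )

print(wide_summ)
```

The wide layout makes it explicit which value is subtracted from which, at the cost of an extra reshaping step.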

Note that the answer is the same even if the treatment rows appear before the no-treatment rows, e.g.:

my_data <- tribble(
  ~params, ~treatment, ~mean_score, ~min_score,  ~max_score,
  "combo a", 1, 92,  92, 92,  # swapped rows 1+2, 3+4, 5+6
  "combo a", 0, 91,  90, 92,
  "combo b", 1, 92,  89, 92,
  "combo b", 0, 89,  87, 91,
  "combo c", 1, 89,  85, 93,
  "combo c", 0, 90,  90, 90,
)
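
To check this, the same summary can be re-run on the swapped data. A small self-contained sketch (the names swapped_data and swapped_summ are only for illustration); because the rows are selected by the value of treatment rather than by position, the row order within each group should not matter:

```r
library(dplyr)
library(tibble)

# Same values as the example data, with the treatment row listed first
swapped_data <- tribble(
  ~params, ~treatment, ~mean_score, ~min_score, ~max_score,
  "combo a", 1, 92, 92, 92,
  "combo a", 0, 91, 90, 92,
  "combo b", 1, 92, 89, 92,
  "combo b", 0, 89, 87, 91,
  "combo c", 1, 89, 85, 93,
  "combo c", 0, 90, 90, 90
)

swapped_summ <- swapped_data %>%
  group_by(params = as.factor(params)) %>%
  summarize(
    # logical indexing picks the row by treatment value, not by position
    hier_diff    = mean_score[treatment == 1] - mean_score[treatment == 0],
    min_max_diff = min_score[treatment == 1] - max_score[treatment == 0],
    max_min_diff = max_score[treatment == 1] - min_score[treatment == 0]
  )

print(swapped_summ)
```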
