More efficient way to compute mean for subset

In this dataframe:

df <- data.frame(
  comp = c("pre",rep("story",4), rep("x",2), rep("story",3)),
  hbr = c(101:110)
)

Say I need to compute the mean of hbr over just the first stretch where comp == "story". How can I do that more efficiently than the approach below, which seems bulky and long-winded and requires me to specify manually which grp to compute the mean for?

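library(dplyr)
library(data.table)  # provides rleid()
df %>%
  mutate(grp = rleid(comp)) %>%       # number each run of identical comp values
  summarise(M = mean(hbr[grp == 2]))  # run 2 is the first "story" stretch
#>       M
#> 1 103.5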

I'm not sure if this is any better, but at least you only need to specify that you want the first run of 'story':

df %>%
  mutate(grp = ifelse(comp == 'story', rleid(comp), NA)) %>%
  filter(grp == min(grp, na.rm = TRUE)) %>%
  summarise(M = mean(hbr))
#>       M
#> 1 103.5

In base R, you can find the desired rows with cumsum and diff: take the indices where comp == "story", start a new group wherever consecutive indices differ by more than 1, and keep the group you need (here the first, so 1). Then compute the mean over those rows. This way you don't have to look up the group number manually, and you don't need any additional packages.

idx <- which(df$comp == "story")
first <- idx[cumsum(c(1, diff(idx) != 1)) == 1]
first
#[1] 2 3 4 5

mean(df$hbr[first])
#[1] 103.5
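
If you need this more than once, the same idea can be wrapped in a small function. The following is only a sketch, not something from the answers above: it swaps cumsum/diff for base R's rle(), and the name mean_of_run and its arguments (key, value, target, n) are made up for illustration. It returns NA when the requested run does not exist.

```r
# Hypothetical helper: mean of `value` over the n-th run where `key == target`.
mean_of_run <- function(key, value, target, n = 1) {
  runs   <- rle(as.character(key))        # lengths and values of consecutive runs
  ends   <- cumsum(runs$lengths)          # last row index of each run
  starts <- ends - runs$lengths + 1       # first row index of each run
  hit    <- which(runs$values == target)  # positions of runs equal to target
  if (length(hit) < n) return(NA_real_)   # requested run does not exist
  i <- hit[n]
  mean(value[starts[i]:ends[i]])
}

df <- data.frame(
  comp = c("pre", rep("story", 4), rep("x", 2), rep("story", 3)),
  hbr  = 101:110
)

mean_of_run(df$comp, df$hbr, "story")         # first run of "story"
#[1] 103.5
mean_of_run(df$comp, df$hbr, "story", n = 2)  # second run of "story"
#[1] 109
```

Picking the run by position (n) rather than by its rleid number means you never have to inspect the grouping column by hand.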
