
How to efficiently calculate the sum and relative count frequency?

I am looking for a one-line dplyr solution that filters a data.frame on a score and then calculates the sum of the remaining counts together with their relative frequency.

The classic R solution would be the following:

daf <- data.frame(score = 1:40, count = sample(1:1000, 40, replace = TRUE))
select <- daf$score >= 30
res <- c(sum(daf$count[select]), sum(daf$count[select]) / sum(daf$count))

My first dplyr attempt duplicates the result rows (one copy per row that survives the filter):

library(dplyr)

daf %>% 
  mutate(total = sum(count)) %>% 
  filter(score >= 30) %>% 
  summarise(
    sum_count = sum(count),
    sum_rel = sum(count) / total
  )
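To make the duplication concrete, here is a minimal sketch of what that pipeline returns. The `set.seed(1)` call is added here purely for reproducibility and is not part of the original snippet; the key point is that `total` is still a full-length column after `filter()`, so `sum(count) / total` is a vector and `summarise()` emits one row per surviving input row (recent dplyr versions warn about this):

```r
library(dplyr)

set.seed(1)  # hypothetical seed, only so the example is reproducible
daf <- data.frame(score = 1:40, count = sample(1:1000, 40, replace = TRUE))

res <- daf %>%
  mutate(total = sum(count)) %>%   # total is a constant column, length nrow(daf)
  filter(score >= 30) %>%          # 11 rows remain (scores 30..40)
  summarise(sum_count = sum(count),
            sum_rel   = sum(count) / total)  # length-11 result -> 11 identical rows

nrow(res)  # 11, not the single summary row we wanted
```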

Any better ideas for the dplyr code?

Assuming the total column will always be the same value, use max(total) instead of total in summarise:

daf %>% 
  mutate(total = sum(count)) %>%
  filter(score >= 30) %>%
  summarise(sum_count = sum(count),
            sum_rel = sum(count) / max(total))

  sum_count   sum_rel
1      5535 0.2621608

Edit: As per @GregorThomas's comment, using first() instead of max() would be more efficient for large data sets.

daf %>% 
  mutate(total = sum(count)) %>%
  filter(score >= 30) %>%
  summarise(sum_count = sum(count),
            sum_rel = sum(count) / first(total))

Another option avoids the mutate() and filter() steps entirely by subsetting inside a single summarise() call:

library(dplyr)
daf %>%
  summarize(
    sum_count = sum(count[score >= 30]), 
    sum_rel = sum_count / sum(count)
  )
#   sum_count  sum_rel
# 1      5661 0.301053
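As a quick sanity check, the single-summarise version gives the same numbers as the base-R snippet from the question. The `set.seed(42)` call is an assumption added here so both computations see identical data (the original snippets draw an unseeded sample):

```r
library(dplyr)

set.seed(42)  # hypothetical seed so both versions use the same data
daf <- data.frame(score = 1:40, count = sample(1:1000, 40, replace = TRUE))

# Base-R version from the question
select <- daf$score >= 30
base_res <- c(sum(daf$count[select]), sum(daf$count[select]) / sum(daf$count))

# dplyr version: filter inside summarise, reuse sum_count in the next expression
dplyr_res <- daf %>%
  summarise(sum_count = sum(count[score >= 30]),
            sum_rel   = sum_count / sum(count))

all.equal(base_res, unlist(dplyr_res, use.names = FALSE))  # TRUE
```

Note that `summarise()` evaluates its arguments sequentially, which is why `sum_rel` can refer to the freshly created `sum_count` column.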
