How can I use R dplyr's summarize to count the number of rows that match a criterion?

Question

I have a dataset that I want to summarize. First, I want the sum of the scores for home and away games, which I can do. However, I also want to know how many outliers (defined as scores of more than 300 points) fall within each subcategory (home, away).

If I weren't using summarize, I know dplyr has the count() function, but I'd like this solution to appear in my summarize() call. Here's what I have and what I've tried, which fails to run:

#Test data
library(dplyr)

test <- tibble(score = c(100, 150, 200, 301, 150, 345, 102, 131),
                  location = c("home", "away", "home", "away", "home", "away", "home", "away"),
                  more_than_300 = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE))


#attempt 1, count rows that match a criteria
test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = nrow(.[more_than_300 == FALSE]))

Answer 1

You can use sum() on logical vectors: it automatically converts them to numeric values (TRUE becomes 1 and FALSE becomes 0), so you need only do:

test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers  = sum(more_than_300))
#> # A tibble: 2 x 3
#>   location total_score n_outliers
#>   <chr>          <dbl>      <int>
#> 1 away             927          2
#> 2 home             552          0
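
As a quick illustration of why this works, here is a minimal sketch on a standalone logical vector (the name flags is made up for this example and is not part of the question's data):

```r
# Hypothetical logical vector, not taken from the question's data
flags <- c(FALSE, TRUE, FALSE, TRUE)

# TRUE is coerced to 1 and FALSE to 0, so sum() counts the TRUE values
n_true <- sum(flags)

# By the same coercion, mean() gives the proportion of TRUE values
prop_true <- mean(flags)

print(n_true)
print(prop_true)
```

The same idea should carry over to any condition that evaluates to a logical vector inside summarize().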

Or, if these are your only 3 columns, an equivalent would be:

test %>%
  group_by(location) %>%
  summarize(across(everything(), sum))

In fact, you don't need to create the more_than_300 column. It would suffice to do:

test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers  = sum(score > 300))
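
One caveat worth keeping in mind: if score could contain missing values, score > 300 yields NA for those rows, and sum() would then return NA for the whole group. A minimal sketch, assuming a made-up tibble test_na with one missing score (not part of the question's data), is to pass na.rm = TRUE:

```r
library(dplyr)

# Hypothetical data with one missing score, for illustration only
test_na <- tibble(score = c(100, NA, 200, 301, 150, 345),
                  location = c("home", "away", "home", "away", "home", "away"))

# na.rm = TRUE drops the NA values before summing, so the missing
# score is ignored in both the total and the outlier count
result <- test_na %>%
  group_by(location) %>%
  summarize(total_score = sum(score, na.rm = TRUE),
            n_outliers  = sum(score > 300, na.rm = TRUE))

print(result)
```

Whether dropping missing scores is appropriate depends on the data; with the question's test tibble there are no NA values, so this is not needed.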

Answer 2

In base R, we can try aggregate like this:

> aggregate(.~location,test,sum)
  location score more_than_300
1     away   927             2
2     home   552             0
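
If the more_than_300 column does not exist yet, aggregate should also accept a cbind() on the left-hand side of the formula, so the outlier flag can be computed on the fly. This is only a sketch: the data frame test_df below is a plain data.frame copy of the question's data, and the column names total_score and n_outliers are chosen to mirror the dplyr output.

```r
# Base R copy of the question's data (test_df is a name made up for this sketch)
test_df <- data.frame(
  score = c(100, 150, 200, 301, 150, 345, 102, 131),
  location = c("home", "away", "home", "away", "home", "away", "home", "away")
)

# cbind() builds a two-column response: the raw score and a logical
# outlier flag (coerced to 0/1); sum is then applied to each column per group
res <- aggregate(cbind(total_score = score, n_outliers = score > 300) ~ location,
                 data = test_df, FUN = sum)

print(res)
```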

Answer 3

In base R, xtabs can be used to sum per group.

xtabs(cbind(score, more_than_300) ~ ., test)
#location score more_than_300
#    away   927             2
#    home   552             0

Or by calculating the outliers on the fly and giving the desired column names:

xtabs(cbind(total_score = score, n_outliers = score > 300) ~ location, test)
#location total_score n_outliers
#    away         927          2
#    home         552          0

Another option, also in base R, is rowsum.

with(test, rowsum(cbind(total_score = score, n_outliers = score > 300), location))
#     total_score n_outliers
#away         927          2
#home         552          0

xtabs and rowsum are specialized in calculating sums per group and might perform well in this task.
