I have a data set with multiple categorical variables:
library(dplyr)
data <- tibble(
  HomeTeam = c("Team1", "Team2", "Team3", "Team4", "Team2", "Team2", "Team4",
               "Team3", "Team2", "Team1", "Team3", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team4", "Team3", "Team1", "Team4", "Team1",
               "Team2", "Team3", "Team3", "Team4", "Team1"),
  HomeScore = c(10, 5, 12, 18, 17, 19, 23, 17, 34, 19, 8, 3),
  AwayScore = c(4, 16, 9, 19, 16, 4, 8, 21, 6, 5, 9, 17),
  Venue = c("Ground1", "Ground2", "Ground3", "Ground3", "Ground1", "Ground2",
            "Ground1", "Ground3", "Ground2", "Ground3", "Ground4", "Ground2"))
I want to count how often each team appears in "HomeTeam" and in "AwayTeam" and put the counts into a new table, like this:
  HomeTeam NumberOfGamesHome NumberOfGamesAway
  <chr>                <int>             <int>
1 Team1                    2                 4
2 Team2                    5                 2
3 Team3                    3                 3
4 Team4                    2                 3
My current approach needs two separate group_by/summarise steps, followed by a join of the two tables:
HomeTeamCount <- data %>%
  group_by(HomeTeam) %>%
  summarise(NumberOfGamesHome = n())
AwayTeamCount <- data %>%
  group_by(AwayTeam) %>%
  summarise(NumberOfGamesAway = n())
Desired <- left_join(HomeTeamCount, AwayTeamCount,
                     by = c("HomeTeam" = "AwayTeam"))
In my actual data set, I have a large number of categorical variables, and repeating the above approach for each of them seems laborious and inefficient.
Is there a way with dplyr to group by multiple categorical variables at once to produce the desired output? Or perhaps with data.table?
I have consulted several other questions, such as here and here, but have not been able to figure out the answer.
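Here is one possibility using gather to reshape the data from wide to long, then grouping by team and counting the number of home and away games. It assumes data contains only the HomeTeam and AwayTeam columns, because gather(key, Team) without a column selection gathers every column.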
library(tidyverse)
data %>%
  gather(key, Team) %>%
  group_by(Team) %>%
  summarise(
    NumberOfGamesHome = sum(key == "HomeTeam"),
    NumberOfGamesAway = sum(key == "AwayTeam"))
## A tibble: 4 x 3
#  Team  NumberOfGamesHome NumberOfGamesAway
#  <chr>             <int>             <int>
#1 Team1                 2                 4
#2 Team2                 5                 2
#3 Team3                 3                 3
#4 Team4                 2                 3
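In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a minimal sketch of the same reshape-then-count idea, assuming a small hypothetical two-column tibble called teams (not the data from the question), it might look like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical two-column example, for illustration only
teams <- tibble(
  HomeTeam = c("Team1", "Team2", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team3"))

# Wide to long: one row per (original column name, team) pair
long <- teams %>%
  pivot_longer(everything(), names_to = "key", values_to = "Team")

# Count how often each team came from each original column
result <- long %>%
  group_by(Team) %>%
  summarise(
    NumberOfGamesHome = sum(key == "HomeTeam"),
    NumberOfGamesAway = sum(key == "AwayTeam"))
```

Because every team lands in the single Team column before grouping, a team that never plays at home (Team3 here) still gets a row with a home count of 0. A left_join that starts from the home counts would drop such a team.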
With your updated sample data, which has additional columns, name the columns to gather explicitly:
data %>%
  gather(key, Team, HomeTeam, AwayTeam) %>%
  group_by(Team) %>%
  summarise(
    NumberOfGamesHome = sum(key == "HomeTeam"),
    NumberOfGamesAway = sum(key == "AwayTeam"))
## A tibble: 4 x 3
#  Team  NumberOfGamesHome NumberOfGamesAway
#  <chr>             <int>             <int>
#1 Team1                 2                 4
#2 Team2                 5                 2
#3 Team3                 3                 3
#4 Team4                 2                 3
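Since the question also mentions data.table, here is a rough sketch of the same idea using melt(). It rebuilds the team and venue columns from the question so that it runs on its own; the names dt, long, side and counts are only illustrative, and this is one possible way to write it rather than the only one.

```r
library(data.table)

# Same teams and venues as in the question (scores omitted for brevity)
dt <- data.table(
  HomeTeam = c("Team1", "Team2", "Team3", "Team4", "Team2", "Team2",
               "Team4", "Team3", "Team2", "Team1", "Team3", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team4", "Team3", "Team1", "Team4",
               "Team1", "Team2", "Team3", "Team3", "Team4", "Team1"),
  Venue = c("Ground1", "Ground2", "Ground3", "Ground3", "Ground1", "Ground2",
            "Ground1", "Ground3", "Ground2", "Ground3", "Ground4", "Ground2"))

# Wide to long: only the two team columns are melted, Venue is kept as an id
long <- melt(dt, measure.vars = c("HomeTeam", "AwayTeam"),
             variable.name = "side", value.name = "Team")

# Count home and away appearances per team; keyby also sorts by Team
counts <- long[, .(NumberOfGamesHome = sum(side == "HomeTeam"),
                   NumberOfGamesAway = sum(side == "AwayTeam")),
               keyby = Team]
```

Listing the columns in measure.vars plays the same role as naming HomeTeam and AwayTeam in gather(): the other columns are left out of the reshape.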