简体   繁体   中英

R function to summarize multiple columns of data with multiple functions, grouped by a column

I have a dataframe that contains the following columns:

  • game_id - chr, 1 ID for each game, multiple rows per game
  • home_lineup - chr
  • away_lineup - chr
  • home_plusminus - int
  • away_plusminus - int
  • home_team - chr
  • away_team - chr

I need to calculate the per game sum of the home_plusminus and away_plusminus for each home_lineup and each away_lineup.

The data looks like this:

game_id home_lineup awaylineup home_Plusminus Away_Plusminus home_team  away_team
12345   L1          L2          -2              2            BOS         ATL
12345   L3          L4           3             -3            BOS         ATL
12345   L3          L4           3             -3            BOS         ATL
45678   L2          L1           3             -3            ATL         BOS
45678   L2          L7           1             -1            ATL         BOS
45678   L8          L1           3             -3            ATL         BOS

The above data shows 2 games played.

I want the final output to look like this:

Team Lineup PlusMinus Pergame
BOS  L1     -8        -4.0
BOS  L3      6         6.0
BOS  L7     -1        -1.0
ATL  L2      6         3.0
ATL  L4     -6        -6.0
ATL  L8      3         3.0

So in the above example, L1 played in two games with a total plusMinus of -8. L3 only played in 1 game.

Here's an approach with tidyr and dplyr .

library(tidyr); library(dplyr)

# Step 1 - make into tidy data frame with one row per observation
home <- df %>% select(game_id, contains("home")) %>% 
  rename("Lineup" = "home_lineup", "Team" = "home_team", "plusminus" = "home_Plusminus")

away <- df %>% select(game_id, contains("away")) %>% 
  rename("Lineup" = "awaylineup", "Team" = "away_team", "plusminus" = "Away_Plusminus")

tidy <- bind_rows(home, away, .id = "location")



# Step 2 - summarize
output <- tidy %>%
  group_by(Team, Lineup) %>%
  summarize(PlusMinus = sum(plusminus),
            PerGame = PlusMinus/n_distinct(game_id)) %>% ungroup()

Output:

> output
# A tibble: 6 x 4
  Team  Lineup PlusMinus PerGame
  <chr> <chr>      <int>   <dbl>
1 ATL   L2             6       3
2 ATL   L4            -6      -6
3 ATL   L8             3       3
4 BOS   L1            -8      -4
5 BOS   L3             6       6
6 BOS   L7            -1      -1

Sample data:

df <- read.table(header = T, stringsAsFactors = F, text = "
                 game_id home_lineup awaylineup  home_Plusminus  Away_Plusminus  home_team   away_team
 12345  L1          L2          -2              2               BOS       ATL
     12345  L3          L4           3             -3               BOS       ATL
     12345  L3          L4           3             -3               BOS       ATL
     45678  L2          L1           3             -3               ATL       BOS
     45678  L2          L7           1             -1               ATL       BOS
     45678  L8          L1           3             -3               ATL       BOS")

Similar solution to Jon's:

library(tidyverse)

dat <- tribble(
  ~game_id, ~home_lineup, ~awaylineup,  ~home_Plusminus,  ~Away_Plusminus,  ~home_team,   ~away_team,
  12345,  "L1",          "L2",          -2,              2,               "BOS",       "ATL",
  12345,  "L3",          "L4",           3,             -3,               "BOS",       "ATL",
  # 12345,  "L3",          "L4",           3,             -3,               "BOS",       "ATL",
  45678,  "L2",          "L1",           3,             -3,               "ATL",       "BOS",
  45678,  "L2",          "L7",           1,             -1,               "ATL",       "BOS",
  45678,  "L8",          "L1",           3,             -3,               "ATL",       "BOS"
)

long <- 
  dat %>% 
  gather(where, team, home_team:away_team) %>% 
  mutate(
    home_lineup = case_when(where == "home_team" ~ home_lineup,
                            TRUE ~ NA_character_),
    away_lineup = case_when(where == "away_team" ~ awaylineup,
                            TRUE ~ NA_character_),
    home_plusminus = case_when(where == "home_team" ~ home_Plusminus,
                            TRUE ~ NA_real_),
    away_plusminus = case_when(where == "away_team" ~ Away_Plusminus,
                            TRUE ~ NA_real_)
  ) %>% 
  select(-home_Plusminus, -Away_Plusminus, -awaylineup) %>% 
  gather(plus_minus_type, plus_minus, home_plusminus:away_plusminus) %>%
  gather(lineup_type, lineup, home_lineup:away_lineup, -where, -team) %>% 
  mutate(
    where = where %>% str_remove("_team"),
    lineup_type = lineup_type %>% str_remove("_") %>% str_remove("lineup"),
    plus_minus_type = lineup_type %>% str_remove("_Plusminus")
  ) %>% 
  drop_na()

long %>% 
  group_by(
    team, lineup
  ) %>% 
  summarise(
    PlusMinus = sum(plus_minus),
    Pergame = sum(plus_minus) / n()
  )
#> # A tibble: 6 x 4
#> # Groups:   team [?]
#>   team  lineup PlusMinus Pergame
#>   <chr> <chr>      <dbl>   <dbl>
#> 1 ATL   L2             6    2   
#> 2 ATL   L4            -3   -3   
#> 3 ATL   L8             3    3   
#> 4 BOS   L1            -8   -2.67
#> 5 BOS   L3             3    3   
#> 6 BOS   L7            -1   -1

Created on 2018-10-26 by the reprex package (v0.2.1)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM