
dplyr grouping across multiple columns in R?

I have some NBA data that looks like this:

> head(rebs)

game_id          a1               a2          a3             a4          a5           h1             h2         h3           h4
1 21800001 Dario Saric Robert Covington Joel Embiid Markelle Fultz Ben Simmons Jayson Tatum Gordon Hayward Al Horford Jaylen Brown
2 21800001 Dario Saric Robert Covington Joel Embiid Markelle Fultz Ben Simmons Jayson Tatum Gordon Hayward Al Horford Jaylen Brown
3 21800001 Dario Saric Robert Covington Joel Embiid Markelle Fultz Ben Simmons Jayson Tatum Gordon Hayward Al Horford Jaylen Brown
4 21800001 Dario Saric Robert Covington Joel Embiid Markelle Fultz Ben Simmons Jayson Tatum Gordon Hayward Al Horford Jaylen Brown
5 21800001 Dario Saric Robert Covington Joel Embiid Markelle Fultz Ben Simmons Jayson Tatum Gordon Hayward Al Horford Jaylen Brown
6 21800001 Dario Saric Robert Covington Joel Embiid Markelle Fultz Ben Simmons Jayson Tatum Gordon Hayward Al Horford Jaylen Brown
      h5           player       team    event_type              type     reb
1 Kyrie Irving                       start of period   start of period   0
2 Kyrie Irving       Al Horford  PHI       jump ball         jump ball   0
3 Kyrie Irving Robert Covington  PHI            miss         Jump Shot   0
4 Kyrie Irving                               rebound      team rebound   0
5 Kyrie Irving     Jayson Tatum  BOS            miss         Jump Shot   0
6 Kyrie Irving      Dario Saric  PHI         rebound rebound defensive   1

game_id is the ID of the game being played. There is a full season of data, so this set contains many different games.

This is NBA data at the play-by-play level. a1:a5 are the away team players currently on the floor, and h1:h5 are the home team players currently on the floor.

player is the name of the player who made the play described in that row.

team is the team of the player who made the play described in that row.

reb is a binary flag, with 1 meaning that a rebound was made and 0 meaning anything else. So, the 6th play tracked in this data was a rebound made by Dario Saric (Philadelphia).

I want to find the number of rebounds each player's team made while that player was on the floor, grouped at the game level. One thing that makes this difficult is that players move around within a1:h5 throughout the dataset, i.e. in this first game Dario Saric is later listed under a4 and a5. So it's basically random where a player will be listed in the a1:h5 lineup while they're on the floor (except that the away team is always in a1:a5 and the home team in h1:h5).

Here's what I used to find rebounds by a player, grouped by game:

library(dplyr)

# Individual rebounds per player per game
rebs %>%
  group_by(game_id, player) %>%
  summarize(rebs = sum(reb))

I'm unsure how to find the number of rebounds a team had while each player was on the floor, though. E.g. in the 6th play example, I would want that rebound to count towards all 5 of the Philadelphia players currently on the floor, not just Dario Saric.

I'm trying to use dplyr to do this, but I'm not sure if it's possible. I've tried group_by(game_id, team) followed by an %in% across a1:h5, but nothing is clicking. Any help is greatly appreciated!

With the tidyverse you could try the following, though it may not be the most efficient method.

First, filter for reb == 1 if you are only interested in the rebound data and want to ignore the rest of the plays.

Then assign a number to each of the rebound plays.

You can use pivot_longer to put the names of the players on the floor for each play into long format. This also separates the "home" and "away" players, so you can give credit only to players on the rebounder's team. You could perhaps use team for this instead, though it is missing for some plays.
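To make the reshaping step concrete, here is a minimal sketch on a single made-up play. The data frame one_play, its player names, and the reduced lineup of two slots per side (a1, a2, h1, h2) are assumptions for illustration only; the real data has a1:a5 and h1:h5.

```r
library(dplyr)
library(tidyr)

# Hypothetical single rebound play with two lineup slots per side
# (illustration only; the real data has a1:a5 and h1:h5).
one_play <- tibble(
  game_id = 1,
  a1 = "Away A", a2 = "Away B",
  h1 = "Home A", h2 = "Home B",
  player = "Away B",
  reb = 1
)

# Split each lineup column name into a side ("a" or "h") and a slot number,
# and stack the player names into a single column.
long_play <- one_play %>%
  pivot_longer(
    a1:h2,
    names_to = c("home_away", "num"),
    values_to = "team_player",
    names_pattern = "(a|h)(\\d)"
  )

long_play
```

The result has one row per player on the floor, with home_away recording which side that player is on, so the slot a player happens to occupy no longer matters.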

If you group_by game_id, home vs. away, and the play number, you can flag teammate rebounds by checking whether the player who made the rebound is %in% the players on that side (those sharing the same home vs. away value).

Then you can group_by game and player on the floor and sum these rebounds.

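library(tidyverse)

rebs %>%
  # Keep only the rebound plays and number them
  filter(reb == 1) %>%
  mutate(play_number = row_number()) %>%
  # One row per player on the floor; home_away is "a" or "h"
  pivot_longer(a1:h5, names_to = c("home_away", "num"), values_to = "team_player", names_pattern = "(a|h)(\\d)") %>%
  # Within each side of each play, credit everyone if the rebounder is on that side
  group_by(game_id, home_away, play_number) %>%
  mutate(teammate_reb = ifelse(player %in% team_player, 1, 0)) %>%
  # Total team rebounds while each player was on the floor, per game
  group_by(game_id, team_player) %>%
  summarise(reb_total = sum(teammate_reb), .groups = "drop")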
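As a quick sanity check, the same pipeline can be run on a tiny made-up game. The toy data frame and its player names are hypothetical, and it uses only two lineup slots per side (so the pivot range is a1:h2 rather than a1:h5); note that Away1 changes slots between plays, as players do in the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical play-by-play rows (illustration only, not real NBA data).
toy <- tribble(
  ~game_id, ~a1,     ~a2,     ~h1,     ~h2,     ~player, ~reb,
  1,        "Away1", "Away2", "Home1", "Home2", "Away1", 1,
  1,        "Away2", "Away1", "Home1", "Home2", "Home1", 1,
  1,        "Away3", "Away1", "Home1", "Home2", "Away3", 1,
  1,        "Away3", "Away1", "Home1", "Home2", "Home2", 0
)

on_floor_rebs <- toy %>%
  filter(reb == 1) %>%
  mutate(play_number = row_number()) %>%
  pivot_longer(
    a1:h2,
    names_to = c("home_away", "num"),
    values_to = "team_player",
    names_pattern = "(a|h)(\\d)"
  ) %>%
  group_by(game_id, home_away, play_number) %>%
  mutate(teammate_reb = ifelse(player %in% team_player, 1, 0)) %>%
  group_by(game_id, team_player) %>%
  summarise(reb_total = sum(teammate_reb), .groups = "drop")

on_floor_rebs
```

Here Away1 is credited with 2 team rebounds (the ones by Away1 and Away3, both made while Away1 was on the floor), and every other player with 1, regardless of which slot they were listed in.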
