
Grouping within group in R, plyr/dplyr

I'm working on the baseball data set:

data(baseball, package="plyr") 
library(dplyr)

baseball[,1:4] %>% head
           id year stint team
4   ansonca01 1871     1  RC1
44  forceda01 1871     1  WS3
68  mathebo01 1871     1  FW1
99  startjo01 1871     1  NY2
102 suttoez01 1871     1  CL1
106 whitede01 1871     1  CL1

First I want to group the data set by team in order to find the first year each team appears and the number of distinct players who have ever played for each team:

baseball[,1:4] %>% group_by(team) %>% 
    summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))

# A tibble: 132 × 3
    team first_year num_distinct_players
   <chr>      <int>                <int>
1    ALT       1884                    1
2    ANA       1997                   29
3    ARI       1998                   43
4    ATL       1966                  133
5    BAL       1954                  158

Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?

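Perhaps this helps. Group by id and team first to count the distinct years each player spent with each team, then regroup by team alone and summarise: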

baseball %>% 
   select(1:4) %>% 
   group_by(id, team) %>%
   dplyr::mutate(nyear = n_distinct(year)) %>% 
   group_by(team) %>%
   dplyr::summarise(first_year = min(year), 
                    num_distinct_players = n_distinct(id),
                    maxYear = max(nyear))
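
The same "group within a group" idea can also be written as two explicit summarise() stages, which may make the intermediate per-player table easier to inspect. This is only a minimal sketch: the toy data frame and the names per_player, per_team and max_years are invented for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for baseball[, 1:4]; the values are invented for illustration.
# Player "a" deliberately has two rows for team X in 1991.
toy <- data.frame(
  id    = c("a", "a", "a", "b", "b", "c", "c", "c"),
  year  = c(1990, 1991, 1991, 1990, 1992, 1991, 1992, 1993),
  stint = c(1, 1, 2, 1, 1, 1, 1, 1),
  team  = c("X", "X", "X", "X", "Y", "Y", "Y", "Y")
)

# Stage 1: one row per (team, player) with that player's distinct years.
per_player <- toy %>%
  group_by(team, id) %>%
  summarise(first_year = min(year),
            nyear = n_distinct(year),
            .groups = "drop")

# Stage 2: collapse the players within each team.
per_team <- per_player %>%
  group_by(team) %>%
  summarise(first_year = min(first_year),
            num_distinct_players = n(),
            max_years = max(nyear))

print(per_team)
```

Because stage 1 already yields one row per player and team, n() in stage 2 counts distinct players, and max(nyear) is the longest tenure in years rather than in rows.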

I tried doing this with base R and came up with this. It's fairly slow.

df = do.call(rbind, lapply(split(baseball, baseball$team), function(x) {
    # Rows per player within this team, computed once and reused
    n = sapply(split(x, x$id), nrow)
    # data.frame() keeps the numeric columns numeric;
    # cbind() with a character value would coerce every column to character
    data.frame(min(x$year), length(unique(x$id)), max(n), names(which.max(n)))
}))

colnames(df) = c("First Year", "Unique Players", "Longest played duration",
                                            "Longest Playing Player")

How it works:
  1. Split the data by team into separate groups
  2. For each group, take the minimum year, which is the first year the team appears
  3. Take the number of unique ids, which is the number of players on that team
  4. Split each group into subgroups by id and take the maximum number of rows, which is the longest duration any player spent with that team
  5. Take the id of the subgroup with the most rows, which is the player who played longest for that team
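
The steps above can also be sketched without nested split() calls by cross-tabulating team against id once, which is likely to be faster on the full data set. This is a hedged sketch rather than a drop-in replacement: the toy data frame and the column names in result are invented for illustration, and it counts distinct years per player (by de-duplicating first) instead of raw rows.

```r
# Toy stand-in for the baseball columns used here; values are invented.
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b", "c", "c", "c"),
  year = c(1990, 1991, 1991, 1990, 1992, 1991, 1992, 1993),
  team = c("X", "X", "X", "X", "Y", "Y", "Y", "Y")
)

# Drop repeated (team, id, year) rows so that each row is one player-year.
u <- unique(toy[, c("team", "id", "year")])

# Years per (team, player): teams are rows, players are columns.
counts <- table(u$team, u$id)

result <- data.frame(
  team             = rownames(counts),
  first_year       = as.vector(tapply(u$year, u$team, min)),
  unique_players   = unname(rowSums(counts > 0)),
  longest_duration = unname(apply(counts, 1, max)),
  longest_player   = colnames(counts)[apply(counts, 1, which.max)],
  stringsAsFactors = FALSE
)

print(result)
```

Both table() and tapply() order teams by the sorted unique team values, so the columns line up row for row. When several players tie for the longest duration, which.max() reports only the first of them.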
