简体   繁体   中英

Accumulate number of distinct values in a column grouped by another variable in R

I am struggling with a data request from a data.frame in R, which contains a table with clubs, their IDs, the year they have been founded, the ID of the district they belong to, and the year they have been closed. The data.frame dat looks like

club_id   founded    district_id    closed
   1       2012         1             NA
   2       2012         2            2014
   3       2013         2             NA
   4       2013         3            2014
  ...

NA in the closed column means that the club still exists in the current year (2016).

What I want is the number of still existing clubs for each year, eg, this would give me for the data snippet from above:

2012   2013   2014 ...
  2      4     2   ...

I tried something using dplyr

dat %>%
 group_by(founded) %>%
 summarise(clubs_per_year = n_distinct(club_id))

However, this gives me only the number of new clubs founded in each year and not the total number of existing clubs per year.

I'm not sure how to achieve this using dplyr , but here's a possible data.table solution. This is basically creates a sequence per club while replacing NA with the current year and then counts the incidents per year

library(data.table)
setDT(df)[, .(Year = founded:(replace(closed, is.na(closed), year(Sys.Date())) - 1L)), 
            by = club_id
          ][, .(Uniques = uniqueN(club_id)), by = Year]

#    Year Uniques
# 1: 2012       2
# 2: 2013       4
# 3: 2014       2
# 4: 2015       2

I tried a full-dplyr solution. The plan of attack is to generate a sequence of active years for each club, then count the club-ids in each active year.

First, we figure out the last active year for each club.

max_year <- 2015
years <- data_frame(
  club_id = 1:4, 
  founded = c(2012, 2012, 2013, 2013),
  closed = c(NA, 2014, NA, 2014))

years <- years %>% 
  mutate(last_active = ifelse(is.na(closed), max_year, closed - 1))
years
#> Source: local data frame [4 x 4]
#> 
#>   club_id founded closed last_active
#>     (int)   (dbl)  (dbl)       (dbl)
#> 1       1    2012     NA        2015
#> 2       2    2012   2014        2013
#> 3       3    2013     NA        2015
#> 4       4    2013   2014        2013

Next, we create another data-frame that contains one row for each active year in each founded - last_active range in the data. We do this by using the do function. do let's us compute arbitrary functions on data-frames; the only rule is that the function needs to return a data-frame. do respects grouping variables, so those grouping columns are returned as well.

# Create a single-column data-frame with a sequence of values
seq_df <- function(col_name, min, max) {
  data.frame(seq(min, max)) %>% setNames(col_name)
}

year_scheme <- years %>% 
  # Find each found-last_active pairings
  select(founded, last_active) %>% 
  distinct %>% 
  # Create a sequence of rows for each of those pairings
  group_by(founded, last_active) %>% 
  do(seq_df("active_year", .$founded, .$last_active)) %>% 
  ungroup
year_scheme
#> Source: local data frame [10 x 3]
#> 
#>    founded last_active active_year
#>      (dbl)       (dbl)       (int)
#> 1     2012        2013        2012
#> 2     2012        2013        2013
#> 3     2012        2015        2012
#> 4     2012        2015        2013
#> 5     2012        2015        2014
#> 6     2012        2015        2015
#> 7     2013        2013        2013
#> 8     2013        2015        2013
#> 9     2013        2015        2014
#> 10    2013        2015        2015

Finally, we can join the tables and count the group ids.

full_years <- left_join(years, year_scheme)
#> Joining by: c("founded", "last_active")
full_years
#> Source: local data frame [10 x 5]
#> 
#>    club_id founded closed last_active active_year
#>      (int)   (dbl)  (dbl)       (dbl)       (int)
#> 1        1    2012     NA        2015        2012
#> 2        1    2012     NA        2015        2013
#> 3        1    2012     NA        2015        2014
#> 4        1    2012     NA        2015        2015
#> 5        2    2012   2014        2014        2012
#> 6        2    2012   2014        2014        2013
#> 7        3    2013     NA        2015        2013
#> 8        3    2013     NA        2015        2014
#> 9        3    2013     NA        2015        2015
#> 10       4    2013   2014        2014        2013

# years per club
full_years %>% count(club_id)
#> Source: local data frame [4 x 2]
#> 
#>   club_id     n
#>     (int) (int)
#> 1       1     4
#> 2       2     2
#> 3       3     3
#> 4       4     1

# clubs per year
full_years %>% count(active_year)
#> Source: local data frame [4 x 2]
#> 
#>   active_year     n
#>         (int) (int)
#> 1        2012     2
#> 2        2013     4
#> 3        2014     2
#> 4        2015     2

This solution uses dcast from the data.table package:

library(data.table)

##Example data
DT <- data.table(club_id=1:4, founded=rep(2012:2013, each=2),
                 district_id=c(1, 2, 2, 3), closed=rep(c(NA, 2014), 2))

## Fill in NAs with current year, create row for each year the club
##  exists, cast to columns for each year, and get the count of clubs
##  per year using length function
dcast(DT[, .(year=founded:ifelse(is.na(closed), year(Sys.Date()), closed)),
      by=club_id],  . ~ year, length, fill=0)

##    . 2012 2013 2014 2015 2016
## 1: .    2    4    4    2    2

The version below is similar, but does not count a club in a year unless it was open for the entire year. Beware of the case where a club was opened and closed in the same year. I've added a club that was opened and closed in 2015.

DT2 <- data.table(club_id=1:5, founded=c(rep(2012:2013, each=2), 2015),
                  district_id=c(1, 2, 2, 3, 3),
                  closed=c(rep(c(NA, 2014), 2), 2015))

## Fill in missing values with the current year
DT2[, closed2:=ifelse(is.na(closed), year(Sys.Date()), closed)]

## Cast to columns as before, ignore cases where the club's open and
##  closed years match, and then subtract one from the closed year
dcast(DT2[founded!=closed2, .(year=founded:(closed2-1)), by=club_id],
      . ~ year, length, fill=0)

##    . 2012 2013 2014 2015
## 1: .    2    4    2    2

Here is a dplyr solution with a different approach

Note: I came up with this a number of days ago, but I found a bug with n_distinct and reported it; it is now fixed in the latest development version. In an older version of dplyr I had to use dplyr::n_distinct but the current version only requires n_distinct .

This approach creates a data.frame with year column and modifies it based on the dat data.frame (the data from OP)

library(dplyr)

yrdf <- data.frame(year = 2012:2015) # "dat" could be used to create this as well.
## For each year calculate the count based on the data in 'dat'
yrdf %>%
    group_by(year) %>%
    mutate(count = n_distinct(
                       dat$club_id[ (is.na(dat$closed) | (dat$closed > year)) & dat$founded <= year]
                       )
    ) %>%
    ungroup

##    year count
##   (int) (int)
## 1  2012     2
## 2  2013     4
## 3  2014     2
## 4  2015     2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM