简体   繁体   中英

R dplyr group_by summarise keep last non missing

Consider the following dataset where id uniquely identifies a person, and name varies within id only to the extent of minor spelling issues. I want to aggregate to id level using dplyr :

df= data.frame(id=c(1,1,1,2,2,2),name=c('michael c.','mike', 'michael','','John',NA),var=1:6)

Using group_by(id) yields the correct computation, but I lose the name column:

df %>% group_by(id) %>% summarise(newvar=sum(var)) %>%ungroup()
A tibble: 2 x 2
id newvar
  <dbl>  <int>
1     1      6
2     2     15

Using group_by(id,name) yields both name and id but obviously the "wrong" sums.

I would like to keep the last non-missing observatoin of the name within each group. I basically lack a dplyr version of Statas lastnm() function:

df %>% group_by(id) %>% summarise(sum = sum(var), Name = lastnm(name))
     id    sum  Name
1     1      6  michael
2     2     15  John

Is there a "keep last non missing"-option?

1) Use mutate like this:

df %>% 
  group_by(id) %>% 
  mutate(sum = sum(var)) %>%
  ungroup

giving:

# A tibble: 6 x 4
     id name         var   sum
  <dbl> <fct>      <int> <int>
1     1 michael c.     1     6
2     1 mike           2     6
3     1 michael        3     6
4     2 john           4    15
5     2 john           5    15
6     2 john           6    15

2) Another possibility is:

df %>% 
  group_by(id) %>% 
  summarize(name = name %>% unique %>% toString, sum = sum(var)) %>%
  ungroup

giving:

# A tibble: 2 x 3
     id name                        sum
  <dbl> <chr>                     <int>
1     1 michael c., mike, michael     6
2     2 john                         15

3) Another variation is to only report the first name in each group:

df %>% 
  group_by(id) %>% 
  summarize(name = first(name), sum = sum(var)) %>%
  ungroup

giving:

# A tibble: 2 x 3
     id name         sum
  <dbl> <fct>      <int>
1     1 michael c.     6
2     2 john          15

I posted a feature request on dplyrs github thread, and the reponse there is actually the best answer. For sake of completion I repost it here:

df %>% 
   group_by(id) %>% 
   summarise(sum=sum(var), Name=last(name[!is.na(name)]))

#> # A tibble: 2 x 3
#>      id   sum Name   
#>   <dbl> <int> <chr>  
#> 1     1     6 michael
#> 2     2    15 John

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM