
Create a new column with max values using the identifier column within a pipeline

I am trying to clean up some old code and convert it to the tidyverse style. Within a pipeline, I want to create a new column that holds the maximum age of each individual fish. The columns of interest look like this:

fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
                     fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
                     agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))

# which looks like this:
fish_1
   year fishid agei
1  2012      a    1
2  2012      a    2
3  2015      b    1
4  2015      b    2
5  2015      b    3
6  2013      c    1
7  2013      c    2
8  2013      c    3
9  2013      c    4
10 2012      d    1
11 2012      d    2
12 2015      e    1
13 2015      e    2
14 2015      e    3

What I'm trying to do is create a new column, agec, that holds the maximum age of each individual fish, repeated on every row that belongs to that fish.

The desired output would be:

fish_2 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
                     fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
                     agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3),
                     agec = c(2,2,3,3,3,4,4,4,4,2,2,3,3,3))
# Which looks like:
fish_2

   year fishid agei agec
1  2012      a    1    2
2  2012      a    2    2
3  2015      b    1    3
4  2015      b    2    3
5  2015      b    3    3
6  2013      c    1    4
7  2013      c    2    4
8  2013      c    3    4
9  2013      c    4    4
10 2012      d    1    2
11 2012      d    2    2
12 2015      e    1    3
13 2015      e    2    3
14 2015      e    3    3

In the past I did this with a plyr::ddply() call that created a new data frame, which I then merged with fish_1 like this:

caps = plyr::ddply(fish_1, c('fishid'), plyr::summarize, agec=max(agei))
fish = merge(fish_1, caps, by='fishid')
fish

   fishid year agei agec
1       a 2012    1    2
2       a 2012    2    2
3       b 2015    1    3
4       b 2015    2    3
5       b 2015    3    3
6       c 2013    1    4
7       c 2013    2    4
8       c 2013    3    4
9       c 2013    4    4
10      d 2012    1    2
11      d 2012    2    2
12      e 2015    1    3
13      e 2015    2    3
14      e 2015    3    3

I'm hoping someone can help me produce this data structure concisely within a pipeline. The similar questions I have found were very verbose and not specific to this issue. I am new to the tidyverse and am having trouble using group_by() (to replace the ddply() call) within a pipe, so I'm hoping there is a simpler way.

UPDATE

For those interested, both answers below appear to be correct. I struggled because I was already doing other data manipulation in my pipeline and tried to create the agec column inside an earlier call to dplyr::mutate(). See my comment on @Thomas's answer for the details of my mistake. Hope this helps.

Try dplyr instead of plyr

library(dplyr)

fish_1 %>% 
  group_by(fishid) %>% 
  mutate(agec = max(agei)) 
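
The pipeline above returns a tibble that is still grouped by fishid, which can change how later verbs behave. A minimal sketch, assuming the grouping is only needed for this one step, adds ungroup() at the end and stores the result under the name fish_2 borrowed from the question:

```r
library(dplyr)

fish_1 <- data.frame(
  year   = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
  fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'),
  agei   = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3)
)

fish_2 <- fish_1 %>%
  group_by(fishid) %>%           # one group per fish
  mutate(agec = max(agei)) %>%   # group-wise maximum, recycled to every row of the group
  ungroup()                      # drop the grouping so later verbs act on the whole table

fish_2
```

Whether to ungroup is a judgment call; if the next step in the pipeline is also per fish, keeping the grouping may be more convenient.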

You can use group_by() from dplyr to group by fish ID and then call mutate() (also from dplyr) with max():

library(dplyr)

fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
                     fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
                     agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
fish_1 %>% 
  group_by(fishid) %>% 
  mutate(agec = max(agei))
# A tibble: 14 x 4
# Groups:   fishid [5]
    year fishid  agei  agec
   <dbl> <chr>  <dbl> <dbl>
 1  2012 a          1     2
 2  2012 a          2     2
 3  2015 b          1     3
 4  2015 b          2     3
 5  2015 b          3     3
 6  2013 c          1     4
 7  2013 c          2     4
 8  2013 c          3     4
 9  2013 c          4     4
10  2012 d          1     2
11  2012 d          2     2
12  2015 e          1     3
13  2015 e          2     3
14  2015 e          3     3
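
The asker's update mentions computing agec inside an earlier mutate() call. The exact code from that comment is not reproduced here, so the following is only an assumed illustration of that kind of mistake: when mutate() runs before group_by(), max() sees the whole column rather than one fish at a time.

```r
library(dplyr)

fish_1 <- data.frame(
  year   = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
  fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'),
  agei   = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3)
)

# No grouping: max() is taken over all 14 rows, so every fish gets the same value.
ungrouped <- fish_1 %>%
  mutate(agec = max(agei))

# group_by() placed before mutate(): max() is evaluated once per fish.
grouped <- fish_1 %>%
  group_by(fishid) %>%
  mutate(agec = max(agei)) %>%
  ungroup()

unique(ungrouped$agec)  # a single overall maximum
unique(grouped$agec)    # one maximum per distinct group value
```

In a longer pipeline, this suggests giving the grouped calculation its own mutate() after group_by(), rather than folding it into an earlier ungrouped mutate().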

An option with data.table

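library(data.table)
# setDT() converts fish_1 to a data.table by reference; := then adds agec in place, per fishid
setDT(fish_1)[, agec := max(agei, na.rm = TRUE), by = fishid]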
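
Note that setDT() changes fish_1 itself. A hedged variant, assuming the original data frame should stay untouched, works on a copy made with as.data.table(); the name fish_dt is introduced here only for illustration:

```r
library(data.table)

fish_1 <- data.frame(
  year   = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
  fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'),
  agei   = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3)
)

# as.data.table() returns a copy, so fish_1 remains a plain data.frame.
fish_dt <- as.data.table(fish_1)

# := adds agec to fish_dt by reference; by = fishid computes the maximum per fish.
fish_dt[, agec := max(agei, na.rm = TRUE), by = fishid]

fish_dt[]
```

The in-place setDT() form avoids the copy, which may matter for large tables; the copy is simply safer when other code still expects the original data frame.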
