简体   繁体   中英

Indexing a group in R with dplyr

I have a dataset as below:

structure(AI_decs)

                 Horse             Time RaceID dyLTO Value.LTO Draw.IV
1         Warne's Army 06/04/2021 13:00      1    56      3429    0.88
2        G For Gabrial 06/04/2021 13:00      1    57      3299    1.15
3         First Charge 06/04/2021 13:00      1    66      3429    1.06
4        Dream With Me 06/04/2021 13:00      1    62      2862    0.97
5             Qawamees 06/04/2021 13:00      1    61      4690    0.97
6          Glan Y Gors 06/04/2021 13:00      1    59      3429    1.50
7     The Dancing Poet 06/04/2021 13:00      1    42      4690    1.41
8               Finoah 06/04/2021 13:00      1    59     10260    0.97
9            Ravenscar 06/04/2021 13:30      2    58      5208    0.65
10          Arabescato 06/04/2021 13:30      2    57      2862    1.09
11        Thai Terrier 06/04/2021 13:30      2    58      7439    1.30
12   The Rutland Rebel 06/04/2021 13:30      2    55      3429    2.17
13         Red Tornado 06/04/2021 13:30      2    49      3340    0.43
14             Alfredo 06/04/2021 13:30      2    54      5208    1.30
15     Tynecastle Park 06/04/2021 13:30      2    72      7439    0.87
16           Waldkonig 06/04/2021 14:00      3    55      3493    1.35
17       Kaleidoscopic 06/04/2021 14:00      3    68      7439    1.64
18           Louganini 06/04/2021 14:00      3    75     56025    1.26

I have a list of columns with performance data values for horses in a race. My dataset has many more rows and it contains a number of horse races on a given day. Each horse race has a unique time and a different number of horses in each race.

Basically, I want to assign a raceId (index number) to each individual race.
I am currently having to do this in excel (see column RaceID) by comparing the Time column and adding 1 to the RaceId value every time we encounter a new race. This has to be done manually each day before I import into R.

I hope there is a way to do this in R Dplyr. I thought if I use Group_by 'Time' there might be a function a bit like n() or row_number() that would index the races for me.

Perhaps using Case_when and lag/lead.

Thanks in advance for any help. Graham

Try this:

Note: group_indices() was deprecated in dplyr 1.0.0.

library(dplyr)


df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))

  
    df  %>% 
    group_by(time) %>% 
    mutate(race_id = cur_group_id())
#> # A tibble: 12 x 2
#> # Groups:   time [4]
#>    time             race_id
#>    <chr>              <int>
#>  1 06/04/2021 13:00       1
#>  2 06/04/2021 13:00       1
#>  3 06/04/2021 13:00       1
#>  4 06/04/2021 13:30       2
#>  5 06/04/2021 13:30       2
#>  6 06/04/2021 13:30       2
#>  7 06/04/2021 14:00       3
#>  8 06/04/2021 14:00       3
#>  9 06/04/2021 14:00       3
#> 10 07/04/2021 14:00       4
#> 11 07/04/2021 14:00       4
#> 12 07/04/2021 14:00       4

Created on 2021-04-10 by the reprex package (v2.0.0)

You can group by data.table 's function rleid (ie, run length ID):

library(dplyr)
library(data.table)
df  %>% 
      group_by(race_id = rleid(time))
# A tibble: 12 x 2
# Groups:   race_id [4]
   time             race_id
   <chr>              <int>
 1 06/04/2021 13:00       1
 2 06/04/2021 13:00       1
 3 06/04/2021 13:00       1
 4 06/04/2021 13:30       2
 5 06/04/2021 13:30       2
 6 06/04/2021 13:30       2
 7 06/04/2021 14:00       3
 8 06/04/2021 14:00       3
 9 06/04/2021 14:00       3
10 07/04/2021 14:00       4
11 07/04/2021 14:00       4
12 07/04/2021 14:00       4

Data, from @Peter:

df <- data.frame(time = rep(c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00", "07/04/2021 14:00"), each = 3))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM