
Keep most recent observations when there are duplicates in R

I have the following data.

  date                         var1    level       score_1     score_2
   2020-02-19 12:10:52.166661    dog      n1           1           3
   2020-02-19 12:17:25.087898    dog      n1           3           6
   2020-02-19 12:34:27.624939    dog      n2           4           3
   2020-02-19 12:35:50.522116    cat      n1           2           0
   2020-02-19 12:38:49.547181    cat      n2           3           4

There should be just one observation for any combination of var1 & level. I want to eliminate duplicates and keep only the most recent records. In the example above, the first row should be eliminated, as the dog-n1 observation in row 2 is more recent. Nevertheless, I want to keep row 3 even though its var1 is also "dog", because its level is different.

So, what I want to obtain:

  date                         var1    level       score_1     score_2
   2020-02-19 12:17:25.087898    dog      n1           3           6
   2020-02-19 12:34:27.624939    dog      n2           4           3
   2020-02-19 12:35:50.522116    cat      n1           2           0
   2020-02-19 12:38:49.547181    cat      n2           3           4
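
For anyone who wants to run the answers below, the example data can be rebuilt like this (a sketch; the column types are assumptions based on the printout, with date parsed as POSIXct so max() and order() behave correctly):

```r
# Rebuild the question's example data; fractional seconds are parsed
# by as.POSIXct's default "%Y-%m-%d %H:%M:%OS" format
df <- data.frame(
  date = as.POSIXct(c(
    "2020-02-19 12:10:52.166661", "2020-02-19 12:17:25.087898",
    "2020-02-19 12:34:27.624939", "2020-02-19 12:35:50.522116",
    "2020-02-19 12:38:49.547181"
  )),
  var1    = c("dog", "dog", "dog", "cat", "cat"),
  level   = c("n1", "n1", "n2", "n1", "n2"),
  score_1 = c(1, 3, 4, 2, 3),
  score_2 = c(3, 6, 3, 0, 4)
)
```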

Using the tidyverse:

df %>%
  group_by(var1, level) %>%
  filter(date == max(date)) %>%
  ungroup()

In base R, use duplicated. It looks like your data is already sorted by date, so you can use

df[!duplicated(df[c("var1", "level")], fromLast = TRUE), ]

(By default, duplicated returns FALSE for the first occurrence of a value and TRUE for every later occurrence. Setting fromLast = TRUE reverses the direction, so the last occurrence is kept.)

If you're not sure your data is already sorted, sort it first!

df <- df[order(df$var1, df$level, df$date), ]
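
Putting the two base-R steps together on the question's data (a self-contained sketch; df is rebuilt inline with the sub-second precision dropped for brevity):

```r
df <- data.frame(
  date = as.POSIXct(c("2020-02-19 12:10:52", "2020-02-19 12:17:25",
                      "2020-02-19 12:34:27", "2020-02-19 12:35:50",
                      "2020-02-19 12:38:49")),
  var1    = c("dog", "dog", "dog", "cat", "cat"),
  level   = c("n1", "n1", "n2", "n1", "n2"),
  score_1 = c(1, 3, 4, 2, 3),
  score_2 = c(3, 6, 3, 0, 4)
)

# Sort so the last row of each var1/level group is the most recent,
# then keep only that last occurrence
df <- df[order(df$var1, df$level, df$date), ]
result <- df[!duplicated(df[c("var1", "level")], fromLast = TRUE), ]
result  # 4 rows: one per var1/level combination
```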

You can also use a data.table approach as follows:

library(data.table)
setDT(df)[, .SD[which.max(date)], .(var1, level)]
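
The .SD[which.max(date)] form is concise but materializes each group's subset; for large tables, a common and usually faster data.table idiom is to compute the winning row indices with .I and subset once (a sketch on the question's data, rebuilt inline without the score columns):

```r
library(data.table)

dt <- data.table(
  date = as.POSIXct(c("2020-02-19 12:10:52", "2020-02-19 12:17:25",
                      "2020-02-19 12:34:27", "2020-02-19 12:35:50",
                      "2020-02-19 12:38:49")),
  var1  = c("dog", "dog", "dog", "cat", "cat"),
  level = c("n1", "n1", "n2", "n1", "n2")
)

# .I is the row number within dt; which.max(date) picks the most
# recent row of each var1/level group, and V1 holds those indices
idx <- dt[, .I[which.max(date)], by = .(var1, level)]$V1
dt[idx]
```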

Another tidyverse answer, using dplyr::slice_max().

To demonstrate with a reproducible example, here is the flights data from the nycflights13 package:

library(nycflights13) # for the data
library(dplyr, warn.conflicts = FALSE)

my_flights <- # a subset of 3 columns
  flights |> 
  select(carrier, dest, time_hour)

my_flights # preview of the subset data
#> # A tibble: 336,776 × 3
#>    carrier dest  time_hour          
#>    <chr>   <chr> <dttm>             
#>  1 UA      IAH   2013-01-01 05:00:00
#>  2 UA      IAH   2013-01-01 05:00:00
#>  3 AA      MIA   2013-01-01 05:00:00
#>  4 B6      BQN   2013-01-01 05:00:00
#>  5 DL      ATL   2013-01-01 06:00:00
#>  6 UA      ORD   2013-01-01 05:00:00
#>  7 B6      FLL   2013-01-01 06:00:00
#>  8 EV      IAD   2013-01-01 06:00:00
#>  9 B6      MCO   2013-01-01 06:00:00
#> 10 AA      ORD   2013-01-01 06:00:00
#> # … with 336,766 more rows

Grouping by carrier & dest, we can see many rows for each group.

my_flights |> 
  count(carrier, dest)
#> # A tibble: 314 × 3
#>    carrier dest      n
#>    <chr>   <chr> <int>
#>  1 9E      ATL      59
#>  2 9E      AUS       2
#>  3 9E      AVL      10
#>  4 9E      BGR       1
#>  5 9E      BNA     474
#>  6 9E      BOS     914
#>  7 9E      BTV       2
#>  8 9E      BUF     833
#>  9 9E      BWI     856
#> 10 9E      CAE       3
#> # … with 304 more rows

So if we want to deduplicate those in-group rows by taking the most recent time_hour value, we can use slice_max():

my_flights |> 
  group_by(carrier, dest) |> 
  slice_max(time_hour)
#> # A tibble: 329 × 3
#> # Groups:   carrier, dest [314]
#>    carrier dest  time_hour          
#>    <chr>   <chr> <dttm>             
#>  1 9E      ATL   2013-05-04 07:00:00
#>  2 9E      AUS   2013-02-03 16:00:00
#>  3 9E      AVL   2013-07-13 11:00:00
#>  4 9E      BGR   2013-10-17 21:00:00
#>  5 9E      BNA   2013-12-31 15:00:00
#>  6 9E      BOS   2013-12-31 14:00:00
#>  7 9E      BTV   2013-09-01 12:00:00
#>  8 9E      BUF   2013-12-31 18:00:00
#>  9 9E      BWI   2013-12-31 19:00:00
#> 10 9E      CAE   2013-12-31 09:00:00
#> # … with 319 more rows

By the same token, we could have used slice_min() to get the rows with the earliest time_hour value.
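
Note that the result above has 329 rows but only 314 groups: by default slice_max() keeps every row tied for the maximum, so groups whose latest time_hour is shared return more than one row. Passing with_ties = FALSE returns exactly one row per group. A small self-contained illustration (the toy data here is made up, mirroring the flights columns):

```r
library(dplyr, warn.conflicts = FALSE)

# Two rows share the maximum time_hour in the UA/IAH group
toy <- data.frame(
  carrier   = c("UA", "UA", "AA"),
  dest      = c("IAH", "IAH", "MIA"),
  time_hour = as.POSIXct(c("2013-01-01 05:00:00", "2013-01-01 05:00:00",
                           "2013-01-01 06:00:00"))
)

toy %>%
  group_by(carrier, dest) %>%
  slice_max(time_hour)                    # keeps both tied UA/IAH rows -> 3 rows

toy %>%
  group_by(carrier, dest) %>%
  slice_max(time_hour, with_ties = FALSE) # one row per group -> 2 rows
```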
