I have the following data.
date var1 level score_1 score_2
2020-02-19 12:10:52.166661 dog n1 1 3
2020-02-19 12:17:25.087898 dog n1 3 6
2020-02-19 12:34:27.624939 dog n2 4 3
2020-02-19 12:35:50.522116 cat n1 2 0
2020-02-19 12:38:49.547181 cat n2 3 4
There should be just one observation for each combination of var1 and level. I want to eliminate duplicates and keep only the most recent records. In the example above, the first row should be eliminated because the dog-n1 observation in row 2 is more recent. However, I want to keep row 3 even though its var1 is also "dog", because its level is different.
So, what I want to obtain:
date var1 level score_1 score_2
2020-02-19 12:17:25.087898 dog n1 3 6
2020-02-19 12:34:27.624939 dog n2 4 3
2020-02-19 12:35:50.522116 cat n1 2 0
2020-02-19 12:38:49.547181 cat n2 3 4
Using tidyverse:
df %>%
  group_by(var1, level) %>%
  filter(date == max(date)) %>%
  ungroup()
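A runnable sketch of this pipeline on the question's data (the tribble construction and the POSIXct parsing are assumptions to make the example reproducible):

```r
library(dplyr)

# Rebuild the question's data frame
df <- tibble::tribble(
  ~date,                        ~var1, ~level, ~score_1, ~score_2,
  "2020-02-19 12:10:52.166661", "dog", "n1",   1,        3,
  "2020-02-19 12:17:25.087898", "dog", "n1",   3,        6,
  "2020-02-19 12:34:27.624939", "dog", "n2",   4,        3,
  "2020-02-19 12:35:50.522116", "cat", "n1",   2,        0,
  "2020-02-19 12:38:49.547181", "cat", "n2",   3,        4
) %>%
  mutate(date = as.POSIXct(date))

# Keep only the most recent row per var1/level combination
df %>%
  group_by(var1, level) %>%
  filter(date == max(date)) %>%
  ungroup()
# 4 rows: the earlier dog-n1 observation is dropped
```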
In base R, use duplicated. Your data looks like it is already sorted by date, so you can use
df[!duplicated(df[c("var1", "level")], fromLast = TRUE), ]
(By default, duplicated gives FALSE for the first occurrence of anything and TRUE for every other occurrence. Setting fromLast = TRUE reverses the direction, so the last occurrence is kept.)
If you're not sure your data is already sorted, sort it first:
df = df[order(df$var1, df$level, df$date), ]
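To make the duplicated() logic concrete, here is what it returns for the grouping columns of the question's data (already sorted by date):

```r
# Only the grouping columns matter for duplicated()
df <- data.frame(
  var1  = c("dog", "dog", "dog", "cat", "cat"),
  level = c("n1",  "n1",  "n2",  "n1",  "n2")
)

duplicated(df[c("var1", "level")])
# FALSE  TRUE FALSE FALSE FALSE   (row 2 flagged as a repeat of row 1)

duplicated(df[c("var1", "level")], fromLast = TRUE)
# TRUE FALSE FALSE FALSE FALSE    (row 1 flagged instead; the last occurrence survives)

df[!duplicated(df[c("var1", "level")], fromLast = TRUE), ]
# keeps rows 2 to 5, i.e. the most recent row per group
```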
You can also use a data.table approach, as follows:
library(data.table)
setDT(df)[, .SD[which.max(date)], .(var1, level)]
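Applied to the question's data, this looks as follows (the data.table construction and the POSIXct parsing are my additions to make it reproducible; which.max(date) returns the index of the latest date within each var1/level group, and .SD[...] selects that row):

```r
library(data.table)

dt <- data.table(
  date    = as.POSIXct(c("2020-02-19 12:10:52", "2020-02-19 12:17:25",
                         "2020-02-19 12:34:27", "2020-02-19 12:35:50",
                         "2020-02-19 12:38:49")),
  var1    = c("dog", "dog", "dog", "cat", "cat"),
  level   = c("n1", "n1", "n2", "n1", "n2"),
  score_1 = c(1, 3, 4, 2, 3),
  score_2 = c(3, 6, 3, 0, 4)
)

# One row per var1/level group: the one with the maximum date
dt[, .SD[which.max(date)], by = .(var1, level)]
# 4 rows; dog-n1 keeps the 12:17:25 record
```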
Another tidyverse answer, using dplyr::slice_max().
To demonstrate with a reproducible example, here is the flights data from the nycflights13 package:
library(nycflights13) # for the data
library(dplyr, warn.conflicts = FALSE)
my_flights <- # a subset of 3 columns
flights |>
select(carrier, dest, time_hour)
my_flights # preview of the subset data
#> # A tibble: 336,776 × 3
#> carrier dest time_hour
#> <chr> <chr> <dttm>
#> 1 UA IAH 2013-01-01 05:00:00
#> 2 UA IAH 2013-01-01 05:00:00
#> 3 AA MIA 2013-01-01 05:00:00
#> 4 B6 BQN 2013-01-01 05:00:00
#> 5 DL ATL 2013-01-01 06:00:00
#> 6 UA ORD 2013-01-01 05:00:00
#> 7 B6 FLL 2013-01-01 06:00:00
#> 8 EV IAD 2013-01-01 06:00:00
#> 9 B6 MCO 2013-01-01 06:00:00
#> 10 AA ORD 2013-01-01 06:00:00
#> # … with 336,766 more rows
Grouping by carrier and dest, we can see many rows for each group.
my_flights |>
count(carrier, dest)
#> # A tibble: 314 × 3
#> carrier dest n
#> <chr> <chr> <int>
#> 1 9E ATL 59
#> 2 9E AUS 2
#> 3 9E AVL 10
#> 4 9E BGR 1
#> 5 9E BNA 474
#> 6 9E BOS 914
#> 7 9E BTV 2
#> 8 9E BUF 833
#> 9 9E BWI 856
#> 10 9E CAE 3
#> # … with 304 more rows
So if we want to deduplicate those in-group rows by taking the most recent time_hour value, we can use slice_max():
my_flights |>
group_by(carrier, dest) |>
slice_max(time_hour)
#> # A tibble: 329 × 3
#> # Groups: carrier, dest [314]
#> carrier dest time_hour
#> <chr> <chr> <dttm>
#> 1 9E ATL 2013-05-04 07:00:00
#> 2 9E AUS 2013-02-03 16:00:00
#> 3 9E AVL 2013-07-13 11:00:00
#> 4 9E BGR 2013-10-17 21:00:00
#> 5 9E BNA 2013-12-31 15:00:00
#> 6 9E BOS 2013-12-31 14:00:00
#> 7 9E BTV 2013-09-01 12:00:00
#> 8 9E BUF 2013-12-31 18:00:00
#> 9 9E BWI 2013-12-31 19:00:00
#> 10 9E CAE 2013-12-31 09:00:00
#> # … with 319 more rows
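Note that the result above has 329 rows but only 314 groups: slice_max() keeps ties by default, so a group whose maximum time_hour appears on several rows contributes all of those rows. If you want exactly one row per group, slice_max() has a with_ties argument:

```r
my_flights |>
  group_by(carrier, dest) |>
  slice_max(time_hour, with_ties = FALSE) |>
  ungroup()
# exactly 314 rows, one per carrier/dest pair
```

Which tied row is returned is then arbitrary among the ties, so only use with_ties = FALSE when any of the tied rows is acceptable.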
By the same token, we could have used slice_min() to get the rows with the earliest time_hour value.