How do I remove duplicates based on three columns, keeping the row with the highest value in a specific column, using R?
I have a dataset that looks like this:
Unique Id  Class Id  Version Id
501        1         1
602        3         1
602        3         1
405        2         1
305        2         3
305        2         2
305        1         1
305        2         1
509        1         1
501        2         1
501        3         1
501        3         2
602        2         1
602        1         1
405        1         1
After running the script, the remaining entries should be:
Unique Id  Class Id  Version Id
501        1         1
602        3         1
405        2         1
305        2         3
305        1         1
509        1         1
501        2         1
501        3         2
602        2         1
602        1         1
405        1         1
Note that the row with Unique Id 501, Class Id 3 and Version Id 2 is kept because it has the highest Version Id for that Unique Id and Class Id. One of the two rows with Unique Id 602, Class Id 3 and Version Id 1 is deleted because the two rows are identical in every column.
Basically, I want the script to remove duplicates across the three columns and, for each combination of Unique Id and Class Id, keep only the row with the highest Version Id.
We can use rleid on the 'Unique Id' column and apply slice_max after grouping by that rleid and 'Class Id':
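library(dplyr)
library(data.table)
data %>%
    # grp numbers each run of consecutive identical `Unique Id` values
    group_by(grp = rleid(`Unique Id`), `Class Id`) %>%
    # keep the row(s) with the highest `Version Id` in each group
    slice_max(`Version Id`) %>%
    ungroup %>%
    select(-grp) %>%
    # drop exact duplicates left by ties
    distinct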
-output
# A tibble: 11 x 3
# `Unique Id` `Class Id` `Version Id`
# <int> <int> <int>
# 1 501 1 1
# 2 602 3 1
# 3 405 2 1
# 4 305 1 1
# 5 305 2 3
# 6 509 1 1
# 7 501 2 1
# 8 501 3 2
# 9 602 1 1
#10 602 2 1
#11 405 1 1
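To make the grouping above concrete, here is a minimal sketch of the run ids that the grouping relies on. It uses base R's rle() as a stand-in for data.table's rleid so that it runs without any packages. The vector name unique_id is only illustrative.

```r
# Unique Id column from the question, in its original row order
unique_id <- c(501L, 602L, 602L, 405L, 305L, 305L, 305L, 305L,
               509L, 501L, 501L, 501L, 602L, 602L, 405L)

# Number each run of consecutive equal values (base R stand-in for rleid)
run_id <- with(rle(unique_id), rep(seq_along(values), lengths))
print(run_id)
# [1] 1 2 2 3 4 4 4 4 5 6 6 6 7 7 8
```

Unique Id 501 falls into runs 1 and 6, so its first row and its later block of three rows are placed in different groups.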
Or, if each run of adjacent identical 'Unique Id' values does not need to be treated as its own group, group by 'Unique Id' directly:
data %>%
    group_by(`Unique Id`, `Class Id`) %>%
    slice_max(`Version Id`) %>%
    ungroup %>%
    distinct
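The two groupings only differ when the same Unique Id and Class Id pair shows up in more than one run. The following base R sketch illustrates this on a hypothetical toy data frame. The names toy, run and keep_max are made up for the example, and ave() stands in for the dplyr grouping.

```r
# Hypothetical toy data: id 1 appears in two separate runs
toy <- data.frame(id      = c(1L, 1L, 2L, 1L),
                  class   = c(1L, 1L, 1L, 1L),
                  version = c(1L, 2L, 1L, 3L))

# Run id of consecutive equal ids (what rleid computes)
run <- with(rle(toy$id), rep(seq_along(values), lengths))

# Hypothetical helper: keep rows whose version is the maximum within each key
keep_max <- function(df, key) {
  unique(df[df$version == ave(df$version, key, FUN = max), ])
}

by_run <- keep_max(toy, paste(run, toy$class))     # groups: runs of id, class
by_id  <- keep_max(toy, paste(toy$id, toy$class))  # groups: id, class

print(by_run)  # three rows: id 1 is kept once per run
print(by_id)   # two rows: id 1 is kept once overall
```

In the question's data no Unique Id and Class Id pair is split across runs, so both groupings keep the same rows and differ only in row order.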
Or using base R:
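# run id for each block of consecutive identical `Unique Id` values
ind <- with(rle(data$`Unique Id`), rep(seq_along(values), lengths))
# within each run, put the highest `Version Id` first
data1 <- data[order(ind, -data$`Version Id`),]
# ind is already non-decreasing, so it still lines up with the reordered rows
data1[!duplicated(cbind(ind, data1$`Class Id`)),]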
The example data:
data <- structure(list(`Unique Id` = c(501L, 602L, 602L, 405L, 305L,
    305L, 305L, 305L, 509L, 501L, 501L, 501L, 602L, 602L, 405L),
    `Class Id` = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L,
    3L, 2L, 1L, 1L), `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L,
    1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)), class = "data.frame",
    row.names = c(NA, -15L))
If the row order doesn't matter, we can sort the data (here stored as df) so that higher Version Ids come first within each Unique Id and Class Id, and then remove the rows that duplicate an earlier Unique Id and Class Id pair.
df <- df[order(df[,1], df[,2], -df[,3]),]
df <- df[!duplicated(df[,-3]),]
df
   Unique Id Class Id Version Id
7        305        1          1
5        305        2          3
15       405        1          1
4        405        2          1
1        501        1          1
10       501        2          1
12       501        3          2
9        509        1          1
14       602        1          1
13       602        2          1
2        602        3          1
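If the original row order should be preserved instead, one possible base R sketch (an alternative, not part of the answer above) keeps the rows whose Version Id equals the maximum for their Unique Id and Class Id pair, computed with ave(), and then drops exact duplicate rows with unique(). The names key, max_version and result are illustrative.

```r
# Example data from the question
data <- data.frame(
  `Unique Id`  = c(501L, 602L, 602L, 405L, 305L, 305L, 305L, 305L,
                   509L, 501L, 501L, 501L, 602L, 602L, 405L),
  `Class Id`   = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 3L, 2L, 1L, 1L),
  `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L),
  check.names = FALSE
)

# One key per (Unique Id, Class Id) pair
key <- paste(data$`Unique Id`, data$`Class Id`)

# Highest Version Id within each pair, repeated for every row of that pair
max_version <- ave(data$`Version Id`, key, FUN = max)

# Keep rows at the group maximum, then drop exact duplicate rows
result <- unique(data[data$`Version Id` == max_version, ])
print(result)
```

On the question's data this leaves 11 rows in the same order as the expected output listed in the question.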