How to remove duplicates based on three columns, but keep the row with the highest number in a specific column, using R?

Question

I have a dataset that looks like this:

 Unique Id Class Id Version Id
       501        1          1
       602        3          1
       602        3          1
       405        2          1
       305        2          3
       305        2          2
       305        1          1
       305        2          1
       509        1          1
       501        2          1
       501        3          1
       501        3          2
       602        2          1
       602        1          1
       405        1          1

After running the script, the remaining entries should be:

 Unique Id Class Id Version Id
       501        1          1
       602        3          1
       405        2          1
       305        2          3
       305        1          1
       509        1          1
       501        2          1
       501        3          2
       602        2          1
       602        1          1
       405        1          1

Note that the row with Unique Id 501, Class Id 3 and Version Id 2 was kept because it has the highest Version Id for that combination. One of the two rows with Unique Id 602, Class Id 3 and Version Id 1 was deleted because the two rows are identical in every column.

Basically, I want the script to delete all duplicates based on the three columns and leave the row with the highest Version Id.

Answer 1

We can use rleid (from data.table) on the 'Unique Id' column, group by that run id together with 'Class Id', and then apply slice_max to 'Version Id'.

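library(dplyr)
library(data.table)  # for rleid()

data %>%
  # one group per run of adjacent equal `Unique Id` values, per `Class Id`
  group_by(grp = rleid(`Unique Id`), `Class Id`) %>%
  # keep the row(s) with the highest `Version Id`; ties are kept
  slice_max(`Version Id`) %>%
  ungroup() %>%
  select(-grp) %>%
  # drop the exact duplicates left by ties
  distinct()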

-output

# A tibble: 11 x 3
#   `Unique Id` `Class Id` `Version Id`
#         <int>      <int>        <int>
# 1         501          1            1
# 2         602          3            1
# 3         405          2            1
# 4         305          1            1
# 5         305          2            3
# 6         509          1            1
# 7         501          2            1
# 8         501          3            2
# 9         602          1            1
#10         602          2            1
#11         405          1            1
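
To make the grouping above more concrete, here is a minimal sketch of the run ids that rleid would assign to the 'Unique Id' column. It uses base R's rle as a stand-in for data.table::rleid so that it runs without extra packages; the names unique_id, runs and run_id are chosen only for this illustration.

```r
# `Unique Id` column from the question's data, in its original row order
unique_id <- c(501L, 602L, 602L, 405L, 305L, 305L, 305L, 305L,
               509L, 501L, 501L, 501L, 602L, 602L, 405L)

# Base R stand-in for data.table::rleid():
# number each run of adjacent equal values 1, 2, 3, ...
runs <- rle(unique_id)
run_id <- rep(seq_along(runs$values), runs$lengths)

print(run_id)
# 1 2 2 3 4 4 4 4 5 6 6 6 7 7 8

# 501 occurs in two separate blocks (row 1 and rows 10-12),
# so it is assigned two different run ids
print(unique(run_id[unique_id == 501L]))
# 1 6
```

Because each block gets its own id, rows with the same 'Unique Id' that sit in different blocks are never compared with each other.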

Or, if each run of adjacent identical 'Unique Id' values does not have to be treated as its own group, group by 'Unique Id' directly:

data %>%
    group_by(`Unique Id`, `Class Id`) %>%
    slice_max(`Version Id`) %>%
    ungroup() %>%
    distinct()
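
On the question's data both groupings appear to return the same 11 rows, so the difference only shows up when the same id and class recur in separate blocks. The sketch below uses a small hypothetical data frame (toy, with made-up columns id, class and version) and base R's aggregate in place of the dplyr pipeline, purely to show how the two groupings can diverge.

```r
# Hypothetical data: id 1 / class 1 appears in two separate blocks
toy <- data.frame(id      = c(1L, 1L, 2L, 1L),
                  class   = c(1L, 1L, 1L, 1L),
                  version = c(1L, 2L, 1L, 3L))

# Run id per block of adjacent equal ids (base R stand-in for rleid)
run <- with(rle(toy$id), rep(seq_along(values), lengths))

# Grouping by block: each block keeps its own maximum version
by_run <- aggregate(toy["version"],
                    by = list(run = run, class = toy$class), FUN = max)

# Grouping by id: both blocks of id 1 collapse into a single row
by_id <- aggregate(toy["version"],
                   by = list(id = toy$id, class = toy$class), FUN = max)

print(nrow(by_run))  # 3
print(nrow(by_id))   # 2
```

Which of the two is wanted depends on whether a repeated id further down the data should count as a new record or as the same one.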

Or using base R

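# Run id: each block of adjacent equal `Unique Id` values gets its own number
ind <- with(rle(data$`Unique Id`), rep(seq_along(values), lengths))
# Within each block, sort so the highest `Version Id` comes first
data1 <- data[order(ind, -data$`Version Id`),]
# ind is non-decreasing, so it still aligns with data1; keep the first row per (block, `Class Id`)
data1[!duplicated(cbind(ind, data1$`Class Id`)),]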

data

data <- structure(list(`Unique Id` = c(501L, 602L, 602L, 405L, 305L, 
305L, 305L, 305L, 509L, 501L, 501L, 501L, 602L, 602L, 405L), 
    `Class Id` = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 
    3L, 2L, 1L, 1L), `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L, 
    1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)), class = "data.frame", 
    row.names = c(NA, 
-15L))

Answer 2

If the order doesn't matter, we can reorder the data so that higher version IDs come first within each Unique Id and Class Id combination, and then remove duplicated entries.

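# Sort by Unique Id, then Class Id, then Version Id in descending order
df <- df[order(df[,1], df[,2], -df[,3]),]
# Keep only the first row for each (Unique Id, Class Id) pair
df <- df[!duplicated(df[,-3]),]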

df
       Unique Id Class Id Version Id
7        305        1          1
5        305        2          3
15       405        1          1
4        405        2          1
1        501        1          1
10       501        2          1
12       501        3          2
9        509        1          1
14       602        1          1
13       602        2          1
2        602        3          1
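
The snippet above selects columns by position, which assumes the three columns come in exactly this order. A minimal sketch of the same idea using column names is shown below; the data frame is rebuilt from the data in Answer 1, and the names ord, df_sorted and result are chosen only for this illustration.

```r
# Data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  `Unique Id`  = c(501L, 602L, 602L, 405L, 305L, 305L, 305L, 305L,
                   509L, 501L, 501L, 501L, 602L, 602L, 405L),
  `Class Id`   = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 3L, 2L, 1L, 1L),
  `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L),
  check.names = FALSE  # keep the spaces in the column names
)

# Sort so the highest `Version Id` comes first within each id/class pair
ord <- order(df$`Unique Id`, df$`Class Id`, -df$`Version Id`)
df_sorted <- df[ord, ]

# Keep the first row for each (`Unique Id`, `Class Id`) pair
result <- df_sorted[!duplicated(df_sorted[c("Unique Id", "Class Id")]), ]
rownames(result) <- NULL  # optional: renumber the rows

print(result)
```

Resetting the row names is optional; leaving them in place keeps a record of which original rows survived, as in the output above.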
