
How do I remove duplicates based on three columns, but keep the row with the highest number in a specific column, using R?

I have a dataset that looks like this:

 Unique Id|Class Id|Version Id
       501        1          1
       602        3          1
       602        3          1
       405        2          1
       305        2          3
       305        2          2
       305        1          1
       305        2          1
       509        1          1
       501        2          1
       501        3          1
       501        3          2
       602        2          1
       602        1          1
       405        1          1

After running the script, the remaining entries should be:

 Unique Id|Class Id|Version Id
       501        1          1
       602        3          1
       405        2          1
       305        2          3
       305        1          1
       509        1          1
       501        2          1
       501        3          2
       602        2          1
       602        1          1
       405        1          1

Note that the row with Unique Id 501, Class Id 3 and Version Id 2 is kept because it has the highest Version Id for that pair. One of the two rows with Unique Id 602, Class Id 3 and Version Id 1 is deleted because the two rows are exact duplicates.

Basically, I want the script to remove all duplicates based on the three columns and, for each combination of Unique Id and Class Id, keep only the row with the highest Version Id.

We can use rleid on the 'Unique Id' column, group by that run ID together with 'Class Id', and then apply slice_max on 'Version Id':

library(dplyr)
library(data.table)
data %>%      
  group_by(grp = rleid(`Unique Id`), `Class Id`) %>% 
  slice_max(`Version Id`) %>%
  ungroup %>%
  select(-grp) %>%
  distinct

Output:

# A tibble: 11 x 3
#   `Unique Id` `Class Id` `Version Id`
#         <int>      <int>        <int>
# 1         501          1            1
# 2         602          3            1
# 3         405          2            1
# 4         305          1            1
# 5         305          2            3
# 6         509          1            1
# 7         501          2            1
# 8         501          3            2
# 9         602          1            1
#10         602          2            1
#11         405          1            1
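
To make the grouping above more concrete, here is a small sketch of what the run ID looks like for the 'Unique Id' column. It uses base R's rle as a stand-in for data.table::rleid so that it runs without extra packages; the variable names unique_id and run_id are only illustrative.

```r
# 'Unique Id' column from the question's sample data
unique_id <- c(501L, 602L, 602L, 405L, 305L, 305L, 305L, 305L,
               509L, 501L, 501L, 501L, 602L, 602L, 405L)

# One integer per run of consecutive equal values
# (base R stand-in for data.table::rleid)
run_id <- with(rle(unique_id), rep(seq_along(values), lengths))

run_id
# [1] 1 2 2 3 4 4 4 4 5 6 6 6 7 7 8
```

The first 501 row gets run ID 1 while the later block of 501 rows gets run ID 6, so the two blocks are treated as separate groups. In this sample the separated blocks never share a Class Id, so grouping by run ID and grouping by 'Unique Id' directly lead to the same 11 rows.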

Or, if all rows with the same Unique Id should be treated as one group even when they are not adjacent:

data %>%
    group_by(`Unique Id`, `Class Id`) %>%
    slice_max(`Version Id`) %>% 
    ungroup %>% 
    distinct
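
slice_max keeps tied rows by default, which is why the pipeline above finishes with distinct. As a possible variant, assuming dplyr 1.0.0 or later, with_ties = FALSE asks for exactly one row per group, so the final distinct step is not needed. The inline data frame only reproduces the sample data so that the sketch runs on its own.

```r
library(dplyr)

# Sample data from the question, rebuilt inline
data <- data.frame(
  `Unique Id`  = c(501L, 602L, 602L, 405L, 305L, 305L, 305L, 305L,
                   509L, 501L, 501L, 501L, 602L, 602L, 405L),
  `Class Id`   = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L,
                   1L, 2L, 3L, 3L, 2L, 1L, 1L),
  `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L,
                   1L, 1L, 1L, 2L, 1L, 1L, 1L),
  check.names = FALSE
)

# Keep a single row per ('Unique Id', 'Class Id') pair:
# the one with the highest 'Version Id', ties broken arbitrarily
result <- data %>%
  group_by(`Unique Id`, `Class Id`) %>%
  slice_max(`Version Id`, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(result)
# [1] 11
```

This is only safe here because tied rows are identical in every column; if the data had further columns that differ between tied rows, dropping ties would pick one of them arbitrarily.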

Or using base R:

ind <- with(rle(data$`Unique Id`), rep(seq_along(values), lengths))
data1 <- data[order(ind, -data$`Version Id`),]
data1[!duplicated(cbind(ind, data1$`Class Id`)),]

Data:

data <- structure(list(`Unique Id` = c(501L, 602L, 602L, 405L, 305L, 
305L, 305L, 305L, 509L, 501L, 501L, 501L, 602L, 602L, 405L), 
    `Class Id` = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 
    3L, 2L, 1L, 1L), `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L, 
    1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)), class = "data.frame", 
    row.names = c(NA, 
-15L))

If the row order doesn't matter, we can sort the data so that higher Version Ids come first within each Unique Id and Class Id pair, and then remove the duplicated entries:

df <- df[order(df[,1], df[,2], -df[,3]),]
df <- df[!duplicated(df[,-3]),]

df
       Unique Id Class Id Version Id
7        305        1          1
5        305        2          3
15       405        1          1
4        405        2          1
1        501        1          1
10       501        2          1
12       501        3          2
9        509        1          1
14       602        1          1
13       602        2          1
2        602        3          1
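
Since 'Version Id' is the only column outside the grouping key, the sort-then-deduplicate idea can also be written as a grouped maximum. The following is a rough base R sketch using aggregate; the syntactic column names (unique_id, class_id, version_id) are an assumption made here to keep the formula simple and are not the names used in the question.

```r
# Sample data from the question, with renamed (hypothetical) columns
df <- data.frame(
  unique_id  = c(501L, 602L, 602L, 405L, 305L, 305L, 305L, 305L,
                 509L, 501L, 501L, 501L, 602L, 602L, 405L),
  class_id   = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L,
                 1L, 2L, 3L, 3L, 2L, 1L, 1L),
  version_id = c(1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L,
                 1L, 1L, 1L, 2L, 1L, 1L, 1L)
)

# Highest version_id for every (unique_id, class_id) combination
res <- aggregate(version_id ~ unique_id + class_id, data = df, FUN = max)

nrow(res)
# [1] 11
```

This shortcut would not carry over to data with additional columns, because aggregate returns only the key columns and the summarised value rather than whole rows.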
