
Remove duplicates but keep the row based on a specific column

I have a large dataset that was built by combining data from multiple sources. Hence, there are a number of rows that are duplicates. I know how to remove duplicates using dplyr and distinct, but I would like it to always keep the row based on a specific value in a cell (source file). Essentially, we have a ranking of which sources we prefer. Below is a very simplified dataset to use as an example:

mydata <- data.frame(
  species = c('myli','myli','myli','myli','myli','stili','stili','stili'),
  count   = c(10, 10, 15, 15, 12, 10, 10, 10),
  year    = c(2020, 2020, 2021, 2021, 2019, 2017, 2017, 2018),
  source  = c('zd','steam','ted','steam','zd','steam','ted','steam')
)

mydata

  species count year source
1    myli    10 2020     zd
2    myli    10 2020  steam
3    myli    15 2021    ted
4    myli    15 2021  steam
5    myli    12 2019     zd
6   stili    10 2017  steam
7   stili    10 2017    ted
8   stili    10 2018  steam

I do the following to remove the duplicates:

library(dplyr)
 
# Remove duplicate rows of the dataframe using 'species', 'count', and 'year' variables
distinct(mydata, species, count, year, .keep_all = TRUE)

  species count year source
1    myli    10 2020     zd
2    myli    15 2021    ted
3    myli    12 2019     zd
4   stili    10 2017  steam
5   stili    10 2018  steam

However, I want to ensure that the rows that are kept when there are duplicates prioritize the 'source' in the following order: zd > ted > steam, so that the final table looks like:

  species count year source
1    myli    10 2020     zd
2    myli    15 2021    ted
3    myli    12 2019     zd
4   stili    10 2017    ted
5   stili    10 2018  steam

So essentially the original rows '1', '3', '5', '7', and '8' are kept, and the duplicate rows '2', '4', and '6' are dropped.

I would appreciate any suggestions on how to do that last step: prioritizing which of the duplicated rows to keep.

Thank you very much, Amanda

Since your prioritization happens to be in reverse alphabetical order, in this case you can simply arrange(desc(source)) prior to your distinct() call:

mydata %>% 
  arrange(desc(source)) %>% 
  distinct(species, count, year, .keep_all = TRUE)

Output

  species count year source
1    myli    10 2020     zd
2    myli    12 2019     zd
3    myli    15 2021    ted
4   stili    10 2017    ted
5   stili    10 2018  steam

distinct() respects the row ordering. So, since your preference happens to follow reverse alphabetical order*, you can do it as simply as this:

mydata |>
  arrange(desc(source)) |>
  distinct(species, count, year, .keep_all = TRUE)

* In other cases you would need to create a variable that encodes the preference order.
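For an arbitrary preference order that does not line up with alphabetical sorting, one common approach (a sketch, not from the answers above; the priority vector here is the asker's stated zd > ted > steam ranking) is to sort by match() against an explicit ranking vector before calling distinct():

```r
library(dplyr)

mydata <- data.frame(
  species = c('myli','myli','myli','myli','myli','stili','stili','stili'),
  count   = c(10, 10, 15, 15, 12, 10, 10, 10),
  year    = c(2020, 2020, 2021, 2021, 2019, 2017, 2017, 2018),
  source  = c('zd','steam','ted','steam','zd','steam','ted','steam')
)

# Explicit preference ranking: earlier entries win when rows are duplicated
priority <- c('zd', 'ted', 'steam')

cleaned <- mydata %>%
  arrange(match(source, priority)) %>%            # preferred sources sort first
  distinct(species, count, year, .keep_all = TRUE) # keeps the first row per group

cleaned
```

Because match() returns the position of each source in the priority vector, arrange() puts the most-preferred source first within each duplicate group, and distinct() then keeps that first row. This works for any ranking, regardless of alphabetical order.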
