I have a large dataset that was built by combining data from multiple sources. Hence, there are a number of rows that are duplicates. I know how to remove duplicates using dplyr and distinct but I would like to have it always keep the row based on a specific value in a cell (source file). Essentially we have a ranking of which sources we prefer. Below is a very simplified dataset to use as an example:
mydata = data.frame (species =c ('myli','myli','myli','myli','myli','stili','stili','stili'),
count = c (10,10,15,15,12,10,10,10),
year =c(2020,2020,2021,2021,2019,2017,2017,2018),
source =c('zd','steam','ted','steam','zd','steam','ted','steam'))
mydata
species count year source
1 myli 10 2020 zd
2 myli 10 2020 steam
3 myli 15 2021 ted
4 myli 15 2021 steam
5 myli 12 2019 zd
6 stili 10 2017 steam
7 stili 10 2017 ted
8 stili 10 2018 steam
I do the following to remove the duplicates:
library(dplyr)
# Remove duplicate rows of the dataframe using 'species', 'count', and 'year' variables
distinct(mydata, species, count, year, .keep_all= TRUE)
species count year source
1 myli 10 2020 zd
2 myli 15 2021 ted
3 myli 12 2019 zd
4 stili 10 2017 steam
5 stili 10 2018 steam
However, I want to ensure that the rows that are kept when there are duplicates prioritize the 'source' in the following order: zd > ted > steam so the final table looks like:
species count year source
1 myli 10 2020 zd
2 myli 15 2021 ted
3 myli 12 2019 zd
4 stili 10 2017 ted
5 stili 10 2018 steam
So essentially the original rows '1', '3','5', '7' and '8' are kept and the duplicate rows '2','4', and '6' are dropped.
I appreciate any suggestions on how to do that last step to prioritize which original row to keep of the duplicated rows.
Thank you very much, Amanda
Since your prioritization happens to be in reverse alphabetical order, in this case you can simply arrange(desc(source))
prior to your distinct()
call
mydata %>%
arrange(desc(source)) %>%
distinct(species,count,year,.keep_all = T)
Output
species count year source
1 myli 10 2020 zd
2 myli 12 2019 zd
3 myli 15 2021 ted
4 stili 10 2017 ted
5 stili 10 2018 steam
Distinct respect the ordering. So as your criteria is alphabetically ordered* you can do it as simple as this:
mydata |>
arrange(desc(source)) |>
distinct(species, count, year, .keep_all= TRUE)
.* In other cases you'd need to make a variable with the order.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.