简体   繁体   中英

Delete duplicated rows, selecting specific values to retain

I am working on a raw dataset that looks something like this:

df <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", "Alpha", 
                                "Beta","Beta", "Beta","Beta" ), 
                 "Year" = c(1970, 1970, 1980, 1990, 1970, 1980, 
                            1980,1990), 
                 "Val" = c(1,0,1,1,0,0,0,1),
                 "Val2" = c(0,0,0,1,0,1,0,1),
                 "Val3" = c(0,2.34,1.3,0,0,2.34,3.2,1.3))

The data is a bit dirty as I have multiple observations for each ID and Year identifier - eg I have 2 different rows for Alpha in 1970. The same holds for Beta in 1980.

The issue is that the variable of interest Val Val2 Val3 have different scores in the duplicated rows (in terms of id/year).

I would like to find a concise way to produce the following final dataframe:

final<- data.frame("ID" = c("Alpha", "Alpha", "Alpha", 
                           "Beta", "Beta","Beta" ), 
                  "Year" = c( 1970, 1980, 1990, 1970, 
                             1980,1990), 
                  "Val" = c(1,1,1,0,0,1),
                  "Val2" = c(0,0,1,0,1,1),
                  "Val3" = c(2.34,1,0,0,3.2,1.3))

final

The logic is the following:

1) I want to have only one obs for every ID/year

2) For each variable of interest( Val Val2 Val3 ), I want only to retain the observation with the higher value in the single ID/year output

To illustrate. In df there is a duplicated observation for Alpha/1970. I want to reduce this to a single row. Val takes the value of 0 and 1, and in the final dataframe only 1 is retained. Val2 takes the value of 0 and 0, so I simply keep 0 in final . Val2 takes the value of 2.34 and 1, hence I want to retain in the final dataframe.

I guess I could do this one column at the time using this data.table command

library(dplyr)
val1<- df %>% select(ID:Val)
library(data.table)
setDT(val1)[order(-Val)][,.SD[1,], by = .(ID, Year)]
val2<- df %>% select(ID,Year,Val2)
library(data.table)
setDT(val2)[order(-Val2)][,.SD[1,], by = .(ID, Year)]

and then merge the values together but there are several columns for which I have to run the operation and I was looking for a more concise way to do it.

I hope it is sufficiently clear

Thanks a lot for your help in advance

Here is one option

library(dplyr)
df %>% 
  group_by(ID, Year) %>% 
  summarise_all(max)
  # for specific set of columns, use `summarise_at`
  # summarise_at(vars(starts_with("Val")), max)

or with data.table

library(data.table)
nm1 <- grep('^Val\\d*$", names(df))
setDT(df)[, (nm1) := lapply(.SD, max), .(ID, Year), .SDcols = nm1]

You can do:

aggregate(. ~ ID + Year, FUN =  max, data = df)

     ID Year Val Val2 Val3
1 Alpha 1970   1    0 2.34
2  Beta 1970   0    0 0.00
3 Alpha 1980   1    0 1.30
4  Beta 1980   0    1 3.20
5 Alpha 1990   1    1 0.00
6  Beta 1990   1    1 1.30

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM