简体   繁体   中英

How to aggregate duplicated rows in data.table by applying a specific function

I have a data.table (r1) that contains duplicated values for site and time. Here I create this examplary data table already showing why these duplicates appear, ie that I initially have two data tables that I merge. Perhaps my problem can be solved already at the stage of merging dt1 and dt2..

dt1 <- data.table(site=c(1,1,2,2), site_type="type1", time=c(1,2,1,2), temp=c(10,12,13,NA), prec=c(10,101,1,1) )
dt2 <- data.table(site=c(3,3,2,2 ), site_type="type2", time=c(1,2,1,2), temp=c(10,12,100,140), prec=c(10,101,1000,NA), snow=c(1,1,1,1))
r1 <- rbindlist(list(dt1,dt2), fill=T)

Now, I would like to aggregate all duplicated rows (4 and 8, as well as 3 and 7) so that for the columns 'temp', 'prec', 'snow' the values will be taken from site_type='type1' unless it is NA.

The dirty solution that I found is to create a subset of r1 containing duplicated rows, and a subset with no duplicates.

duplicates <- r1[duplicated(r1,by=c("site","time")) | duplicated(r1,by=c("site","time"), fromLast=TRUE)]

no_duplicates <- r1[!(duplicated(r1,by=c("site","time")) | duplicated(r1,by=c("site","time"), fromLast=TRUE))]

Then to aggregate the duplicates wiht lapply.

aggregated_duplicates <- duplicates[, lapply(.SD, function(x) ifelse(!is.na(x[site_type=="type1"]),x[site_type=="type1"],x[site_type=="type2"])), by=c("site","time")]

And then to merge the aggregated_duplicates and no_duplicates data tables.

r1_without_duplicates <- rbindlist(list(no_duplicates, aggregated_duplicates), use.names = T)

The approach seems to work, but I have the feeling that in data table this could be done in one line either while merging dt1 and dt2, or by a smart use of 'duplicated' or 'unique'.

Any ideas?

Thanks!

Since there are 2 distinct actions in your requirements, namely, (i) use values from type2 if there is a NA in type1 and (ii) row binding 2 different data.table, I do not think that there is a single liner code to perform both simultaneously.

Another possible approach is to update dt1 for values that are NAs first before performing the rbindlist and removing dupes:

cols <- c("temp", "prec", "prec")
dt <- copy(dt1)
for (j in cols) {
    dt[is.na(get(j)), (j) := dt2[.SD, on=c("site","time"), j, with=FALSE]]
}
r2 <- rbindlist(list(dt, dt2), use.names=TRUE, fill=TRUE)
r2[!duplicated(r2, by=c("site","time"))]

output:

   site site_type time temp prec snow
1:    1     type1    1   10   10   NA
2:    1     type1    2   12  101   NA
3:    2     type1    1   13    1   NA
4:    2     type1    2  140    1   NA
5:    3     type2    1   10   10    1
6:    3     type2    2   12  101    1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM