I have a data.table (r1) that contains duplicated values for site and time. Here I create this examplary data table already showing why these duplicates appear, ie that I initially have two data tables that I merge. Perhaps my problem can be solved already at the stage of merging dt1 and dt2..
dt1 <- data.table(site=c(1,1,2,2), site_type="type1", time=c(1,2,1,2), temp=c(10,12,13,NA), prec=c(10,101,1,1) )
dt2 <- data.table(site=c(3,3,2,2 ), site_type="type2", time=c(1,2,1,2), temp=c(10,12,100,140), prec=c(10,101,1000,NA), snow=c(1,1,1,1))
r1 <- rbindlist(list(dt1,dt2), fill=T)
Now, I would like to aggregate all duplicated rows (4 and 8, as well as 3 and 7) so that for the columns 'temp', 'prec', 'snow' the values will be taken from site_type='type1' unless it is NA.
The dirty solution that I found is to create a subset of r1 containing duplicated rows, and a subset with no duplicates.
duplicates <- r1[duplicated(r1,by=c("site","time")) | duplicated(r1,by=c("site","time"), fromLast=TRUE)]
no_duplicates <- r1[!(duplicated(r1,by=c("site","time")) | duplicated(r1,by=c("site","time"), fromLast=TRUE))]
Then to aggregate the duplicates wiht lapply.
aggregated_duplicates <- duplicates[, lapply(.SD, function(x) ifelse(!is.na(x[site_type=="type1"]),x[site_type=="type1"],x[site_type=="type2"])), by=c("site","time")]
And then to merge the aggregated_duplicates and no_duplicates data tables.
r1_without_duplicates <- rbindlist(list(no_duplicates, aggregated_duplicates), use.names = T)
The approach seems to work, but I have the feeling that in data table this could be done in one line either while merging dt1 and dt2, or by a smart use of 'duplicated' or 'unique'.
Any ideas?
Thanks!
Since there are 2 distinct actions in your requirements, namely, (i) use values from type2 if there is a NA in type1 and (ii) row binding 2 different data.table, I do not think that there is a single liner code to perform both simultaneously.
Another possible approach is to update dt1
for values that are NAs first before performing the rbindlist
and removing dupes:
cols <- c("temp", "prec", "prec")
dt <- copy(dt1)
for (j in cols) {
dt[is.na(get(j)), (j) := dt2[.SD, on=c("site","time"), j, with=FALSE]]
}
r2 <- rbindlist(list(dt, dt2), use.names=TRUE, fill=TRUE)
r2[!duplicated(r2, by=c("site","time"))]
output:
site site_type time temp prec snow
1: 1 type1 1 10 10 NA
2: 1 type1 2 12 101 NA
3: 2 type1 1 13 1 NA
4: 2 type1 2 140 1 NA
5: 3 type2 1 10 10 1
6: 3 type2 2 12 101 1
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.