I have the following data.table:
dt <- data.table(date=rep(c(2014,2013), each=4), price=c(3.14, 1.45, 3.4 ,5.1, 1, 2.3, 2.79, 3), brand=rep(c("Mercedes", "Audi"), each=4), num=c(3,6,7,8,3,5,9,12), seller=rep(c("gregory", "dan"), each=4))
Resulting in (shown sorted by date):
date price brand num seller
1: 2013 1.00 Audi 3 dan
2: 2013 2.30 Audi 5 dan
3: 2013 2.79 Audi 9 dan
4: 2013 3.00 Audi 12 dan
5: 2014 3.14 Mercedes 3 gregory
6: 2014 1.45 Mercedes 6 gregory
7: 2014 3.40 Mercedes 7 gregory
8: 2014 5.10 Mercedes 8 gregory
My target is now to have this:
date num price brand seller
1: 2013 3 1.00 Audi dan
2: 2013 5 2.30 Audi dan
3: 2013 6 NA Audi dan
4: 2013 7 NA Audi dan
5: 2013 8 NA Audi dan
6: 2013 9 2.79 Audi dan
7: 2013 12 3.00 Audi dan
8: 2014 3 3.14 Mercedes gregory
9: 2014 5 NA Mercedes gregory
10: 2014 6 1.45 Mercedes gregory
11: 2014 7 3.40 Mercedes gregory
12: 2014 8 5.10 Mercedes gregory
13: 2014 9 NA Mercedes gregory
14: 2014 12 NA Mercedes gregory
I first add rows for the missing num values for every date:
setkey(dt, date, num)
dtt <- dt[CJ(unique(date), unique(num))]
Giving this first step:
date num price brand seller
1: 2013 3 1.00 Audi dan
2: 2013 5 2.30 Audi dan
3: 2013 6 NA NA NA
4: 2013 7 NA NA NA
5: 2013 8 NA NA NA
6: 2013 9 2.79 Audi dan
7: 2013 12 3.00 Audi dan
8: 2014 3 3.14 Mercedes gregory
9: 2014 5 NA NA NA
10: 2014 6 1.45 Mercedes gregory
11: 2014 7 3.40 Mercedes gregory
12: 2014 8 5.10 Mercedes gregory
13: 2014 9 NA NA NA
14: 2014 12 NA NA NA
And then:
dtt[date==2013, c("brand","seller"):=list("Audi","dan")]
dtt[date==2014, c("brand","seller"):=list("Mercedes","gregory")]
This gives the desired result.
However:
1 - The last piece of code is awful.
2 - I would like to make a generic function (or a join), because my real data.table has many different dates and many columns whose NAs should be either replaced or kept.
It seems simple, but I am stuck!
How about:
require(data.table) ## 1.9.2
setkey(dt, num)
nums = unique(dt$num)
dt[, list(price=.SD[J(nums)]$price, brand=brand[1L],
          num=nums, seller=seller[1L]), by=date]
# date price brand num seller
# 1: 2014 3.14 Mercedes 3 gregory
# 2: 2014 NA Mercedes 5 gregory
# 3: 2014 1.45 Mercedes 6 gregory
# 4: 2014 3.40 Mercedes 7 gregory
# 5: 2014 5.10 Mercedes 8 gregory
# 6: 2014 NA Mercedes 9 gregory
# 7: 2014 NA Mercedes 12 gregory
# 8: 2013 1.00 Audi 3 dan
# 9: 2013 2.30 Audi 5 dan
# 10: 2013 NA Audi 6 dan
# 11: 2013 NA Audi 7 dan
# 12: 2013 NA Audi 8 dan
# 13: 2013 2.79 Audi 9 dan
# 14: 2013 3.00 Audi 12 dan
or alternatively:
dt[, c(.SD[J(nums), list(price=price)], brand=brand[1L],
       seller=seller[1L]), by=date]
where the order of columns will be different.
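As a variation on the two calls above, the keyed join on .SD can be swapped for base R's match(), which needs no key at all. This is only a sketch of a different technique, not the join-based approach itself; it assumes each num appears at most once per date, as in the sample data.

```r
library(data.table)

# Sample data from the question
dt <- data.table(date   = rep(c(2014, 2013), each = 4),
                 price  = c(3.14, 1.45, 3.4, 5.1, 1, 2.3, 2.79, 3),
                 brand  = rep(c("Mercedes", "Audi"), each = 4),
                 num    = c(3, 6, 7, 8, 3, 5, 9, 12),
                 seller = rep(c("gregory", "dan"), each = 4))

nums <- sort(unique(dt$num))

# match(nums, num) returns, for every wanted num, its position inside the
# current date group, or NA when that num is absent. Indexing price with
# it therefore yields NA for the missing combinations.
res <- dt[, list(num    = nums,
                 price  = price[match(nums, num)],
                 brand  = brand[1L],
                 seller = seller[1L]),
          by = date]
res
```

The groups come out in the order in which each date first appears in dt, so 2014 is listed before 2013 here.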
In 1.9.3, this will be much more efficient (in terms of both syntax and speed), because we don't have to join and return all the columns:
## 1.9.3
dt[, list(price=.SD[J(nums), price], brand=brand[1L],
          num=nums, seller=seller[1L]), by=date]
Here, .SD[J(nums), price] will result in a vector, as opposed to a data.table in previous versions, and it will not perform an implicit by (by-without-by), so it will be faster as well.
For details, have a look at points 1 and 2 under the new FRs implemented for v1.9.3.
HTH
You could use the roll argument to fill the NA's with the nearest values. The problem is that this will also fill the price, but that's easy to remedy:
setkey(dt, date, num)
dt[CJ(unique(date), unique(num)), roll = 'nearest'][!dt, price := NA][]
# date price brand num seller
# 1: 2013 1.00 Audi 3 dan
# 2: 2013 2.30 Audi 5 dan
# 3: 2013 NA Audi 6 dan
# 4: 2013 NA Audi 7 dan
# 5: 2013 NA Audi 8 dan
# 6: 2013 2.79 Audi 9 dan
# 7: 2013 3.00 Audi 12 dan
# 8: 2014 3.14 Mercedes 3 gregory
# 9: 2014 NA Mercedes 5 gregory
#10: 2014 1.45 Mercedes 6 gregory
#11: 2014 3.40 Mercedes 7 gregory
#12: 2014 5.10 Mercedes 8 gregory
#13: 2014 NA Mercedes 9 gregory
#14: 2014 NA Mercedes 12 gregory
I think this should be much faster than the .SD[...] solution.
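Since the question asks for something generic, the same idea can be wrapped in a small function. The following is a minimal sketch that uses two merges instead of roll; complete_groups and its arguments group_col, seq_col and const_cols are hypothetical names made up for illustration, not part of data.table. It assumes the columns in const_cols take a single value within each group.

```r
library(data.table)

# Hypothetical helper (not a data.table function).
# group_col:  column defining the groups (e.g. "date")
# seq_col:    column whose missing values should be added (e.g. "num")
# const_cols: columns assumed constant within a group, to be filled in
complete_groups <- function(x, group_col, seq_col, const_cols) {
  # Every combination of group value and sequence value
  grid <- CJ(unique(x[[group_col]]), unique(x[[seq_col]]))
  setnames(grid, c(group_col, seq_col))

  # One row of constants per group
  const_keep <- c(group_col, const_cols)
  consts <- unique(x[, const_keep, with = FALSE])

  # All remaining columns stay NA for the added rows
  other_keep <- setdiff(names(x), const_cols)

  out <- merge(grid, consts, by = group_col)
  merge(out, x[, other_keep, with = FALSE],
        by = c(group_col, seq_col), all.x = TRUE)
}

dt <- data.table(date   = rep(c(2014, 2013), each = 4),
                 price  = c(3.14, 1.45, 3.4, 5.1, 1, 2.3, 2.79, 3),
                 brand  = rep(c("Mercedes", "Audi"), each = 4),
                 num    = c(3, 6, 7, 8, 3, 5, 9, 12),
                 seller = rep(c("gregory", "dan"), each = 4))

res <- complete_groups(dt, "date", "num", c("brand", "seller"))
res
```

If a column in const_cols is not actually constant within a group, the first merge produces one row per distinct combination, so the result will contain duplicated date/num pairs; checking that before relying on the output is advisable.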