简体   繁体   中英

R filling partially NA in data.table

I have the following data.table:

dt <- data.table(date=rep(c(2014,2013), each=4), price=c(3.14, 1.45, 3.4 ,5.1, 1, 2.3, 2.79, 3), brand=rep(c("Mercedes", "Audi"), each=4), num=c(3,6,7,8,3,5,9,12), seller=rep(c("gregory", "dan"), each=4))

Resulting in:

   date price    brand num  seller
1: 2013  1.00     Audi   3     dan
2: 2013  2.30     Audi   5     dan
3: 2013  2.79     Audi   9     dan
4: 2013  3.00     Audi  12     dan
5: 2014  3.14 Mercedes   3 gregory
6: 2014  1.45 Mercedes   6 gregory
7: 2014  3.40 Mercedes   7 gregory
8: 2014  5.10 Mercedes   8 gregory

My target is now to have this:

    date num price    brand  seller
 1: 2013   3  1.00     Audi     dan
 2: 2013   5  2.30     Audi     dan
 3: 2013   6    NA     Audi     dan
 4: 2013   7    NA     Audi     dan
 5: 2013   8    NA     Audi     dan
 6: 2013   9  2.79     Audi     dan
 7: 2013  12  3.00     Audi     dan
 8: 2014   3  3.14 Mercedes gregory
 9: 2014   5    NA Mercedes gregory
10: 2014   6  1.45 Mercedes gregory
11: 2014   7  3.40 Mercedes gregory
12: 2014   8  5.10 Mercedes gregory
13: 2014   9    NA Mercedes gregory
14: 2014  12    NA Mercedes gregory

I first add lines for the missing num for every date:

setkey(dt, date, num)
dtt<-dt[CJ(unique(date), unique(dt[,num]))]

Giving this first step:

    date num price    brand  seller
 1: 2013   3  1.00     Audi     dan
 2: 2013   5  2.30     Audi     dan
 3: 2013   6    NA       NA      NA
 4: 2013   7    NA       NA      NA
 5: 2013   8    NA       NA      NA
 6: 2013   9  2.79     Audi     dan
 7: 2013  12  3.00     Audi     dan
 8: 2014   3  3.14 Mercedes gregory
 9: 2014   5    NA       NA      NA
10: 2014   6  1.45 Mercedes gregory
11: 2014   7  3.40 Mercedes gregory
12: 2014   8  5.10 Mercedes gregory
13: 2014   9    NA       NA      NA
14: 2014  12    NA       NA      NA

And then:

dtt[date==2013, c("brand","seller"):=list("Audi","dan")]
dtt[date==2014, c("brand","seller"):=list("Mercedes","gregory")]

Gives the wanted result.

However:

1 - the last piece of code is awfull.

2 - I would like to make a generic function (or a join) because I have lots of different dates and columns to replace/keep the NA's in my real data.table.

It seems simple but I am stuck!

How about:

require(data.table) ## 1.9.2
setkey(dt, num)
nums = unique(dt$num)
dt[, list(price=.SD[J(nums)]$price, brand=brand[1L], 
          num=nums, seller=seller[1L]), by=date]
#     date price    brand num  seller
#  1: 2014  3.14 Mercedes   3 gregory
#  2: 2014    NA Mercedes   5 gregory
#  3: 2014  1.45 Mercedes   6 gregory
#  4: 2014  3.40 Mercedes   7 gregory
#  5: 2014  5.10 Mercedes   8 gregory
#  6: 2014    NA Mercedes   9 gregory
#  7: 2014    NA Mercedes  12 gregory
#  8: 2013  1.00     Audi   3     dan
#  9: 2013  2.30     Audi   5     dan
# 10: 2013    NA     Audi   6     dan
# 11: 2013    NA     Audi   7     dan
# 12: 2013    NA     Audi   8     dan
# 13: 2013  2.79     Audi   9     dan
# 14: 2013  3.00     Audi  12     dan

or alternatively:

dt[, c(.SD[J(nums), list(price=price)], brand=brand[1L], 
           seller=seller[1L]), by=date]

where the order of columns will be different.


In 1.9.3 , this'll be much more efficient (in terms of both syntax and speed), because we don't have to join and return all the columns:

## 1.9.3
dt[, list(price=.SD[J(nums), price], brand=brand[1L], 
          num=nums, seller=seller[1L]), by=date]

.SD[J(nums), price] will result in a vector, as opposed to a data.table in previous versions and will not perform an implicit by (by-without-by) and will therefore be faster as well.

Have a look at under the new FRs implemented (points 1 and 2) for v1.9.3 here for details.

HTH

You could use the roll argument to fill the NA 's with nearest values. The problem is that will also fill the price , but that's easy to remedy:

setkey(dt, date, num)

dt[CJ(unique(date), unique(num)), roll = 'nearest'][!dt, price := NA][]
#    date price    brand num  seller
# 1: 2013  1.00     Audi   3     dan
# 2: 2013  2.30     Audi   5     dan
# 3: 2013    NA     Audi   6     dan
# 4: 2013    NA     Audi   7     dan
# 5: 2013    NA     Audi   8     dan
# 6: 2013  2.79     Audi   9     dan
# 7: 2013  3.00     Audi  12     dan
# 8: 2014  3.14 Mercedes   3 gregory
# 9: 2014    NA Mercedes   5 gregory
#10: 2014  1.45 Mercedes   6 gregory
#11: 2014  3.40 Mercedes   7 gregory
#12: 2014  5.10 Mercedes   8 gregory
#13: 2014    NA Mercedes   9 gregory
#14: 2014    NA Mercedes  12 gregory

I think this should be much faster than the .SD[...] solution.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM