R: partially filling NAs in a data.table

I have the following data.table:

library(data.table)
dt <- data.table(date=rep(c(2014,2013), each=4), price=c(3.14, 1.45, 3.4 ,5.1, 1, 2.3, 2.79, 3), brand=rep(c("Mercedes", "Audi"), each=4), num=c(3,6,7,8,3,5,9,12), seller=rep(c("gregory", "dan"), each=4))

Resulting in (rows shown sorted by date):

   date price    brand num  seller
1: 2013  1.00     Audi   3     dan
2: 2013  2.30     Audi   5     dan
3: 2013  2.79     Audi   9     dan
4: 2013  3.00     Audi  12     dan
5: 2014  3.14 Mercedes   3 gregory
6: 2014  1.45 Mercedes   6 gregory
7: 2014  3.40 Mercedes   7 gregory
8: 2014  5.10 Mercedes   8 gregory

My goal is to get this:

    date num price    brand  seller
 1: 2013   3  1.00     Audi     dan
 2: 2013   5  2.30     Audi     dan
 3: 2013   6    NA     Audi     dan
 4: 2013   7    NA     Audi     dan
 5: 2013   8    NA     Audi     dan
 6: 2013   9  2.79     Audi     dan
 7: 2013  12  3.00     Audi     dan
 8: 2014   3  3.14 Mercedes gregory
 9: 2014   5    NA Mercedes gregory
10: 2014   6  1.45 Mercedes gregory
11: 2014   7  3.40 Mercedes gregory
12: 2014   8  5.10 Mercedes gregory
13: 2014   9    NA Mercedes gregory
14: 2014  12    NA Mercedes gregory

I first add rows for the missing num values for every date:

setkey(dt, date, num)
dtt<-dt[CJ(unique(date), unique(dt[,num]))]

This gives the following intermediate result:

    date num price    brand  seller
 1: 2013   3  1.00     Audi     dan
 2: 2013   5  2.30     Audi     dan
 3: 2013   6    NA       NA      NA
 4: 2013   7    NA       NA      NA
 5: 2013   8    NA       NA      NA
 6: 2013   9  2.79     Audi     dan
 7: 2013  12  3.00     Audi     dan
 8: 2014   3  3.14 Mercedes gregory
 9: 2014   5    NA       NA      NA
10: 2014   6  1.45 Mercedes gregory
11: 2014   7  3.40 Mercedes gregory
12: 2014   8  5.10 Mercedes gregory
13: 2014   9    NA       NA      NA
14: 2014  12    NA       NA      NA
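
To make the first step easier to follow, here is a minimal sketch of what the cross join produces on its own. The values are typed in by hand for illustration, and the name grid is an assumption, not something from the original post. CJ() returns every combination of its arguments, sorted and keyed, which is why joining dt against it yields one row per (date, num) pair with NA wherever dt has no match.

```r
library(data.table)

# Every combination of the two dates and the seven distinct num values.
# `grid` is an illustrative name; the values are copied from the example data.
grid <- CJ(date = c(2013, 2014), num = c(3, 5, 6, 7, 8, 9, 12))

nrow(grid)  # 2 dates * 7 nums = 14 rows
key(grid)   # CJ sorts its result and keys it by all columns: "date" "num"
```

Joining with dt[grid, on = c("date", "num")] should give the same 14 rows as the keyed join above, without requiring setkey first.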

And then:

dtt[date==2013, c("brand","seller"):=list("Audi","dan")]
dtt[date==2014, c("brand","seller"):=list("Mercedes","gregory")]

This gives the desired result.

However:

1 - The last piece of code is awful.

2 - I would like a generic function (or a join), because my real data.table has many different dates and many columns in which the NAs must either be replaced or kept.

It seems simple, but I am stuck!

How about:

require(data.table) ## 1.9.2
setkey(dt, num)
nums = unique(dt$num)
dt[, list(price=.SD[J(nums)]$price, brand=brand[1L], 
          num=nums, seller=seller[1L]), by=date]
#     date price    brand num  seller
#  1: 2014  3.14 Mercedes   3 gregory
#  2: 2014    NA Mercedes   5 gregory
#  3: 2014  1.45 Mercedes   6 gregory
#  4: 2014  3.40 Mercedes   7 gregory
#  5: 2014  5.10 Mercedes   8 gregory
#  6: 2014    NA Mercedes   9 gregory
#  7: 2014    NA Mercedes  12 gregory
#  8: 2013  1.00     Audi   3     dan
#  9: 2013  2.30     Audi   5     dan
# 10: 2013    NA     Audi   6     dan
# 11: 2013    NA     Audi   7     dan
# 12: 2013    NA     Audi   8     dan
# 13: 2013  2.79     Audi   9     dan
# 14: 2013  3.00     Audi  12     dan

or alternatively:

dt[, c(.SD[J(nums), list(price=price)], brand=brand[1L], 
           seller=seller[1L]), by=date]

where the order of the columns will be different.


In 1.9.3, this will be much more efficient (in terms of both syntax and speed), because we don't have to join and return all the columns:

## 1.9.3
dt[, list(price=.SD[J(nums), price], brand=brand[1L], 
          num=nums, seller=seller[1L]), by=date]

.SD[J(nums), price] returns a vector, rather than a data.table as in previous versions. It also does not perform an implicit by (by-without-by), so it will be faster as well.
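
For readers on a recent data.table release, the same idea can be sketched without setting a key, by passing on= to the inner join. This is an adaptation rather than the code from the answer: it assumes a version that supports the on= argument, and the result name res is illustrative.

```r
library(data.table)

dt <- data.table(date = rep(c(2014, 2013), each = 4),
                 price = c(3.14, 1.45, 3.4, 5.1, 1, 2.3, 2.79, 3),
                 brand = rep(c("Mercedes", "Audi"), each = 4),
                 num = c(3, 6, 7, 8, 3, 5, 9, 12),
                 seller = rep(c("gregory", "dan"), each = 4))

nums <- sort(unique(dt$num))

# Per date: look up each value of nums in that group's rows.
# The inner join returns price as a plain vector, with NA where num is absent.
res <- dt[, list(num = nums,
                 price = .SD[.(num = nums), price, on = "num"],
                 brand = brand[1L],
                 seller = seller[1L]),
          by = date]
```

Because no key is set, the groups appear in their original order (2014 first), and brand and seller are taken from the first row of each group, so this only suits columns that are constant within a date.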

For details, see points 1 and 2 under the new feature requests (FRs) implemented for v1.9.3.

HTH

You could use the roll argument to fill the NAs with the nearest values. The problem is that this will also fill the price, but that's easy to remedy:

setkey(dt, date, num)

dt[CJ(unique(date), unique(num)), roll = 'nearest'][!dt, price := NA][]
#    date price    brand num  seller
# 1: 2013  1.00     Audi   3     dan
# 2: 2013  2.30     Audi   5     dan
# 3: 2013    NA     Audi   6     dan
# 4: 2013    NA     Audi   7     dan
# 5: 2013    NA     Audi   8     dan
# 6: 2013  2.79     Audi   9     dan
# 7: 2013  3.00     Audi  12     dan
# 8: 2014  3.14 Mercedes   3 gregory
# 9: 2014    NA Mercedes   5 gregory
#10: 2014  1.45 Mercedes   6 gregory
#11: 2014  3.40 Mercedes   7 gregory
#12: 2014  5.10 Mercedes   8 gregory
#13: 2014    NA Mercedes   9 gregory
#14: 2014    NA Mercedes  12 gregory

I think this should be much faster than the .SD[...] solution.
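
Since the question asks for something generic, the rolling-join approach above could be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name complete_by_roll and its arguments are hypothetical (not part of data.table), it uses on= instead of a key, and it assumes the sequence column is numeric so that roll = "nearest" applies.

```r
library(data.table)

# Hypothetical helper; the name and arguments are illustrative only.
#   group_col:  column defining the groups (here "date")
#   seq_col:    numeric column whose missing values get new rows (here "num")
#   value_cols: columns that must stay NA in the added rows (here "price")
complete_by_roll <- function(dt, group_col, seq_col, value_cols) {
  keys <- c(group_col, seq_col)
  # All combinations of the group and sequence values present in the data
  grid <- CJ(unique(dt[[group_col]]), unique(dt[[seq_col]]))
  setnames(grid, keys)
  # Rolling join: added rows copy every column from the nearest seq_col row
  out <- dt[grid, on = keys, roll = "nearest"]
  # Anti-join: rows that did not exist in dt get their value columns reset to NA
  out[!dt, (value_cols) := NA, on = keys]
  out[]
}

dt <- data.table(date = rep(c(2014, 2013), each = 4),
                 price = c(3.14, 1.45, 3.4, 5.1, 1, 2.3, 2.79, 3),
                 brand = rep(c("Mercedes", "Audi"), each = 4),
                 num = c(3, 6, 7, 8, 3, 5, 9, 12),
                 seller = rep(c("gregory", "dan"), each = 4))

res <- complete_by_roll(dt, "date", "num", value_cols = "price")
```

Every column not listed in value_cols is filled from the nearest existing row within the same group, so the result is only meaningful when those columns (brand and seller here) are constant within each group.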
