简体   繁体   English

如何最好地将data.table的一列与同一data.table的另一列连接?

[英]How to best join one column of a data.table with another column of the same data.table?

My data 我的数据

I have a data.table DT with the current ( F0YR ) and the next ( F1YR ) fiscal year-end (FYE) encoded as integers. 我有一个data.table DT ,当前( F0YR )和下一个( F1YR )会计年度末(FYE)编码为整数。 Since every next FYE will eventually become a current FYE, the integer will be both in the column F1YR and F0YR . 由于每个下一个FYE最终将成为当前FYE,因此整数将在F1YRF0YR列中。 Also, my data contains monthly observations so the same FYE will be in the data set multiple times: 此外,我的数据包含每月观察结果,因此相同的FYE将多次出现在数据集中:

library(data.table)
DT <- data.table(ID     = rep(c("A", "B"), each=9),
                 MONTH  = rep(100L:108L, times=2),
                 F0YR   = rep(c(1L, 4L, 7L), each=3, times=2),
                 F1YR   = rep(c(4L, 7L, 9L), each=3, times=2),
                 value  = c(rep(1:5, each=3), 6, 6, 7),
                 key    = "ID,F0YR")
DT
      ID MONTH F0YR F1YR value
 [1,]  A   100    1    4     1
 [2,]  A   101    1    4     1
 [3,]  A   102    1    4     1
 [4,]  A   103    4    7     2
 [5,]  A   104    4    7     2
 [6,]  A   105    4    7     2
 [7,]  A   106    7    9     3
 [8,]  A   107    7    9     3
 [9,]  A   108    7    9     3
[10,]  B   100    1    4     4
[11,]  B   101    1    4     4
...

What I want to do 我想做的事

For every ID and F1YR combination, I want to get the value for the ID and F0YR combination. 对于每个IDF1YR组合,我想获得IDF0YR组合的值。 As an example: Company A had a value of 2 for FOYR==4 . 例如:对于FOYR==4公司A的值为2 Now, I want an additional column for all combinations with ID=="A" and F1YR==4 which is set to 2, next to the already existent value of 1. 现在,我希望为ID=="A"F1YR==4所有组合添加一个附加列,该列设置为2,旁边已经存在的值为1。

What I tried 我尝试了什么

intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"]
setkey(intDT, ID, F0YR)
setkey(DT, ID, F1YR)
DT <- intDT[DT]
setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR"))
DT
      ID F1YR valueNew MONTH F0YR value
 [1,]  A    4        2   100    1     1
 [2,]  A    4        2   101    1     1
 [3,]  A    4        2   102    1     1
 [4,]  A    7        3   103    4     2
 [5,]  A    7        3   104    4     2
 [6,]  A    7        3   105    4     2
 [7,]  A    9       NA   106    7     3
 [8,]  A    9       NA   107    7     3
 [9,]  A    9       NA   108    7     3
[10,]  B    4        5   100    1     4
[11,]  B    4        5   101    1     4
...

(Note that I use mult="last" here because, although the values should only change with F0YR or F1YR changes, sometimes they don't and this is just my tie breaker). (请注意,我在这里使用mult="last" ,因为虽然值只会随着F0YR或F1YR的变化而改变,但有时候它们不会改变,这只是我的领带破坏者)。

What I want 我想要的是

This looks improvable. 这看起来很容易。 First of all, I have to make a copy of my DT. 首先,我必须复制我的DT。 Second, since I join basically the same data.table , all the column names have the same name and I have to rename them. 其次,因为我基本上加入了相同的data.table ,所有列名都有相同的名称,我必须重命名它们。 I thought that a self join would be the way forward, but I tried and tried and couldn't get a nice solution. 我认为self join将是前进的方向,但我尝试并尝试过,无法得到一个好的解决方案。 I have the hope that there is something easy out there which I just don't see...Does anyone have a clue? 我希望有一些简单的东西,我只是没有看到......有没有人有线索? Or is my data set up in such a way that it is actually hard (maybe because I have monthly observations, but want to join only quarterly or yearly changing values). 或者我的数据设置方式实际上很难(可能是因为我有月度观察,但只想加入季度或年度变化值)。

In use cases like this, the mantra "aggregate first, then join with that" often helps. 在这样的用例中,口头禅“首先聚合,然后加入”通常会有所帮助。 So, starting with your DT , and using v1.8.1 : 因此,从您的DT开始,并使用v1.8.1:

> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
   ID F0YR V1
1:  A    1  1
2:  A    4  2
3:  A    7  3
4:  B    1  4
5:  B    4  5
6:  B    7  7

I called it agg because I couldn't think of a better name. 我称之为agg因为我想不出一个更好的名字。 In this case you wanted last which isn't really an aggregate as such, but you know what I mean. 在这种情况下,你想要的last并不是真正的聚合,但是你知道我的意思。

Then update DT by reference by group. 然后按组引用更新DT Here we're grouping by i . 这里我们通过分组i

setkey(DT,ID,F1YR)
DT[agg,newcol:=V1]
    ID MONTH F0YR F1YR value newcol
 1:  A   100    1    4     1      2
 2:  A   101    1    4     1      2
 3:  A   102    1    4     1      2
 4:  A   103    4    7     2      3
 5:  A   104    4    7     2      3
 6:  A   105    4    7     2      3
 7:  A   106    7    9     3     NA
 8:  A   107    7    9     3     NA
 9:  A   108    7    9     3     NA
10:  B   100    1    4     4      5
11:  B   101    1    4     4      5
12:  B   102    1    4     4      5
13:  B   103    4    7     5      7
14:  B   104    4    7     5      7
15:  B   105    4    7     5      7
16:  B   106    7    9     6     NA
17:  B   107    7    9     6     NA
18:  B   108    7    9     7     NA

Is that right? 是对的吗? Not sure I fully followed. 不确定我是否完全遵循。 Those ops should be very fast, without any copies, and should scale to large data. 那些操作应该非常快,没有任何副本,并且应该扩展到大数据。 At least, that's the intention. 至少,这是意图。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM