简体   繁体   English

R,data.table:使用计算出的信息和另一个数据表中的名称以编程方式添加列

[英]R, data.table: programatically add column using calculated info and name from another data table

I have two data tables as follows: 我有两个数据表,如下所示:

set.seed(1)
landing_fees <- data.table(`date`= c(20160101,20160101,20160101,
            20160102,20160102,20160102,20160103,20160103,20160103), 
            airport=c("LAX","PHX","HNL"), period_a_fee=sample(100:200,9),
            period_b_fee=sample(100:200,9), period_c_fee=sample(100:200,9))
routes <- data.table(origin=c(rep("LAX",4),rep("PHX",4)), 
            destination=c("PHX","HNL","PHX","HNL","LAX","HNL","LAX","HNL"),
            period=rep(c("period_a","period_a","period_b","period_c"),2))

Which output as follows: 其输出如下:

> landing_fees
       date airport period_a_fee period_b_fee period_c_fee
1: 20160101     LAX          126          106          138
2: 20160101     PHX          137          120          177
3: 20160101     HNL          156          117          192
4: 20160102     LAX          189          167          120
5: 20160102     PHX          119          137          163
6: 20160102     HNL          186          173          112
7: 20160103     LAX          197          147          125
8: 20160103     PHX          162          197          136
9: 20160103     HNL          158          192          101
> routes
   origin destination   period
1:    LAX         PHX period_a
2:    LAX         HNL period_a
3:    LAX         PHX period_b
4:    LAX         HNL period_c
5:    PHX         LAX period_a
6:    PHX         HNL period_a
7:    PHX         LAX period_b
8:    PHX         HNL period_c

The real datasets are huge with millions of origin/destination combinations and years worth of dates. 真实的数据集非常庞大,具有数百万个原点/目的地组合以及数年的日期。 I would like to add a new column to the routes table for each unique date in landing_fees$date and programatically name the column with the date and _cost (ie 20160101_cost). 我想为landing_fees $ date中的每个唯一日期在路由表中添加一个新列,并以日期和_cost(即20160101_cost)以编程方式命名该列。 Then in each new column, sum the fee for origin and the fee for the destination (using matching time period) as listed in the landing_fees table. 然后,在每个新列中,将landing_fees表中列出的起点费用和目的地费用(使用匹配的时间段)相加。 For instance, in the routes table, a new column routes$20160101_cost would be created and the value of row 1 would be: LAX period_a fee of 126 (from landing_fees table) plus PHX period_a fee of 137 (from landing_fees table) equals 263. Same calculation would be made for each origin/destination/period combination for each date. 例如,在routes表中,将创建新列routes $ 20160101_cost,第1行的值将为:LAX period_a费用为126(来自landing_fees表)加上PHX period_a费用为137(来自landing_fees表)等于263。对于每个日期的每个起点/终点/期间组合,将进行相同的计算。

Final output would look like this: 最终输出如下所示:

    origin  destination period  20160101_cost 20160102_cost 20160103_cost
1:  LAX          PHX    period_a    263          308            359
2:  LAX          HNL    period_a    282          375            355
3:  LAX          PHX    period_b    226          304            344
4:  LAX          HNL    period_c    330          232            226
5:  PHX          LAX    period_a    263          308            359
6:  PHX          HNL    period_a    293          305            320
7:  PHX          LAX    period_b    226          304            344
8:  PHX          HNL    period_c    369          275            237

Since the datasets can be fairly large, I am hoping to use efficient data.table script. 由于数据集可能很大,因此我希望使用高效的data.table脚本。 I am using data.table version 1.9.6 on Windows 7 machine. 我在Windows 7计算机上使用的是data.table版本1.9.6。

You can do this using melt and cast from data.table . 您可以使用melt和从data.table

This flattens your cost matrix for merging: 这使合并的成本矩阵变得平坦:

lf_flat <- melt(landing_fees, id.vars = c("date", "airport"), variable.name = "period")[, period := gsub("_fee", "", period)]

Then merge to the routing table, and collapse cost 然后合并到路由表中,并降低成本

DT <- merge(routes, lf_flat,by.x = c("origin","period"), by.y = c("airport","period"))
DT <- merge(DT, lf_flat, by.x = c("destination","period","date"), by.y = c("airport","period","date"))
DT[, cost := value.x + value.y][,(c("value.x","value.y")) := NULL]

Which gives you the consolidated table: 给您合并表:

DT
    destination   period     date origin cost
 1:         HNL period_a 20160101    LAX  282
 2:         HNL period_a 20160101    PHX  293
 3:         HNL period_a 20160102    LAX  375
 4:         HNL period_a 20160102    PHX  305
 5:         HNL period_a 20160103    LAX  355
 6:         HNL period_a 20160103    PHX  320
 7:         HNL period_c 20160101    LAX  330
 8:         HNL period_c 20160101    PHX  369
 9:         HNL period_c 20160102    LAX  232
10:         HNL period_c 20160102    PHX  275
11:         HNL period_c 20160103    LAX  226
12:         HNL period_c 20160103    PHX  237
13:         LAX period_a 20160101    PHX  263
14:         LAX period_a 20160102    PHX  308
15:         LAX period_a 20160103    PHX  359
16:         LAX period_b 20160101    PHX  226
17:         LAX period_b 20160102    PHX  304
18:         LAX period_b 20160103    PHX  344
19:         PHX period_a 20160101    LAX  263
20:         PHX period_a 20160102    LAX  308
21:         PHX period_a 20160103    LAX  359
22:         PHX period_b 20160101    LAX  226
23:         PHX period_b 20160102    LAX  304
24:         PHX period_b 20160103    LAX  344
    destination   period     date origin cost

I would personally leave it like this, as it gives you the same data as your desired matrix, but with more flexibility. 我个人会这样,因为它可以为您提供与所需矩阵相同的数据,但具有更大的灵活性。 To get your desired state, use: 要获得所需的状态,请使用:

dcast(DT, origin + destination + period ~ date, value.var = "cost")
   origin destination   period 20160101 20160102 20160103
1:    LAX         HNL period_a      282      375      355
2:    LAX         HNL period_c      330      232      226
3:    LAX         PHX period_a      263      308      359
4:    LAX         PHX period_b      226      304      344
5:    PHX         HNL period_a      293      305      320
6:    PHX         HNL period_c      369      275      237
7:    PHX         LAX period_a      263      308      359
8:    PHX         LAX period_b      226      304      344

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM