[英]creating new column after joining two data.tables
I have two data.tables, main
and metrics
, both keyed by cid
I want to add to table main
the average of each of several values located in metrics. 我有两个data.tables,
main
和metrics
,都由cid
键入我想要向表main
添加指标中每个值的平均值。
However, I would like to filter by code
, only averaging those rows in metrics
with a given code
. 但是,我想按
code
进行过滤,仅使用给定code
对metrics
行进行平均。
> metrics
cid code DZ value1 value2
1: 1001 A 101 8 21
2: 1001 B 102 11 26
3: 1001 A 103 17 25
4: 1002 A 104 25 39
5: 1002 B 105 6 30
6: 1002 A 106 23 40
7: 1003 A 107 27 32
8: 1003 B 108 16 37
9: 1003 A 109 14 42
# DESIRED OUTPUT
> main
cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001 12.5 23.0 11 26
2: 1002 24.0 39.5 6 30
3: 1003 20.5 37.0 16 37
# SAMPLE DATA
set.seed(1)
main <- data.table(cid=1e3+1:3, key="cid")
metrics <- data.table(cid=rep(1e3+1:3, each=3), code=rep(c("A", "B", "A"), 3), DZ=101:109, value1=sample(30, 9), value2=sample(20:50, 9), key="cid")
code.filters <- c("A", "B")
These lines get the desired output, but I am having difficulty assigning the new col back into main. 这些行获得了所需的输出,但是我很难将新的col分配回main。 (also, doing it programatically would be preferred).
(另外,以编程方式进行是首选)。
main[metrics[code==code.filters[[1]]]][, list(mean(c(value1))), by=cid]
main[metrics[code==code.filters[[1]]]][, list(mean(c(value2))), by=cid]
main[metrics[code==code.filters[[2]]]][, list(mean(c(value1))), by=cid]
main[metrics[code==code.filters[[1]]]][, list(mean(c(value2))), by=cid]
Additionally, can someone explain why the following line only takes the last value in each group? 此外,有人可以解释为什么以下行只取每组中的最后一个值?
main[metrics[ code=="A"], A.avg.val1 := mean(c(value1))]
You don't need main
. 你不需要
main
。 You can get it directly from metrics
as follows: 您可以直接从
metrics
获取它,如下所示:
> tmp.dt <- metrics[, list(A.avg.val1 = mean(value1[code=="A"]),
A.avg.val2 = mean(value2[code=="A"]),
B.avg.val1 = mean(value1[code == "B"]),
B.avg.val2 = mean(value2[code == "B"])), by=cid]
# cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001 12.5 23.0 11 26
# 2: 1002 24.0 39.5 6 30
# 3: 1003 20.5 37.0 16 37
If you still want to subset with main
just do: 如果您仍想使用
main
进行子集,请执行以下操作:
main <- data.table(cid = c(1001:1002))
> tmp.dt[main]
# cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001 12.5 23.0 11 26
# 2: 1002 24.0 39.5 6 30
I would do this in two steps. 我会分两步完成。 First, get your means, then
reshape
the data 首先,获取方法,然后
reshape
数据
foo <- main[metrics]
bar <- foo[, list(val1 = mean(value1),
val2 = mean(value2)),
by=c('cid', 'code')]
library(reshape2)
bar.melt <- melt(bar, id.var=c('cid', 'code'))
dcast(data=bar.melt,
cid ~ code + variable)
But really, I'd probably leave the data in the "long" format because I find it much easier to work with! 但实际上,我可能会将数据保留为“长”格式,因为我发现它更容易使用!
working off of @Arun's answer, the following gets the desired results: 根据@ Arun的回答,以下内容得到了预期的结果:
invisible(
sapply(code.filters, function(cf)
main[metrics[code==cf, list(avgv1 = mean(value1), avgv2 = mean(value2)), by=cid],
paste0(cf, c(".avg.val1", ".avg.val2")) :=list(avgv1, avgv2)]
))
> main
cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001 12.5 23.0 11 26
2: 1002 24.0 39.5 6 30
3: 1003 20.5 37.0 16 37
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.