Compute mean and nmiss by group with multiple combinations in R, data.table
I'd like to compute the mean and count the number of NAs across several group combinations in a large dataset. This is probably easiest to explain with some test data. I'm using the latest version of R on a MacBook Pro, and the data.table package (the data is large, >1M rows). (Note: I noticed after posting that I accidentally used sum() instead of mean() for the "m = " variables below. I haven't edited it because I don't want to re-run everything, and I don't think it matters that much.)
library(data.table)
set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
ID2 = data.table(id2 = 20001:20050)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
DT <- DT[,as.list(ID2), by = .(yr, id)] # intentional cartesian join
rm("YR","ID","ID2")
# 2.7M obs, now add data
DT[,`:=` (ratio = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio %% 5) == 0, ratio:=NA] # make some of the ratios NA
DT[,`:=` (keep = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
# do it again
DT[,`:=` (ratio2 = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio2 %% 4) == 0, ratio2:=NA] # make some of the ratios NA
DT[,`:=` (keep2 = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
So, what I have is identifying info (yr, id, id2) and the data I want to summarize: keep, keep2, ratio and ratio2. Specifically, by yr-id, I want to compute the average of ratio and ratio2 using keep and keep2 (thus collapsing id2). I've thought of doing this either by subsetting on keep/keep2 and then computing on ratio and ratio2, or by multiplying element-wise: keep*ratio, keep2*ratio, keep*ratio2 and keep2*ratio2.
First, the way I'm doing this that gets the right answer, but is slow:
system.time(test1 <- DT[,.SD[keep == 1,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.083 0.191 23.319
This works just as well, in about the same time. I thought it might be faster to subset the main data first rather than within .SD:
system.time(test2 <- DT[keep == 1,.SD[,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.723 0.208 23.963
The problem with either of these approaches is that I need to do separate computations for each keep variable. Thus I tried this way:
system.time(test3 <- DT[,.SD[,.( m = sum(ratio*keep,na.rm = TRUE),
nmiss = sum(is.na(ratio*keep)) )
],by=.(yr,id)])
user system elapsed
25.997 0.191 26.217
This allows me to put all the formulas together (I could add in ratio*keep2, ratio2*keep and ratio2*keep2), but 1. it is slower and 2. it is not getting the correct number of NAs (see the nmiss column):
> summary(test1)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test2)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test3)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. : 0.00
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.: 2.00
Median :1988 Median :10500 Median : 53.828 Median : 4.00
Mean :1988 Mean :10500 Mean : 59.653 Mean : 4.99
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.: 8.00
Max. :2015 Max. :11000 Max. :211.552 Max. :20.00
What is the fastest way to get my 4 combinations of summarized info by yr-id? Right now, I'm using option 1 or 2 repeated twice (once for keep, again for keep2).
You can do the summarization directly in the j expression:
# solution A: summarize in `.SD`:
system.time({
test2 <- DT[keep == 1,
.SD[, .(m = sum(ratio, na.rm = TRUE),
nmiss = sum(is.na(ratio)))],
by = .(yr, id), verbose = T]
})
# user system elapsed
# 22.359 0.439 22.561
# solution B: summarize directly in j:
system.time({
test2 <- DT[keep == 1, .(m = sum(ratio, na.rm = T),
nmiss = sum(is.na(ratio))),
by = .(yr, id), verbose = T]
})
# user system elapsed
# 0.118 0.077 0.195
verbose = T is added to show the difference between the two approaches.
For solution A:
lapply optimization is on, j unchanged as '.SD[, list(m = sum(ratio, na.rm = TRUE), nmiss = sum(is.na(ratio)))]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
collecting discontiguous groups took 0.058s for 54000 groups
eval(j) took 22.487s for 54000 calls
22.521 secs
For solution B:
...
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as 'list(sum(ratio, na.rm = T), sum(is.na(ratio)))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 0.027s for 54000 groups
eval(j) took 0.079s for 54000 calls
0.168 secs
The main difference is that in solution A the result of j is a named list that is rebuilt for every group, which is extremely slow when there are many groups (54k groups for this data!). In solution B, j is a plain list(...) call, so data.table can detect the names, remove them, and put them back once after grouping has completed. For a similar benchmark of this type, see this one.
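As a sanity check that the two forms differ only in speed, here is a minimal sketch on a small made-up table. The `toy` data and its values are assumptions for illustration, not the question's data, and no timings are claimed for it.

```r
library(data.table)

# Toy data (hypothetical, far smaller than the question's 2.7M rows)
set.seed(1)
toy <- data.table(
  yr    = rep(2001:2003, each = 20),
  id    = rep(rep(1:4, each = 5), times = 3),
  ratio = rnorm(60),
  keep  = rep(c(1L, 0L, 1L), times = 20)
)
toy[c(3, 17, 42), ratio := NA_real_]  # inject a few missing values

# Form A: summarize inside .SD (j returns a named list for every group)
res_a <- toy[keep == 1,
             .SD[, .(m = sum(ratio, na.rm = TRUE), nmiss = sum(is.na(ratio)))],
             by = .(yr, id)]

# Form B: summarize directly in j
res_b <- toy[keep == 1,
             .(m = sum(ratio, na.rm = TRUE), nmiss = sum(is.na(ratio))),
             by = .(yr, id)]

# Compare the two results
identical_results <- isTRUE(all.equal(res_a, res_b))
print(identical_results)
```

If the comparison holds on your own data as well, form B can replace form A without changing any downstream code.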
For the second part (your test3): the rows are not filtered by keep == 1 first, so the NAs in rows where keep != 1 are also counted in nmiss (NA * 0 is still NA). Therefore, the NA counts differ.
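One way to get all four combinations in a single grouped pass, while keeping the NA counts correct, is to subset each vector inside j instead of multiplying by the indicator. The sketch below is an assumption about what fits the question, not something benchmarked on the original data; the `toy` table and the output column names (`m_r1_k1` and so on) are made up for illustration. It uses sum() as in the question; swap in mean() if the average is what you want.

```r
library(data.table)

# Toy data (hypothetical): a single yr-id group with four id2 rows
toy <- data.table(
  yr     = 2000L,
  id     = 1L,
  ratio  = c(NA, 2, NA, 4),
  ratio2 = c(1, NA, 3, 5),
  keep   = c(1L, 1L, 0L, 0L),
  keep2  = c(0L, 1L, 1L, 1L)
)

# NA * 0 is NA, so multiplying by keep also counts NAs from excluded rows
n_mult   <- toy[, sum(is.na(ratio * keep))]      # rows 1 and 3 are counted
n_subset <- toy[, sum(is.na(ratio[keep == 1]))]  # only row 1 is counted

# All four ratio/keep combinations, summarized directly in j
res <- toy[, .(
  m_r1_k1     = sum(ratio[keep == 1],   na.rm = TRUE),
  nmiss_r1_k1 = sum(is.na(ratio[keep == 1])),
  m_r1_k2     = sum(ratio[keep2 == 1],  na.rm = TRUE),
  nmiss_r1_k2 = sum(is.na(ratio[keep2 == 1])),
  m_r2_k1     = sum(ratio2[keep == 1],  na.rm = TRUE),
  nmiss_r2_k1 = sum(is.na(ratio2[keep == 1])),
  m_r2_k2     = sum(ratio2[keep2 == 1], na.rm = TRUE),
  nmiss_r2_k2 = sum(is.na(ratio2[keep2 == 1]))
), by = .(yr, id)]
print(res)
```

One difference from filtering in i: a yr-id group with no keep == 1 rows is dropped by DT[keep == 1, ...], whereas here it is retained with a sum of 0 and an nmiss of 0.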