Loop in R data.table: Create groups by variable distribution, then compute means by new groups

I have a set of about 21 variables that I want to loop through, doing the following for each:
1. Split the observations into 10 groups within each year, where the groups are determined by the deciles of that year's distribution.
2. Compute means (equal-weighted and weighted) within those new groups.

Test data:

library(data.table)
set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
rm("YR","ID")
# 54,000 obs; now add data
DT[,`:=` (ratio = rep(sample(10),each=5400)+rnorm(nrow(DT)),
          ratio2 = rep(sample(5),each=10800)+rnorm(nrow(DT)),
          weight = abs(rnorm(nrow(DT)))*100,
          val = rnorm(nrow(DT))
            )]
DT
         yr    id     ratio   ratio2    weight        val
    1: 1962 10001  6.689275 4.895357 129.10487 -0.2022073
    2: 1962 10002  4.718753 4.505419 140.70420 -0.0887587
    3: 1962 10003  5.786855 4.359488 242.10988  0.9511465
    4: 1962 10004  7.896540 4.049974  89.23235 -1.3822148
    5: 1962 10005  7.776863 2.233036 177.79650 -1.0671091
   ---                                                   
53996: 2015 10996 10.613272 3.345091 153.81424  0.9269429
53997: 2015 10997 11.260932 1.804315  15.68129 -1.6618414
53998: 2015 10998  8.591909 3.332643 134.80929 -1.1632596
53999: 2015 10999  9.143039 3.012160 178.77301 -0.4761060
54000: 2015 11000  7.470945 4.068919 121.13470 -1.7594423

So, I'd like to loop through ratio, then ratio2, and so on, computing the deciles of each and then summarizing val by each of those newly computed deciles. Note that these are not numbered variables, so I can't recreate the names with paste() and a 1:21 vector. First, I wrote this function to do the grouping:

# [function] pctl.grp - assign each value to a group based on percentile breakpoints
# dat: numeric vector; grp: number of groups (e.g. 10 for deciles)
pctl.grp <- function(dat, grp) {
  bp <- quantile(dat, probs = c(0,seq(100/grp,100,100/grp))/100)
  cut(dat,bp,labels = FALSE, include.lowest = TRUE)
}
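
To see what pctl.grp returns, here is a small check on a hypothetical vector 1:20 split into 4 groups (the input is chosen only for illustration). One caveat worth keeping in mind: cut() stops with a "'breaks' are not unique" error when heavy ties make two quantile breakpoints coincide, so this sketch assumes mostly distinct values.

```r
# Same helper as above, repeated so the snippet runs on its own
pctl.grp <- function(dat, grp) {
  bp <- quantile(dat, probs = c(0, seq(100/grp, 100, 100/grp))/100)
  cut(dat, bp, labels = FALSE, include.lowest = TRUE)
}

x <- 1:20                 # hypothetical input, not from the test data
grp4 <- pctl.grp(x, 4)    # integer group code (1-4) for each element
table(grp4)               # counts per group: five observations in each
```

The breakpoints here are 1, 5.75, 10.5, 15.25 and 20, so values 1-5 fall in group 1, 6-10 in group 2, and so on.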

Then I can do one iteration like this:

# adds in new variable containing 10 groups numbered 1-10
DT[,ratiogrp := lapply(.SD, pctl.grp, 10), by = .(yr), .SDcols = c("ratio")]

DT[,.(ewval = mean(val), 
      ewratio = mean(ratio),
      vwval = weighted.mean(val, weight, na.rm = TRUE), 
      vwratio = weighted.mean(ratio, weight, na.rm = TRUE))  ,by=ratiogrp][order(ratiogrp)]

Which gives the desired result:

    ratiogrp        ewval  ewratio        vwval  vwratio
 1:        1 -0.027994385 3.576939 -0.039512050 3.572319
 2:        2 -0.001146009 4.329835  0.005093692 4.331433
 3:        3 -0.009087386 4.784103 -0.012764902 4.767494
 4:        4 -0.014961467 5.094431 -0.015464918 5.110614
 5:        5  0.014705294 5.373705  0.015276699 5.364962
 6:        6 -0.010195630 5.645182 -0.014102394 5.618484
 7:        7  0.001297953 5.949583 -0.012839401 5.925634
 8:        8 -0.009300910 6.265297 -0.007141404 6.263371
 9:        9  0.012970539 6.651047  0.018474949 6.684825
10:       10  0.003841495 7.363449 -0.004225650 7.351828

But how do I do this 21 times, looping through each variable? I can easily get the names of my variables like this:

> grep(c("ratio"), names(DT))
[1] 3 4
> names(DT)[grep(c("ratio"), names(DT))]
[1] "ratio"  "ratio2"

So I think a for (z in 1:length(namelist)) {} loop or something similar would work. But I'm not sure how to then reference those names (or numbers) within the data.table structure to recreate what I did above.
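
A minimal sketch of one way to loop over the variable names directly, assuming the pctl.grp helper from above and a much smaller stand-in for the test data. The data sizes, the vars vector, the temporary column name x and the result name res are illustrative and not part of the original post. Each variable is copied to a common column name, grouped into deciles by year, and summarised; rbindlist() then stacks the per-variable results.

```r
library(data.table)

# Decile helper from the question, repeated so the snippet runs on its own
pctl.grp <- function(dat, grp) {
  bp <- quantile(dat, probs = c(0, seq(100/grp, 100, 100/grp))/100)
  cut(dat, bp, labels = FALSE, include.lowest = TRUE)
}

# Small stand-in for the test data (hypothetical sizes, same column names)
set.seed(4)
DT <- CJ(yr = 2001:2003, id = 1:200)
DT[, `:=`(ratio  = rnorm(.N, 5),
          ratio2 = rnorm(.N, 3),
          weight = abs(rnorm(.N)) * 100,
          val    = rnorm(.N))]

# Names of the variables to loop over
vars <- grep("^ratio", names(DT), value = TRUE)

res <- rbindlist(lapply(setNames(nm = vars), function(v) {
  # Copy the needed columns and give the current variable a fixed name
  tmp <- DT[, c("yr", "weight", "val", v), with = FALSE]
  setnames(tmp, v, "x")
  # Decile group within each year
  tmp[, grp := pctl.grp(x, 10), by = yr]
  # Equal-weighted and weighted means by decile group
  tmp[, .(ewval   = mean(val),
          ewratio = mean(x),
          vwval   = weighted.mean(val, weight),
          vwratio = weighted.mean(x, weight)), keyby = grp]
}), idcol = "variable")

res
```

Because the subset is a copy, DT itself is left unchanged, and adding further column names to vars extends the result without touching the loop body.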

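One option is to go to long format, so that all the ratio variables are stacked in a single column and handled by one grouped operation: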

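# patterns("ratio") picks up every column whose name contains "ratio",
# so run this on a DT that does not yet have the ratiogrp column from above
mDT = melt(DT, measure.vars = patterns("ratio"), value.name = "ratio")
# sort so that rows are in ascending ratio order within each variable and year
setorder(mDT, variable, yr, ratio)
# cut the now-consecutive row numbers of each group into 10 equal ranges
mDT[, dec := cut(.I, 10, labels = FALSE), by=.(yr, variable)]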

mDT[, .(
  mval = mean(val), 
  mrat = mean(ratio), 
  wmval = weighted.mean(val, weight), 
  wmrat = weighted.mean(ratio, weight)
), keyby=.(variable, dec)]

    variable dec          mval     mrat        wmval    wmrat
 1:    ratio   1 -0.0279943849 3.576939 -0.039512050 3.572319
 2:    ratio   2 -0.0011460087 4.329835  0.005093692 4.331433
 3:    ratio   3 -0.0090873863 4.784103 -0.012764902 4.767494
 4:    ratio   4 -0.0149614666 5.094431 -0.015464918 5.110614
 5:    ratio   5  0.0147052939 5.373705  0.015276699 5.364962
 6:    ratio   6 -0.0101956297 5.645182 -0.014102394 5.618484
 7:    ratio   7  0.0012979528 5.949583 -0.012839401 5.925634
 8:    ratio   8 -0.0093009096 6.265297 -0.007141404 6.263371
 9:    ratio   9  0.0129705386 6.651047  0.018474949 6.684825
10:    ratio  10  0.0038414948 7.363449 -0.004225650 7.351828
11:   ratio2   1 -0.0120823787 1.195964 -0.016154551 1.199026
12:   ratio2   2 -0.0072534833 1.904354 -0.030409684 1.908494
13:   ratio2   3 -0.0283728080 2.282277 -0.028168936 2.301685
14:   ratio2   4 -0.0068901529 2.590815  0.002836866 2.585152
15:   ratio2   5 -0.0035769658 2.880104  0.002391468 2.872702
16:   ratio2   6  0.0087575593 3.147469  0.004565452 3.134459
17:   ratio2   7 -0.0052354409 3.412187 -0.005866282 3.426711
18:   ratio2   8  0.0123337036 3.704371  0.009488475 3.701694
19:   ratio2   9  0.0027419978 4.071582 -0.008958386 4.076264
20:   ratio2  10 -0.0002925368 4.786477  0.003691116 4.772209
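
The rows for ratio match the output in the question. A rough sketch of why: once the rows of a group are sorted, cutting the consecutive row numbers into 10 equal ranges places the breaks at the same rank positions as quantile()'s default method, so both approaches assign the same groups when the values are distinct. The toy table and the column names dec_rank and dec_q below are hypothetical, and with tied values the two methods can differ.

```r
library(data.table)

set.seed(1)
# Hypothetical toy data: two years, 50 distinct values each
toy <- data.table(yr = rep(2000:2001, each = 50), x = rnorm(100))
setorder(toy, yr, x)

# Rank-based deciles: cut the consecutive row numbers of each year
toy[, dec_rank := cut(.I, 10, labels = FALSE), by = yr]

# Quantile-based deciles, as in the question's pctl.grp()
toy[, dec_q := cut(x, quantile(x, probs = seq(0, 1, 0.1)),
                   labels = FALSE, include.lowest = TRUE), by = yr]

# Group sizes per year and decile, and whether the two methods agree
toy[, .N, keyby = .(yr, dec_rank)]
toy[, all(dec_rank == dec_q)]
```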
