简体   繁体   English

你能解释这个by-group data.table结果吗?

[英]Can you explain this by-group data.table result

I must be doing something obviously stupid here but can someone explain why it appears like data.table is not doing the following by group operation 我必须在这里做一些明显愚蠢的事情,但有人可以解释为什么看起来像data.table没有通过组操作执行以下操作

set.seed(1)
DT = data.table(grp=c(rep('a',100),rep('b',100)), val=c(runif(100), rnorm(100)))
DT[grp=='a',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

          10%    20%    30%    40%    50%    60%    70%    80%    90%        
  -Inf 0.1415 0.2555 0.3448 0.4108 0.4878 0.6442 0.7140 0.7842 0.8703    Inf 

DT[grp=='b',c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]

              10%      20%      30%      40%      50%      60%      70%      80%      90%          
    -Inf -1.22751 -0.66000 -0.55036 -0.32170 -0.11762  0.06583  0.37427  0.69183  1.35196      Inf 

DT[,interval:=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)),.(grp)][]

     grp     val        interval
  1:   a  0.2655   (-0.66,-0.55] => this is a "b" interval ? I would expect (0.2555 0.3448]
  2:   a  0.3721  (-0.55,-0.322]
  3:   a  0.5729 (-0.118,0.0658]
  4:   a  0.9082     (1.35, Inf]
  5:   a  0.2017   (-1.23,-0.66]
 ---                            
196:   b -0.7508   (-1.23,-0.66]
197:   b  2.0872     (1.35, Inf]
198:   b  0.0174 (-0.118,0.0658]
199:   b -1.2863    (-Inf,-1.23]
200:   b -1.6406    (-Inf,-1.23]

I typically want to do things like : 我通常想做的事情如下:

DT[,mean(val),keyby=.(grp,interval=cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))]
    grp        interval          V1
 1:   a (-0.321,0.0379]  0.01836077  => this is not a "a" interval
 2:   a   (0.0379,0.21]  0.13190935
 3:   a    (0.21,0.358]  0.29068707
 4:   a   (0.358,0.477]  0.41647597
 5:   a   (0.477,0.648]  0.55190648
 6:   a   (0.648,0.777]  0.70883795
 7:   a   (0.777,0.915]  0.84091210
 8:   a    (0.915, Inf]  0.95797615
 9:   b   (-Inf,-0.657] -1.23322909
10:   b (-0.657,-0.321] -0.53243898
11:   b (-0.321,0.0379] -0.13968720
12:   b   (0.0379,0.21]  0.11278201
13:   b    (0.21,0.358]  0.30783459
14:   b   (0.358,0.477]  0.40695489
15:   b   (0.477,0.648]  0.55976052
16:   b   (0.648,0.777]  0.70483170
17:   b   (0.777,0.915]  0.91017423
18:   b    (0.915, Inf]  1.57112705

And this looks suspiciously like if the intervals were defined on the WHOLE dataset instead of the groups 这看起来很奇怪,如果间隔是在WHOLE数据集而不是组上定义的

DT[,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)]
                    10%         20%         30%         40%         50%         60%         70%         80%         90%             
       -Inf -0.65729223 -0.32084835  0.03788176  0.20967534  0.35835115  0.47738589  0.64820328  0.77734560  0.91505885         Inf 

It looks like you are expecting a fancy way of combining factor levels (which is what cut creates) across groups. 看起来你期待一种奇妙的方式来组合各组的因子水平(这是cut创造的)。 Instead, you found strange behavior, which is typical for factors. 相反,你发现了奇怪的行为,这是典型的因素。

I guess you could use strings: 我想你可以使用字符串:

DT[,interval := 
  as.character(cut(val,c(-Inf,quantile(val,probs=seq(.1,.9,.1)),Inf)))
, by=grp]

which gives 这使

     grp         val        interval
  1:   a  0.26550866   (0.256,0.345]
  2:   a  0.37212390   (0.345,0.411]
  3:   a  0.57285336   (0.488,0.644]
  4:   a  0.90820779     (0.87, Inf]
  5:   a  0.20168193   (0.142,0.256]
 ---                                
196:   b -0.75081900   (-1.23,-0.66]
197:   b  2.08716655     (1.35, Inf]
198:   b  0.01739562 (-0.118,0.0658]
199:   b -1.28630053    (-Inf,-1.23]
200:   b -1.64060553    (-Inf,-1.23]

These intervals are not good for much of anything, however. 然而,这些间隔对于任何东西都不好。 If you try to sort by them, like DT[, mean(val), keyby=.(grp, interval)] , you'll see that they are out of order. 如果你尝试按它们排序,比如DT[, mean(val), keyby=.(grp, interval)] ,你会发现它们没有问题。


If you just want these cuts around for a single calculation... 如果你只是希望这些削减进行一次计算......

mycut = function(x) cut(x,c(-Inf,quantile(x,probs=seq(.1,.9,.1)),Inf))

DT[,{
    .SD[, mean(val), keyby=.(interval=mycut(val))][, interval := as.character(interval)]
},keyby=grp]

which gives 这使

    grp        interval          V1
 1:   a    (-Inf,0.142]  0.07670249
 2:   a   (0.142,0.256]  0.20584852
 3:   a   (0.256,0.345]  0.30715649
 4:   a   (0.345,0.411]  0.38583465
 5:   a   (0.411,0.488]  0.45901975
 6:   a   (0.488,0.644]  0.56413855
 7:   a   (0.644,0.714]  0.67442643
 8:   a   (0.714,0.784]  0.75834958
 9:   a    (0.784,0.87]  0.82747749
10:   a     (0.87, Inf]  0.91951669
11:   b    (-Inf,-1.23] -1.54198329
12:   b   (-1.23,-0.66] -0.92447488
13:   b   (-0.66,-0.55] -0.61458549
14:   b  (-0.55,-0.322] -0.45029247
15:   b (-0.322,-0.118] -0.22533466
16:   b (-0.118,0.0658] -0.01587467
17:   b  (0.0658,0.374]  0.24836075
18:   b   (0.374,0.692]  0.53061032
19:   b    (0.692,1.35]  1.01688411
20:   b     (1.35, Inf]  1.80089535

Yeah, not very elegant, but I think this is a problem from R itself, and it's not obvious how it should change to address your case. 是的,不是很优雅,但我认为这是R本身的一个问题,并且它应该如何改变来解决你的情况并不明显。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM