简体   繁体   English

计算 90 个百分位数并用 R 中的组中位数替换它

[英]calculation of 90 percentile and replacement of it by median by groups in R

Here part of data.这里是部分数据。

mydat=structure(list(code = c(123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 
123L, 123L, 123L, 123L, 123L, 123L, 123L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 
222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L, 222L), 
    item = c(234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 
    234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 
    333L, 333L, 333L, 333L, 333L, 333L), return = c(25L, 25L, 
    21L, 37L, 23L, 27L, 19L, 7L, 16L, 12L, 33L, 24L, 6L, 14L, 
    4L, 25L, 90L, 27L, 3L, 16L, 7L, 1L, 13L, 11L, 36L, 5L, 6L, 
    14L, 11L, 41L, 11L, 6L, 4L, 11L, 3L, 6L, 21L, 41L, 28L, 30L, 
    92L, 4L, 1L, 83L, 3L, 16L, 4L, 25L, 25L, 21L, 37L, 23L, 27L, 
    19L, 7L, 16L, 12L, 33L, 24L, 6L, 14L, 4L, 25L, 90L, 27L, 
    3L, 16L, 7L, 1L, 13L, 11L, 36L, 5L, 6L, 14L, 11L, 41L, 11L, 
    6L, 4L, 11L, 3L, 6L, 21L, 41L, 28L, 30L, 92L, 4L, 1L, 83L, 
    3L, 16L, 4L), action = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("code", 
"item", "return", "action"), class = "data.frame", row.names = c(NA, 
-94L))

I have 2 group vars code+item.我有 2 组变量代码+项目。 Here two groups:这里有两组:

123 234
222 333

Also i have action column.我也有行动专栏。 It can have only two values(category) zero(0) or one(1).它只能有两个值(类别)零(0)或一(1)。

i need calculate 90 percentile by zero category of action of return column, which go before one category of action.我需要通过返回列的零类动作来计算90 percentile ,它在一类动作之前。 Then i need calculate the median by zero category of action of return column, which go before one category of action.然后我需要通过返回列的零类动作来计算中median ,它在一类动作之前。 (After one we don't touch zeros) (一后我们不碰零)

These statistics must be calculate by 14 zeros preceding the one category of action这些统计数据必须由一类动作之前的 14 个零计算

Then i have to find values that more 90 percentile which was calculated above, then such values must be replce by the median which was calculated.然后我必须找到上面计算的超过 90 个百分位数的值,然后这些值必须由计算的中位数代替。

After one category of action go zero category again for return column.一类动作后,返回列再次归零类。 For it i too have to find value that more 90 percentile which was calculeted above, then such value must be replaced by the median which was calculeted above(when calculate for 14 zeros).为此,我也必须找到上面计算的更多 90 个百分位数的值,然后必须用上面计算的中位数替换该值(计算 14 个零时)。

Note the calculation is done by 14 zeros preceding the one category of action but replacing by median is done for all zero category of action and performing for each groups code+item请注意, calculation是由一类动作之前的 14 个零完成的,但replacing为中位数是针对所有zero category动作完成的,并为每个组code+item

The result can be in output column.结果可以在输出列中。

to be more clear here desired output.在这里更清楚所需的输出。

for 123+234 group the 90 perc=39,5 median=12对于123+234组,90 perc=39,5 中位数=12

for 222+333 the 90 perc=39,5 median=12对于222+333 333,90 perc=39,5 中位数=12

 code item return action output
1   123  234     25      0     25
2   123  234     25      0     25
3   123  234     21      0     21
4   123  234     37      0     16
5   123  234     23      0     23
6   123  234     27      0     27
7   123  234     19      0     19
8   123  234      7      0      7
9   123  234     16      0     16
10  123  234     12      0     12
11  123  234     33      0     33
12  123  234     24      0     24
13  123  234      6      0      6
14  123  234     14      0     14
15  123  234      4      0      4
16  123  234     25      0     25
17  123  234     90      0     **12**
18  123  234     27      0     27
19  123  234      3      0      3
20  123  234     16      0     16
21  123  234      7      0      7
22  123  234      1      0      1
23  123  234     13      0     13
24  123  234     11      0     11
25  123  234     36      0     36
26  123  234      5      0      5
27  123  234      6      0      6
28  123  234     14      0     14
29  123  234     11      0     11
30  123  234     41      0     16
31  123  234     11      1     Na
32  123  234      6      1     Na
33  123  234      4      1     Na
34  123  234     11      1     Na
35  123  234      3      0      3
36  123  234      6      0      6
37  123  234     21      0     21
38  123  234     41      0     **12**
39  123  234     28      0     28
40  123  234     30      0     30
41  123  234     92      0     **12**
42  123  234      4      0      4
43  123  234      1      0      1
44  123  234     83      0     **12**
45  123  234      3      0      3
46  123  234     16      0     16
47  123  234      4      0      4
48  222  333     25      0     25
49  222  333     25      0     25
50  222  333     21      0     21
51  222  333     37      0     16
52  222  333     23      0     23
53  222  333     27      0     27
54  222  333     19      0     19
55  222  333      7      0      7
56  222  333     16      0     16
57  222  333     12      0     12
58  222  333     33      0     33
59  222  333     24      0     24
60  222  333      6      0      6
61  222  333     14      0     14
62  222  333      4      0      4
63  222  333     25      0     25
64  222  333     90      0     **12**
65  222  333     27      0     27
66  222  333      3      0      3
67  222  333     16      0     16
68  222  333      7      0      7
69  222  333      1      0      1
70  222  333     13      0     13
71  222  333     11      0     11
72  222  333     36      0     36
73  222  333      5      0      5
74  222  333      6      0      6
75  222  333     14      0     14
76  222  333     11      0     11
77  222  333     41      0     16
78  222  333     11      1     Na
79  222  333      6      1     Na
80  222  333      4      1     Na
81  222  333     11      1     Na
82  222  333      3      0      3
83  222  333      6      0      6
84  222  333     21      0     21
85  222  333     41      0     **12**
86  222  333     28      0     28
87  222  333     30      0     30
88  222  333     92      0     **12**
89  222  333      4      0      4
90  222  333      1      0      1
91  222  333     83      0     **12**
92  222  333      3      0      3
93  222  333     16      0     16
94  222  333      4      0      4

** i marked rows where value was replced by median. ** 我标记了值被中位数替换的行。

With tidyverse :使用tidyverse

 mydat%>%
  group_by(code,item)%>%
  mutate(output=ifelse(return>quantile(return,.9) & action==0,median(return),return))

If I understand correctly, the OP wants to如果我理解正确,OP 想要

  1. compute the 90% quantile and the median for the last 14 "zero action rows" (ie, rows with action == 0 ) before the first row with action == 1 , separately for each code , item group.计算90%分位点和中值的最后14“零点行动的行”(即,与行action == 0 )的第一行与action == 1 ,分别对每个codeitem组。 (This implies that there is only one streak of action == 1 values per group. (这意味着每组只有一个连续action == 1值。
  2. Copy the return values to the output for all rows with action == 0 .action == 0所有行的return值复制到output中。
  3. Replace the output value if it is larger than the 90% quantile by the median in all zero action rows before the first action == 1 row in each group.如果output值大于 90% 分位数,则替换为每组中第一个action == 1行之前所有零操作行中的中位数。

This can be solved by updating in a non-equi join with some preparations这可以通过在非对等连接中更新并进行一些准备来解决

library(data.table)
# mark the zero acton rows before the the action period
setDT(mydat)[, zero_before := cummax(action), by = .(code, item)]
# compute median and 90% quantile for that last 14 rows before each action period 
agg <- mydat[zero_before == 0, 
             quantile(tail(return, 14L), c(0.5, 0.9)) %>% 
               as.list()  %>% 
               set_names(c("med", "q90")) %>% 
               c(.(zero_before = 0)), by = .(code, item)]
agg

   code item med  q90 zero_before
1:  123  234  12 39.5           0
2:  222  333  12 39.5           0

# append output column
mydat[action == 0, output := as.double(return)][
  # replace output values greater q90 in an update non-equi join
  agg, on = .(code, item, action, return > q90), output := as.double(med)][
    # remove helper column
    , zero_before := NULL]

As mydat has 94 rows, we will show only the updated rows:由于mydat有 94 行,我们将只显示更新的行:

mydat[return != output]
 code item return action output 1: 123 234 90 0 12 2: 123 234 41 0 12 3: 123 234 41 0 12 4: 123 234 92 0 12 5: 123 234 83 0 12 6: 222 333 90 0 12 7: 222 333 41 0 12 8: 222 333 41 0 12 9: 222 333 92 0 12 10: 222 333 83 0 12

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM