
Percentile by 2 conditions in R

I have the following data frame with 3 variables and several observations:

    data <- read.table(text="
YEAR SECTOR VALUE
2016   A      2
2016   A      5
2016   A      10
2016   A      20
2016   A      50
2016   A     100
2016   A     200
2016   A     300
2016   B      20
2016   B      50
2016   B      100
2016   B      200
2016   B      500
2016   B     1000
2016   B     2000
2016   B     3000
2017   A      21
2017   A      51
2017   A      101
2017   A      201
2017   A      501
2017   A     1001
2017   A     2001
2017   A     3001
2017   B      201
2017   B      501
2017   B      1001
2017   B      2001
2017   B      5001
2016   B     10001
2017   B     20001
2017   B     30001", 
               header=TRUE)

I would like to calculate the 1st quartile, median and 3rd quartile within each YEAR + SECTOR combination. For instance, the 1st quartile for SECTOR A and YEAR 2016 would return 5, based on (2,5,10,20,50,100,200,300).

One option would be to group by 'YEAR' and 'SECTOR', store the relevant subset of fivenum() in a tibble, unnest it, and then spread it to 'wide' format:

library(dplyr)
library(tidyr)
df1 %>%
  group_by(YEAR, SECTOR) %>% 
  # fivenum() returns min, lower hinge, median, upper hinge, max; keep elements 2 to 4
  summarise(stats = list(tibble(categ = c('1st quart', 'median', '3rd quart'), 
                                val = fivenum(VALUE)[2:4])),
            .groups = "keep") %>% 
  unnest(stats) %>%
  spread(categ, val)
# A tibble: 4 x 5
# Groups:   YEAR, SECTOR [4]
#   YEAR SECTOR `1st quart` `3rd quart` median
#  <int> <chr>        <dbl>       <dbl>  <dbl>
#1  2016 A              7.5         150     35
#2  2016 B            100          2000    500
#3  2017 A             76          1501    351
#4  2017 B            751         12501   2001

data

df1 <- structure(list(YEAR = c(2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2016L, 2017L, 2017L), SECTOR = c("A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
"B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", 
"B", "B", "B", "B", "B"), VALUE = c(2L, 5L, 10L, 20L, 50L, 100L, 
200L, 300L, 20L, 50L, 100L, 200L, 500L, 1000L, 2000L, 3000L, 
21L, 51L, 101L, 201L, 501L, 1001L, 2001L, 3001L, 201L, 501L, 
1001L, 2001L, 5001L, 10001L, 20001L, 30001L)), class = "data.frame",
row.names = c(NA, 
-32L))

How about this:

library(dplyr)
data %>% 
  group_by(SECTOR,YEAR) %>% 
  summarise(median = summary(VALUE)[3],
            q1 = summary(VALUE)[2],
            q3 = summary(VALUE)[5])

However, according to summary(), the first quartile for the example you provided should be 8.75, not 5, because summary() uses the default interpolation of quantile() (type 7).
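The different answers in this thread (5, 7.5 and 8.75) come from different definitions of a sample quartile. The following is a minimal sketch, using only the SECTOR A / YEAR 2016 values from the question, that compares the default quantile() type, the type 1 definition, and the lower hinge from fivenum(). The variable names are illustrative only.

```r
# VALUE for SECTOR A, YEAR 2016, copied from the question
x <- c(2, 5, 10, 20, 50, 100, 200, 300)

# Default (type 7) linear interpolation, also used by summary()
q_default <- unname(quantile(x, probs = 0.25))          # 8.75

# Type 1: inverse of the empirical CDF, always returns an observed value
q_type1 <- unname(quantile(x, probs = 0.25, type = 1))  # 5

# Lower hinge from Tukey's five-number summary
q_hinge <- fivenum(x)[2]                                # 7.5

print(c(type7 = q_default, type1 = q_type1, hinge = q_hinge))
```

If the expected result is 5, passing type = 1 to quantile() is one way to get it; which definition is appropriate depends on the convention you need to follow.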

probs = c(0.25, 0.5, 0.75)
ans = Reduce(function(x1, x2) merge(x1, x2, by = c("YEAR", "SECTOR")),
             lapply(probs, function(p)
                 aggregate(x = setNames(list(df1$VALUE), paste0("Q_",p)),
                           by = df1[c("YEAR", "SECTOR")],
                           FUN = function(x) quantile(x, probs = p))))
ans
#  YEAR SECTOR Q_0.25 Q_0.5 Q_0.75
#1 2016      A   8.75    35    125
#2 2016      B 100.00   500   2000
#3 2017      A  88.50   351   1251
#4 2017      B 751.00  2001  12501
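The approach above calls aggregate() once per probability and merges the results. As a more compact sketch, assuming the same data, all three probabilities can be passed to quantile() in a single aggregate() call. The data frame is rebuilt here only so the example is self-contained, and the Q_ column names simply mirror the ones used above.

```r
# Rebuild the data from the question
df1 <- data.frame(
  YEAR   = rep(c(2016L, 2017L), each = 16),
  SECTOR = rep(rep(c("A", "B"), each = 8), times = 2),
  VALUE  = c(2, 5, 10, 20, 50, 100, 200, 300,
             20, 50, 100, 200, 500, 1000, 2000, 3000,
             21, 51, 101, 201, 501, 1001, 2001, 3001,
             201, 501, 1001, 2001, 5001, 10001, 20001, 30001)
)
df1$YEAR[30] <- 2016L  # row 30 is labelled 2016 in the original data

# One call: quantile() returns three values per group,
# which aggregate() stores as a matrix column named VALUE
res <- aggregate(VALUE ~ YEAR + SECTOR, data = df1,
                 FUN = quantile, probs = c(0.25, 0.5, 0.75))

# Flatten the matrix column into ordinary columns
ans <- cbind(res[c("YEAR", "SECTOR")], as.data.frame(res$VALUE))
names(ans)[3:5] <- c("Q_0.25", "Q_0.5", "Q_0.75")
ans <- ans[order(ans$YEAR, ans$SECTOR), ]
print(ans)
```

Any extra arguments after FUN are forwarded to quantile(), so a different type could be requested the same way.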

Another method is to use the quantile() function with dplyr:

library(dplyr)

data %>% 
  group_by(SECTOR, YEAR) %>% 
  # quantile() returns 0%, 25%, 50%, 75%, 100%, so the quartiles are elements 2 to 4
  summarize(q1 = quantile(VALUE)[2], 
            median = quantile(VALUE)[3], 
            q3 = quantile(VALUE)[4])

##   SECTOR  YEAR     q1 median    q3
##   <fct>  <int>  <dbl>  <dbl> <dbl>
## 1 A       2016   8.75     35   125
## 2 A       2017  88.5     351  1251
## 3 B       2016 100       500  2000
## 4 B       2017 751      2001 12501
