Percentile by 2 conditions in R
I have the following data frame with 3 variables and several observations:
data <- read.table(text="
YEAR SECTOR VALUE
2016 A 2
2016 A 5
2016 A 10
2016 A 20
2016 A 50
2016 A 100
2016 A 200
2016 A 300
2016 B 20
2016 B 50
2016 B 100
2016 B 200
2016 B 500
2016 B 1000
2016 B 2000
2016 B 3000
2017 A 21
2017 A 51
2017 A 101
2017 A 201
2017 A 501
2017 A 1001
2017 A 2001
2017 A 3001
2017 B 201
2017 B 501
2017 B 1001
2017 B 2001
2017 B 5001
2016 B 10001
2017 B 20001
2017 B 30001",
header=TRUE)
I would like to calculate the 1st quartile, median and 3rd quartile within each YEAR + SECTOR combination. For instance, the 1st quartile of SECTOR A and YEAR 2016 would return 5, as based on (2, 5, 10, 20, 50, 100, 200, 300).
One option would be to group by 'YEAR' and 'SECTOR', store the subset of fivenum in a tibble, unnest, and then spread it to 'wide' format.
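library(dplyr)
library(tidyr)
df1 %>%
  group_by(YEAR, SECTOR) %>%
  # group_modify() keeps the grouping columns; in current dplyr, group_map() returns a plain list
  group_modify(~ .x %>%
    summarise(stats = list(tibble(categ = c('1st quart', 'median', '3rd quart'),
                                  val = fivenum(VALUE)[2:4])))) %>%
  # name the list-column explicitly, as required by newer tidyr versions
  unnest(stats) %>%
  spread(categ, val)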
# A tibble: 4 x 5
# Groups: YEAR, SECTOR [4]
# YEAR SECTOR `1st quart` `3rd quart` median
# <int> <chr> <dbl> <dbl> <dbl>
#1 2016 A 7.5 150 35
#2 2016 B 100 2000 500
#3 2017 A 76 1501 351
#4 2017 B 751 12501 2001
df1 <- structure(list(YEAR = c(2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2016L, 2017L, 2017L), SECTOR = c("A",
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
"B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "B", "B"), VALUE = c(2L, 5L, 10L, 20L, 50L, 100L,
200L, 300L, 20L, 50L, 100L, 200L, 500L, 1000L, 2000L, 3000L,
21L, 51L, 101L, 201L, 501L, 1001L, 2001L, 3001L, 201L, 501L,
1001L, 2001L, 5001L, 10001L, 20001L, 30001L)), class = "data.frame",
row.names = c(NA,
-32L))
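In more recent tidyr releases, spread has been superseded by pivot_wider. The following is a minimal sketch of the same group, unnest, and reshape idea, assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later. The data frame `toy` and the column names `q1`, `median`, and `q3` are made up for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the data from the question)
toy <- data.frame(
  YEAR   = c(2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L),
  SECTOR = "A",
  VALUE  = c(1, 2, 3, 4, 10, 20, 30, 40)
)

res <- toy %>%
  group_by(YEAR, SECTOR) %>%
  # fivenum() returns min, lower hinge, median, upper hinge, max;
  # elements 2:4 are kept in a one-row list-column per group
  summarise(stats = list(tibble(categ = c("q1", "median", "q3"),
                                val   = fivenum(VALUE)[2:4])),
            .groups = "drop") %>%
  unnest(stats) %>%
  # one column per statistic, in order of first appearance
  pivot_wider(names_from = categ, values_from = val)

print(res)
```

Unlike spread, pivot_wider keeps the new columns in order of appearance rather than sorting them alphabetically, so `q1`, `median`, `q3` come out in that order.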
How about this:
library(dplyr)
data %>%
group_by(SECTOR,YEAR) %>%
summarise(median = summary(VALUE)[3],
q1 = summary(VALUE)[2],
q3 = summary(VALUE)[5])
However, according to summary(), the first quartile for the example you provided should be 8.75, not 5.
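The different first-quartile values that appear in this thread (5, 7.5 and 8.75) most likely come from different quantile definitions rather than from a calculation error. A small sketch of this, using the SECTOR A / YEAR 2016 values from the question and the `type` argument of base R's quantile():

```r
# SECTOR A, YEAR 2016 values from the question
x <- c(2, 5, 10, 20, 50, 100, 200, 300)

# type = 1: inverse of the empirical CDF, no interpolation
q_type1 <- unname(quantile(x, 0.25, type = 1))  # 5

# type = 2: like type 1, but averages at discontinuities
q_type2 <- unname(quantile(x, 0.25, type = 2))  # 7.5

# type = 7: R's default (also what summary() uses), linear interpolation
q_type7 <- unname(quantile(x, 0.25, type = 7))  # 8.75

# fivenum() reports Tukey's lower hinge, which here equals the type 2 value
hinge <- fivenum(x)[2]                          # 7.5

print(c(type1 = q_type1, type2 = q_type2, type7 = q_type7, hinge = hinge))
```

So if the expected result is 5, passing `type = 1` to quantile() would be one way to get it. Which definition is appropriate depends on the convention you need to follow.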
probs = c(0.25, 0.5, 0.75)
ans = Reduce(function(x1, x2) merge(x1, x2, by = c("YEAR", "SECTOR")),
lapply(probs, function(p)
aggregate(x = setNames(list(df1$VALUE), paste0("Q_",p)),
by = df1[c("YEAR", "SECTOR")],
FUN = function(x) quantile(x, probs = p))))
ans
# YEAR SECTOR Q_0.25 Q_0.5 Q_0.75
#1 2016 A 8.75 35 125
#2 2016 B 100.00 500 2000
#3 2017 A 88.50 351 1251
#4 2017 B 751.00 2001 12501
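A possibly more compact base R variant is to let aggregate() call quantile() once with all three probabilities, using the formula interface. This is only a sketch on made-up data (`toy` is hypothetical). Note that the result stores the quantiles in a single matrix column, which may need flattening.

```r
# Hypothetical toy data (not the data from the question)
toy <- data.frame(
  YEAR   = c(2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L),
  SECTOR = "A",
  VALUE  = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Extra arguments (probs) are passed on to quantile()
agg <- aggregate(VALUE ~ YEAR + SECTOR, data = toy,
                 FUN = quantile, probs = c(0.25, 0.5, 0.75))

# agg$VALUE is a matrix with columns "25%", "50%", "75%";
# do.call(data.frame, ...) turns it into ordinary columns
flat <- do.call(data.frame, agg)

print(flat)
```

The Reduce/merge approach above gives direct control over the column names, whereas this variant relies on the names that quantile() and data.frame() generate.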
Another method is using the quantile() function and dplyr:
library(dplyr)
data %>%
  group_by(SECTOR, YEAR) %>%
  # quantile() returns 0%, 25%, 50%, 75%, 100% by default,
  # so request each probability explicitly instead of indexing [1], [2], [3]
  summarize(q1 = quantile(VALUE, 0.25),
            median = quantile(VALUE, 0.5),
            q3 = quantile(VALUE, 0.75))
## SECTOR YEAR q1 median q3
## <fct> <int> <dbl> <dbl> <dbl>
## 1 A 2016 8.75 35 125
## 2 A 2017 88.5 351 1251
## 3 B 2016 100 500 2000
## 4 B 2017 751 2001 12501