Percentile by 2 conditions in R
I have the following data frame with 3 variables and several observations:
data <- read.table(text="
YEAR SECTOR VALUE
2016 A 2
2016 A 5
2016 A 10
2016 A 20
2016 A 50
2016 A 100
2016 A 200
2016 A 300
2016 B 20
2016 B 50
2016 B 100
2016 B 200
2016 B 500
2016 B 1000
2016 B 2000
2016 B 3000
2017 A 21
2017 A 51
2017 A 101
2017 A 201
2017 A 501
2017 A 1001
2017 A 2001
2017 A 3001
2017 B 201
2017 B 501
2017 B 1001
2017 B 2001
2017 B 5001
2016 B 10001
2017 B 20001
2017 B 30001",
header=TRUE)
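I want to compute the first quartile, the median, and the third quartile within each YEAR + SECTOR combination.
For instance, the first quartile for SECTOR A and YEAR 2016 would return 5, based on (2, 5, 10, 20, 50, 100, 200, 300).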
One option is to group by 'YEAR' and 'SECTOR', store the fivenum() output of each subset in a tibble, unnest it, and then spread it to 'wide' format:
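library(dplyr)
library(tidyr)
df1 %>%
  group_by(YEAR, SECTOR) %>%
  # group_modify() (dplyr >= 0.8.1) returns a grouped tibble;
  # group_map() now returns a list, which cannot be unnested
  group_modify(~ .x %>%
    summarise(val = list(tibble(categ = c('1st quart', 'median', '3rd quart'),
                                val = fivenum(VALUE)[2:4])))) %>%
  unnest(val) %>%   # name the list column explicitly
  spread(categ, val)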
# A tibble: 4 x 5
# Groups: YEAR, SECTOR [4]
# YEAR SECTOR `1st quart` `3rd quart` median
# <int> <chr> <dbl> <dbl> <dbl>
#1 2016 A 7.5 150 35
#2 2016 B 100 2000 500
#3 2017 A 76 1501 351
#4 2017 B 751 12501 2001
df1 <- structure(list(YEAR = c(2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2016L, 2017L, 2017L), SECTOR = c("A",
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
"B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "B", "B"), VALUE = c(2L, 5L, 10L, 20L, 50L, 100L,
200L, 300L, 20L, 50L, 100L, 200L, 500L, 1000L, 2000L, 3000L,
21L, 51L, 101L, 201L, 501L, 1001L, 2001L, 3001L, 201L, 501L,
1001L, 2001L, 5001L, 10001L, 20001L, 30001L)), class = "data.frame",
row.names = c(NA,
-32L))
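As a possible simplification of the pipeline above, the three hinges can also be computed as separate summarise() columns, which avoids the list column and the unnest/spread steps. This is only a sketch, assuming dplyr 1.0.0 or later (for the .groups argument); the name hinges is made up for illustration, and df1 is rebuilt compactly so the block runs on its own.

```r
library(dplyr)

# Compact rebuild of df1 from this answer (same 32 rows, same order)
df1 <- data.frame(
  YEAR   = rep(c(2016L, 2017L, 2016L, 2017L), times = c(16, 13, 1, 2)),
  SECTOR = rep(c("A", "B", "A", "B"), each = 8),
  VALUE  = c(2, 5, 10, 20, 50, 100, 200, 300,
             20, 50, 100, 200, 500, 1000, 2000, 3000,
             21, 51, 101, 201, 501, 1001, 2001, 3001,
             201, 501, 1001, 2001, 5001, 10001, 20001, 30001)
)

# fivenum() returns min, lower hinge, median, upper hinge, max;
# elements 2 to 4 are the values of interest here
hinges <- df1 %>%
  group_by(YEAR, SECTOR) %>%
  summarise(`1st quart` = fivenum(VALUE)[2],
            median      = fivenum(VALUE)[3],
            `3rd quart` = fivenum(VALUE)[4],
            .groups = "drop")
hinges
```

The result has one row per YEAR/SECTOR pair and is already in wide format, so no reshaping is needed.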
How about this:
library(dplyr)
data %>%
group_by(SECTOR,YEAR) %>%
summarise(median = summary(VALUE)[3],
q1 = summary(VALUE)[2],
q3 = summary(VALUE)[5])
However, according to summary(), the first quartile of the example you gave should be 8.75.
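The gap between the 5 expected in the question and the 8.75 reported by summary() comes from the quantile definition in use. A minimal sketch on the SECTOR A / YEAR 2016 values: quantile() defaults to type = 7 (linear interpolation), while type = 1 (the inverse of the empirical distribution function) happens to reproduce the 5 from the question. Whether type 1 is the definition the asker had in mind is an assumption. fivenum() uses Tukey's hinges and gives yet another value. The variable names below are illustrative only.

```r
# VALUE for SECTOR A, YEAR 2016, taken from the question
x <- c(2, 5, 10, 20, 50, 100, 200, 300)

q_default <- quantile(x, probs = 0.25)            # type = 7, same as summary(): 8.75
q_type1   <- quantile(x, probs = 0.25, type = 1)  # inverse empirical CDF: 5
hinge     <- fivenum(x)[2]                        # Tukey's lower hinge: 7.5

c(default = unname(q_default), type1 = unname(q_type1), hinge = hinge)
```

The same type argument can be passed to quantile() inside summarise() if a specific convention is required.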
probs = c(0.25, 0.5, 0.75)
ans = Reduce(function(x1, x2) merge(x1, x2, by = c("YEAR", "SECTOR")),
lapply(probs, function(p)
aggregate(x = setNames(list(df1$VALUE), paste0("Q_",p)),
by = df1[c("YEAR", "SECTOR")],
FUN = function(x) quantile(x, probs = p))))
ans
# YEAR SECTOR Q_0.25 Q_0.5 Q_0.75
#1 2016 A 8.75 35 125
#2 2016 B 100.00 500 2000
#3 2017 A 88.50 351 1251
#4 2017 B 751.00 2001 12501
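The merge-per-probability approach above can probably be shortened, since aggregate() forwards extra arguments to FUN and quantile() accepts several probabilities at once. The following is a sketch under that assumption, not a replacement for the answer; ans2 and the column names Q1, median, Q3 are made up for illustration, and df1 is rebuilt compactly so the block is self-contained.

```r
# Compact rebuild of df1 (same 32 rows, same order as the structure() call)
df1 <- data.frame(
  YEAR   = rep(c(2016L, 2017L, 2016L, 2017L), times = c(16, 13, 1, 2)),
  SECTOR = rep(c("A", "B", "A", "B"), each = 8),
  VALUE  = c(2, 5, 10, 20, 50, 100, 200, 300,
             20, 50, 100, 200, 500, 1000, 2000, 3000,
             21, 51, 101, 201, 501, 1001, 2001, 3001,
             201, 501, 1001, 2001, 5001, 10001, 20001, 30001)
)

# One aggregate() call; quantile() returns three values per group,
# so VALUE comes back as a matrix column
ans2 <- aggregate(VALUE ~ YEAR + SECTOR, data = df1,
                  FUN = quantile, probs = c(0.25, 0.5, 0.75))

# Flatten the matrix column into ordinary columns and rename them
ans2 <- do.call(data.frame, ans2)
names(ans2) <- c("YEAR", "SECTOR", "Q1", "median", "Q3")
ans2
```

The row order may differ from the merge() result, so rows are best looked up by YEAR and SECTOR rather than by position.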
Another approach is to use the quantile() function with dplyr:
library(dplyr)
data %>%
  group_by(SECTOR, YEAR) %>%
  # quantile() returns 0%, 25%, 50%, 75%, 100%, so the quartiles are elements 2 to 4
  summarize(q1 = quantile(VALUE)[2],
            median = quantile(VALUE)[3],
            q3 = quantile(VALUE)[4])
## SECTOR YEAR q1 median q3
## <fct> <int> <dbl> <dbl> <dbl>
## 1 A 2016 8.75 35 125
## 2 A 2017 88.5 351 1251
## 3 B 2016 100 500 2000
## 4 B 2017 751 2001 12501