Calculate median by group in multiple-year increments in R
I'm trying to use dplyr to calculate medians, grouping by 3 different columns and in 3-year increments.
My data looks like this:
data <- data.frame(
  Year  = c("1990", "1990", "1992", "1993", "1994", "1990", "1991", "1990",
            "1991", "1992", "1994", "1995"),
  Type  = c("Al", "Al", "Al", "Al", "Al", "Al", "Al", "Cu", "Cu", "Cu", "Cu", "Cu"),
  Frac  = c("F", "F", "F", "F", "F", "UF", "UF", "F", "F", "UF", "UF", "UF"),
  Value = c(0.1, 0.2, 0.3, 0.6, 0.7, 1.3, 1.5, 0.4, 0.2, 0.9, 2.3, 2.9))
I would like to calculate the median of "Value" in 3-year groupings, also grouping by "Type" and "Frac".
The problem is that sometimes a year is missing, so I want the grouping to run in 3-year increments based on the years I actually have. With my example data, the groups would be (1990, 1992, 1993) for Al and F, then just (1994) for Al and F since there is no more data for that combination, then (1990, 1991) for Al and UF since there are only 2 years' worth of data. So basically I want groups of 3 years where possible, and otherwise whatever is left over.
This is the final table I would like to have:
stats_wanted <- data.frame(
  Year   = c("1990, 1992, 1993", "1994", "1990, 1991", "1990, 1991", "1992, 1994, 1995"),
  Type   = c("Al", "Al", "Al", "Cu", "Cu"),
  Frac   = c("F", "F", "UF", "F", "UF"),
  Median = c(0.25, 0.7, 1.4, 0.3, 2.3))
Hopefully this makes sense... let me know if you have any questions :)!
I do not know dplyr, but here is a data.table solution.
library(data.table)
setDT(data)
data = data[order(Type,Frac,Year)]
# data = data[order(Year)] also works fine
data[
!duplicated(.SD,by=c('Year','Type','Frac')),
yeargroup:=0:(.N-1) %/% 3,
.(Type,Frac)]
# !duplicated(...) keeps only the first row for each Year, Type, Frac combination
# 0:(.N-1) numbers those rows 0 to N-1 within each Type, Frac group
# %/% 3 is integer division (the quotient, not the remainder), so 0-2 map to 0, 3-5 map to 1, and so on
> data
Year Type Frac Value yeargroup
1: 1990 Al F 0.1 0
2: 1990 Al F 0.2 NA <- NA because dupe Year,Type,Frac
3: 1992 Al F 0.3 0
4: 1993 Al F 0.6 0
5: 1994 Al F 0.7 1
6: 1990 Al UF 1.3 0
7: 1991 Al UF 1.5 0
8: 1990 Cu F 0.4 0
9: 1991 Cu F 0.2 0
10: 1992 Cu UF 0.9 0
11: 1994 Cu UF 2.3 0
12: 1995 Cu UF 2.9 0
# handle duplicate Year, Type, Frac rows by copying the group number onto them:
data[,yeargroup:=max(yeargroup,na.rm=TRUE),.(Year,Type,Frac)]
> data
Year Type Frac Value yeargroup
1: 1990 Al F 0.1 0
2: 1990 Al F 0.2 0 <- fixed NA
3: 1992 Al F 0.3 0
4: 1993 Al F 0.6 0
5: 1994 Al F 0.7 1
6: 1990 Al UF 1.3 0
7: 1991 Al UF 1.5 0
8: 1990 Cu F 0.4 0
9: 1991 Cu F 0.2 0
10: 1992 Cu UF 0.9 0
11: 1994 Cu UF 2.3 0
12: 1995 Cu UF 2.9 0
stats_wanted = data[,
.(Year=paste0(unique(Year),collapse=', '),Median=median(Value)),
.(Type,Frac,yeargroup)]
> stats_wanted
Type Frac yeargroup Year Median
1: Al F 0 1990, 1992, 1993 0.25
2: Al F 1 1994 0.70
3: Al UF 0 1990, 1991 1.40
4: Cu F 0 1990, 1991 0.30
5: Cu UF 0 1992, 1994, 1995 2.30
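To see why integer division yields the 3-year buckets, here is a minimal sketch in plain R. The names `idx` and `chunk` are made up for illustration and are not part of the answer above.

```r
# Hypothetical zero-based row indices within one Type, Frac group
idx <- 0:6

# Integer division: every run of 3 consecutive indices shares one quotient
chunk <- idx %/% 3
print(chunk)      # 0 0 0 1 1 1 2

# For contrast, %% is the remainder operator and cycles instead of bucketing
print(idx %% 3)   # 0 1 2 0 1 2 0
```

Any leftover rows (here the single index 6) simply land in a final, smaller bucket, which is the "whatever is left over" behaviour asked for in the question.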
PS: @ronak-shah posted a concise dplyr solution, which inspired me to post another data.table solution that is even more concise:
> data[
order(Year),
.(Year,Value,group=(rleid(Year)-1)%/%3),
.(Type,Frac)
][,
.(Year=paste0(unique(Year),collapse=', '),Median=median(Value)),
.(Type,Frac,group)
]
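As a rough illustration of what `rleid(Year)` contributes in the one-liner above, the sketch below applies it to a made-up vector (`years` and `run_id` are hypothetical names) that mirrors the sorted Al / F years from the example data.

```r
library(data.table)

# Hypothetical sorted years for one Type, Frac group; 1990 appears twice
years <- c("1990", "1990", "1992", "1993", "1994")

# rleid() gives each run of identical consecutive values the same id,
# so duplicated years do not use up a slot in the 3-year bucket
run_id <- rleid(years)
print(run_id)                # 1 1 2 3 4

# Shift to zero-based and integer-divide to get the bucket number
bucket <- (run_id - 1) %/% 3
print(bucket)                # 0 0 0 0 1
```

This only behaves as intended when equal years are adjacent, which is presumably why the one-liner sorts with `order(Year)` first.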
Here's a dplyr solution.
For each Type and Frac, we create a group column which assigns the same number to every 3 distinct Year values. For each group, we concatenate the Year values and calculate the median.
library(dplyr)
data %>%
  group_by(Type, Frac) %>%
  mutate(group = match(Year, unique(Year)),
         group = ceiling(group/3)) %>%
  group_by(group, .add = TRUE) %>%
  summarise(Year = toString(unique(Year)),
            Median = median(Value), .groups = 'drop') %>%
  select(Year, Type, Frac, Median)
# Year Type Frac Median
# <chr> <chr> <chr> <dbl>
#1 1990, 1992, 1993 Al F 0.25
#2 1994 Al F 0.7
#3 1990, 1991 Al UF 1.4
#4 1990, 1991 Cu F 0.3
#5 1992, 1994, 1995 Cu UF 2.3
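The two `mutate()` steps can be checked in isolation with base R. This is only a sketch: `years`, `pos` and `grp` are made-up names, and the vector mirrors the Al / F rows of the example data.

```r
# Hypothetical Year values for one Type, Frac group; 1990 appears twice
years <- c("1990", "1990", "1992", "1993", "1994")

# Position of each year among the distinct years, in order of first appearance
pos <- match(years, unique(years))
print(pos)    # 1 1 2 3 4

# Every run of 3 distinct years shares one group number
grp <- ceiling(pos / 3)
print(grp)    # 1 1 1 1 2
```

Because `unique()` keeps values in order of first appearance, the buckets are chronological only if the rows are already sorted by Year within each group, as they are in the example data. For unsorted input, adding `arrange(Year)` before the `mutate()` step would be one way to handle it.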