
Calculate median grouping in multiple year increments R

I'm trying to use dplyr to calculate medians, grouping by 3 different columns and in 3-year increments.

My data looks like this:

data <- data.frame("Year" = c("1990","1990", "1992", "1993", "1994", "1990", "1991", "1990", 
"1991", "1992", "1994", "1995"),"Type" = c("Al", "Al", "Al", "Al", "Al", "Al", "Al", "Cu", 
"Cu", "Cu", "Cu", "Cu"), "Frac" = c("F", "F", "F", "F", "F", "UF", "UF", "F", "F", "UF", 
"UF", "UF"), "Value" = c(0.1, 0.2, 0.3, 0.6, 0.7, 1.3, 1.5, 0.4, 0.2, 0.9, 2.3, 2.9))        

I would like to calculate the median of "Value" in 3-year groupings, also grouping by "Type" and "Frac".

The problem is that sometimes a year is missing, so I want it to group in 3-year increments based on the data that I have. To show what I mean with my example data, it would be grouped like this: (1990, 1992, 1993) for Al and F, then just (1994) for Al and F since there's no more data for Al and F, then (1990, 1991) for Al and UF since there are only 2 years' worth of data. So basically I want it grouped by 3 years if possible, and if not, whatever is left over forms the last group.

This is the end table I would like to have:

stats_wanted <- data.frame("Year" = c("1990, 1992, 1993", "1994", "1990, 1991", 
"1990, 1991", "1992, 1994, 1995"), "Type" = c("Al", "Al", "Al", "Cu", "Cu"), "Frac" = 
c("F", "F", "UF", "F", "UF"), "Median" = c(0.25, 0.7, 1.4, 0.3, 2.3))

Hopefully this makes sense... let me know if you have any questions :)!

I do not know dplyr, but here is a data.table solution.

library(data.table)
setDT(data)
data = data[order(Type,Frac,Year)]
# data = data[order(Year)] also works fine
data[
  !duplicated(.SD,by=c('Year','Type','Frac')),
  yeargroup:=0:(.N-1) %/% 3,
  .(Type,Frac)]
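# !duplicated... selects only the first row for each Year,Type,Frac combination
# 0:(.N-1) gives 0 to N-1 for each Type,Frac group
# %/% 3 is integer division (the quotient, not the remainder),
# so positions 0,1,2 fall in group 0, positions 3,4,5 in group 1, and so on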

> data
    Year Type Frac Value yeargroup
 1: 1990   Al    F   0.1         0
 2: 1990   Al    F   0.2        NA <- NA because dupe Year,Type,Frac
 3: 1992   Al    F   0.3         0
 4: 1993   Al    F   0.6         0
 5: 1994   Al    F   0.7         1
 6: 1990   Al   UF   1.3         0
 7: 1991   Al   UF   1.5         0
 8: 1990   Cu    F   0.4         0
 9: 1991   Cu    F   0.2         0
10: 1992   Cu   UF   0.9         0
11: 1994   Cu   UF   2.3         0
12: 1995   Cu   UF   2.9         0
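
The grouping index above relies on `%/%`, which is R's integer-division operator and is easy to confuse with the remainder operator `%%`. As a quick sanity check, here is a minimal sketch for a hypothetical group with five distinct years (the variable names are only for illustration):

```r
# Positions 0..4 within one Type/Frac group that has five distinct years
idx <- 0:4

quotient  <- idx %/% 3   # integer division: 0 0 0 1 1 (the group index)
remainder <- idx %% 3    # remainder, for contrast: 0 1 2 0 1

print(quotient)
print(remainder)
```

The quotient stays constant across each run of three positions, which is what makes it usable as a group id.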

# handle dupe Year,Type,Frac rows:
data[,yeargroup:=max(yeargroup,na.rm=TRUE),.(Year,Type,Frac)]

> data
    Year Type Frac Value yeargroup
 1: 1990   Al    F   0.1         0
 2: 1990   Al    F   0.2         0 <- fixed NA
 3: 1992   Al    F   0.3         0
 4: 1993   Al    F   0.6         0
 5: 1994   Al    F   0.7         1
 6: 1990   Al   UF   1.3         0
 7: 1991   Al   UF   1.5         0
 8: 1990   Cu    F   0.4         0
 9: 1991   Cu    F   0.2         0
10: 1992   Cu   UF   0.9         0
11: 1994   Cu   UF   2.3         0
12: 1995   Cu   UF   2.9         0

stats_wanted = data[,
  .(Year=paste0(unique(Year),collapse=', '),Median=median(Value)),
  .(Type,Frac,yeargroup)]

> stats_wanted
   Type Frac yeargroup             Year Median
1:   Al    F         0 1990, 1992, 1993   0.25
2:   Al    F         1             1994   0.70
3:   Al   UF         0       1990, 1991   1.40
4:   Cu    F         0       1990, 1991   0.30
5:   Cu   UF         0 1992, 1994, 1995   2.30

PS: @ronak-shah posted a concise dplyr solution, which inspired me to post another data.table solution that is even more concise:

data[
  order(Year),
  .(Year,Value,group=(rleid(Year)-1)%/%3),
  .(Type,Frac)
][,
  .(Year=paste0(unique(Year),collapse=', '),Median=median(Value)),
  .(Type,Frac,group)
]
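
To see why `rleid()` handles the duplicated year without a separate fix-up step, here is a small sketch on the Al/F years alone. It assumes the years are already sorted, which `order(Year)` takes care of in the query above; `run_id` and `group` are illustrative names.

```r
library(data.table)

# Years for the Al/F pair, sorted; 1990 appears twice
years <- c("1990", "1990", "1992", "1993", "1994")

# rleid() gives each run of identical consecutive values the same id
run_id <- rleid(years)          # 1 1 2 3 4

# Shift to zero-based, then integer-divide to get three distinct years per group
group <- (run_id - 1) %/% 3     # 0 0 0 0 1

print(data.frame(years, run_id, group))
```

Both 1990 rows land in the same group because they share a run id, so the median is taken over all four values from 1990, 1992 and 1993.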

Here's a dplyr solution:

For each Type and Frac, we create a group column which assigns the same number to every 3 distinct Year values. For each group, we concatenate the Year values and calculate the median.

library(dplyr)

data %>%
  group_by(Type, Frac) %>%
  mutate(group = match(Year, unique(Year)), 
         group = ceiling(group/3)) %>%
  group_by(group, .add = TRUE) %>%
  summarise(Year = toString(unique(Year)), 
            Median = median(Value), .groups = 'drop') %>%
  select(Year, Type, Frac, Median)

#  Year             Type  Frac  Median
#  <chr>            <chr> <chr>  <dbl>
#1 1990, 1992, 1993 Al    F       0.25
#2 1994             Al    F       0.7 
#3 1990, 1991       Al    UF      1.4 
#4 1990, 1991       Cu    F       0.3 
#5 1992, 1994, 1995 Cu    UF      2.3 
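
One caveat: `match(Year, unique(Year))` numbers the years in order of first appearance, so the grouping depends on the rows being sorted by Year within each Type/Frac pair, as they are in the sample data. Below is a sketch of a variant that should not depend on row order. It swaps in `dense_rank()` on the integer year, which is a variation and not part of the answer above, and it reverses the rows only to mimic unsorted input.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  Year  = c("1990", "1990", "1992", "1993", "1994", "1990", "1991",
            "1990", "1991", "1992", "1994", "1995"),
  Type  = c("Al", "Al", "Al", "Al", "Al", "Al", "Al",
            "Cu", "Cu", "Cu", "Cu", "Cu"),
  Frac  = c("F", "F", "F", "F", "F", "UF", "UF",
            "F", "F", "UF", "UF", "UF"),
  Value = c(0.1, 0.2, 0.3, 0.6, 0.7, 1.3, 1.5, 0.4, 0.2, 0.9, 2.3, 2.9),
  stringsAsFactors = FALSE
)

# Reverse the rows to simulate input that is not sorted by year
unsorted <- data[nrow(data):1, ]

result <- unsorted %>%
  group_by(Type, Frac) %>%
  # dense_rank() ranks distinct years 1, 2, 3, ... regardless of row order
  mutate(group = (dense_rank(as.integer(Year)) - 1) %/% 3) %>%
  group_by(group, .add = TRUE) %>%
  summarise(Year = toString(sort(unique(Year))),
            Median = median(Value), .groups = "drop") %>%
  select(Year, Type, Frac, Median)

print(result)
```

On the sample data this should give the same five rows as the table above. If the input is known to be sorted, the original `match()` version is simpler.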
