简体   繁体   English

如何用group by计算百分位数?

[英]How to calculate percentile with group by?

I have a data.table with over ten thousand of rows and it looks like this:我有一个超过一万行的 data.table,它看起来像这样:

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

    ID result_2010 result_2011 years
 1:  1        TRUE       FALSE  15.0
 2:  2       FALSE        TRUE  16.5
 3:  3        TRUE       FALSE  31.0
 4:  4       FALSE       FALSE   1.0
 5:  5       FALSE       FALSE  40.2
 6:  6        TRUE       FALSE   0.3
 7:  7       FALSE        TRUE  12.0
 8:  8       FALSE       FALSE  22.7
 9:  9        TRUE       FALSE  19.0
10: 10       FALSE        TRUE  12.0

For "result_2010" and "result_2011" I want to make a percentile analysis of the "years" but only if the value in for the individual is "TRUE".对于“result_2010”和“result_2011”,我想对“年份”进行百分位分析,但前提是个人的值为“TRUE”。 The code that I tried seems to work, but it gives back the same results for "result_2010" and "result_2011", which is for sure incorrect:我尝试的代码似乎有效,但它为“result_2010”和“result_2011”返回了相同的结果,这肯定是不正确的:

DT1 %>%
  group_by(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  group_by(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

Could anyone help how to correct my code?任何人都可以帮助如何更正我的代码?

Using melt and aggregate .使用meltaggregate

library(data.table)
melt(DT1, c(1, 4), 2:3) |>
  transform(variable=substring(variable, 8)) |>
  subset(value == TRUE) |>
  with(aggregate(list(q=years), list(year=variable), \(x)
                 c(quantile(x), mean=mean(x))))
#   year   q.0%  q.25%  q.50%  q.75% q.100% q.mean
# 1 2010  0.300 11.325 17.000 22.000 31.000 16.325
# 2 2011 12.000 12.000 12.000 14.250 16.500 13.500

Note: Please use R>=4.1 for the |> pipes and \\(x) function shorthand notation (or write function(x) ).注意:请使用R>=4.1作为|>管道和\\(x)函数简写符号(或写function(x) )。

You may write-up a function and run it on every result column.您可以编写一个函数并在每个result列上运行它。

library(tidyverse)

cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)

#  Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#  <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#1 result_2010   4.7  11.3    17  22    27.4     17  16.3
#2 result_2011  12    12      12  14.2  15.6     12  13.5

A melt / dcast option: melt / dcast选项:

library(data.table)
tmp <- melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, .(variable = c(paste0("q", c(10, 25, 50, 75, 90)), "mu"),
         value = c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                  mean(years)))
    , by = .(year)]
tmp
#            year variable  value
#          <fctr>   <char>  <num>
#  1: result_2010      q10  4.710
#  2: result_2010      q25 11.325
#  3: result_2010      q50 17.000
#  4: result_2010      q75 22.000
#  5: result_2010      q90 27.400
#  6: result_2010       mu 16.325
#  7: result_2011      q10 12.000
#  8: result_2011      q25 12.000
#  9: result_2011      q50 12.000
# 10: result_2011      q75 14.250
# 11: result_2011      q90 15.600
# 12: result_2011       mu 13.500

dcast(tmp, year ~ variable, value.var = "value")
#           year     mu   q10    q25   q50   q75   q90
#         <fctr>  <num> <num>  <num> <num> <num> <num>
# 1: result_2010 16.325  4.71 11.325    17 22.00  27.4
# 2: result_2011 13.500 12.00 12.000    12 14.25  15.6

You have complete control over the names, just assign then (in order) within the "variable" column (you might choose to name it better).您可以完全控制名称,只需在"variable"列中(按顺序)分配(您可以选择更好地命名)。

Or a solitary melt :或单独melt

melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, setNames(as.list(c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                          mean(years))),
                c(paste0("q", c(10, 25, 50, 75, 90)), "mu"))
    , by = .(year)][]
#           year   q10    q25   q50   q75   q90     mu
#         <fctr> <num>  <num> <num> <num> <num>  <num>
# 1: result_2010  4.71 11.325    17 22.00  27.4 16.325
# 2: result_2011 12.00 12.000    12 14.25  15.6 13.500

Names are again controlled easily, now in the 2nd argument of setNames .名称再次很容易控制,现在在setNames的第二个参数中。 The premise is that returning a named- list in data.table processing will convert it to named columns, so any function that does this is easily usable.前提是在data.table处理中返回命名list会将其转换为命名列,因此执行此操作的任何函数都很容易使用。

library(tidyverse)
DT1 <- tibble(ID = 1:10,
                  result_2010 = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
                  result_2011 = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

tibble(
  year = c(2010, 2011),
  data = list(DT1$years[DT1$result_2010],
              DT1$years[DT1$result_2011])
) %>% group_by(year) %>% 
  group_modify(~fQuantMean(.x$data[[1]]))

output输出

# A tibble: 2 x 7
# Groups:   year [2]
   year  `0%` `25%` `50%` `75%` `100%`  mean
  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1  2010   0.3  11.3    17  22     31    16.3
2  2011  12    12      12  14.2   16.5  13.5

An update for anyone interested!任何有兴趣的人的更新!

Hello dear colleagues.亲爱的同事你好。 As you can see, each task can be solved in several different ways.如您所见,每个任务都可以通过几种不同的方式来解决。 So I decided to compare the methods proposed by here.所以我决定比较这里提出的方法。 Since @Gabesz mentioned that he has 10000 observations, I decided to check each of the solutions in terms of performance.由于@Gabesz 提到他有 10000 次观察,我决定在性能方面检查每个解决方案。

n=10000
set.seed(1234)
DT1 <- tibble(ID = 1:n,
              result_2010 = sample(c(TRUE, FALSE), n, replace = TRUE),
              result_2011 = sample(c(TRUE, FALSE), n, replace = TRUE),
              years = rnorm(n, 20, 5))

Then I did a little benchmark然后我做了一个小基准

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

fFiolka = function(){
  tibble(
    year = c(2010, 2011),
    data = list(DT1$years[DT1$result_2010],
                DT1$years[DT1$result_2011])
  ) %>% group_by(year) %>% 
    group_modify(~fQuantMean(.x$data[[1]]))
}
fFiolka()
# # A tibble: 2 x 7
# # Groups:   year [2]
#    year     `0%` `25%` `50%` `75%` `100%`  mean
#    <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1  2010 -0.00697  16.4  19.9  23.3   38.6  19.9
# 2  2011 -0.633    16.5  20.0  23.4   38.6  20.0

library(data.table)

fjay_sf = function(){
  melt(DT1, c(1, 4), 2:3) |>
    transform(variable=substring(variable, 8)) |>
    subset(value == TRUE) |>
    with(aggregate(list(q=years), list(year=variable), \(x)
                   c(quantile(x), mean=mean(x))))
}
fjay_sf()
# year         q.0%        q.25%        q.50%        q.75%       q.100%       q.mean
# 1 2010 -0.006968224 16.447077579 19.947385976 23.348571278 38.636456902 19.944574420
# 2 2011 -0.633138113 16.530534403 20.043636844 23.424378551 38.636456902 20.013130400
# Warning message:
#   In melt(DT1, c(1, 4), 2:3) :
#   The melt generic in data.table has been passed a tbl_df and will attempt to redirect 
#   to the relevant reshape2 method; please note that reshape2 is deprecated, and this 
#   redirection is now deprecated as well. To continue using melt methods from reshape2
#    while both libraries are attached, e.g. melt.list, you can prepend the namespace 
#    like reshape2::melt(DT1). In the next version, this warning will become an error.


cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

fShah = function(){
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)
}
fShah()
# # A tibble: 2 x 8
#   Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1 result_2010  13.5  16.4  19.9  23.3  26.4   19.9  19.9
# 2 result_2011  13.4  16.5  20    23.4  26.6   20    20  

library(microbenchmark)
ggplot2::autoplot(microbenchmark(fFiolka(), fjay_sf(), fShah(), times=100))

在此处输入图片说明

Hope the chart above explains it all.希望上面的图表能说明一切。

@r2evans please don't blame me for skipping your solution but it caused me some errors. @r2evans 请不要怪我跳过您的解决方案,但这给我带来了一些错误。

This will be my first answer, so please forgive me if I do something wrong.这将是我的第一个答案,所以如果我做错了什么,请原谅我。 By reading your question carefully, you wanted someone to help you improve your code.通过仔细阅读您的问题,您希望有人帮助您改进代码。 Here it is, please.在这里,请。

library(tidyverse)
library(data.table)

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
DT1 %>%
  filter(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  filter(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

In the first case, it returns the values 4.7, 11.3, 17, 22, 27.4, 16.3.在第一种情况下,它返回值 4.7、11.3、17、22、27.4、16.3。 In the second case, it returns 12, 12, 12, 14.2, 15.6, 13.5.在第二种情况下,它返回 12、12、12、14.2、15.6、13.5。 I see so many different answers here.我在这里看到了很多不同的答案。 Although I honestly admit some of them I don't understand (yet).虽然我老实承认其中一些我不明白(还)。 I really like the solution with quantile%>% tibble%>% bind_cols.我真的很喜欢 quantile%>% tibble%>% bind_cols 的解决方案。 But knock on what I have a low reputation for pointing to this as helpful.但是,我认为这很有帮助而声名狼藉。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM