How to calculate percentile with group by?
I have a data.table with over ten thousand rows, and it looks like this:
DT1 <- data.table(ID = 1:10,
result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
ID result_2010 result_2011 years
1: 1 TRUE FALSE 15.0
2: 2 FALSE TRUE 16.5
3: 3 TRUE FALSE 31.0
4: 4 FALSE FALSE 1.0
5: 5 FALSE FALSE 40.2
6: 6 TRUE FALSE 0.3
7: 7 FALSE TRUE 12.0
8: 8 FALSE FALSE 22.7
9: 9 TRUE FALSE 19.0
10: 10 FALSE TRUE 12.0
For "result_2010" and "result_2011" I want to run a percentile analysis of "years", but only for the individuals whose value in that column is "TRUE". The code that I tried seems to work, but it gives back the same results for "result_2010" and "result_2011", which is certainly incorrect:
DT1 %>%
group_by(result_2010 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
DT1 %>%
group_by(result_2011 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
Could anyone help me correct my code?
Using melt and aggregate.
library(data.table)
melt(DT1, c(1, 4), 2:3) |>
transform(variable=substring(variable, 8)) |>
subset(value == TRUE) |>
with(aggregate(list(q=years), list(year=variable), \(x)
c(quantile(x), mean=mean(x))))
# year q.0% q.25% q.50% q.75% q.100% q.mean
# 1 2010 0.300 11.325 17.000 22.000 31.000 16.325
# 2 2011 12.000 12.000 12.000 14.250 16.500 13.500
Note: Please use R >= 4.1 for the |> pipes and the \(x) function shorthand notation (or write function(x) instead).
You may write a function and run it on every result column.
library(tidyverse)
cols <- grep('result_', names(DT1), value = TRUE)
get_stats_fun <- function(DT, col) {
DT %>%
filter(.data[[col]] == "TRUE") %>%
summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
"median" = round(median(years), 1),
"Mean" = round(mean(years),1)) %>%
unnest_wider(quantile)
}
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
mutate(Year = cols)
# Year `10%` `25%` `50%` `75%` `90%` median Mean
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 result_2010 4.7 11.3 17 22 27.4 17 16.3
#2 result_2011 12 12 12 14.2 15.6 12 13.5
A melt / dcast option:
library(data.table)
tmp <- melt(DT1, c("ID", "years"), variable.name = "year"
)[ value == "TRUE",
][, .(variable = c(paste0("q", c(10, 25, 50, 75, 90)), "mu"),
value = c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)),
mean(years)))
, by = .(year)]
tmp
# year variable value
# <fctr> <char> <num>
# 1: result_2010 q10 4.710
# 2: result_2010 q25 11.325
# 3: result_2010 q50 17.000
# 4: result_2010 q75 22.000
# 5: result_2010 q90 27.400
# 6: result_2010 mu 16.325
# 7: result_2011 q10 12.000
# 8: result_2011 q25 12.000
# 9: result_2011 q50 12.000
# 10: result_2011 q75 14.250
# 11: result_2011 q90 15.600
# 12: result_2011 mu 13.500
dcast(tmp, year ~ variable, value.var = "value")
# year mu q10 q25 q50 q75 q90
# <fctr> <num> <num> <num> <num> <num> <num>
# 1: result_2010 16.325 4.71 11.325 17 22.00 27.4
# 2: result_2011 13.500 12.00 12.000 12 14.25 15.6
You have complete control over the names: just assign them (in order) within the "variable" column (you might choose to name it better).
Or a solitary melt:
melt(DT1, c("ID", "years"), variable.name = "year"
)[ value == "TRUE",
][, setNames(as.list(c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)),
mean(years))),
c(paste0("q", c(10, 25, 50, 75, 90)), "mu"))
, by = .(year)][]
# year q10 q25 q50 q75 q90 mu
# <fctr> <num> <num> <num> <num> <num> <num>
# 1: result_2010 4.71 11.325 17 22.00 27.4 16.325
# 2: result_2011 12.00 12.000 12 14.25 15.6 13.500
Names are again controlled easily, now in the 2nd argument of setNames. The premise is that returning a named list in data.table processing will convert it to named columns, so any function that does this is easily usable.
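To illustrate that premise, here is a minimal sketch of a helper that returns a named list and is then used directly in j. The name q_stats and its probs argument are hypothetical (they do not appear in the code above), and the sample data from the question is rebuilt so the block runs on its own.

```r
library(data.table)

# Hypothetical helper: returns a *named list*, which data.table
# turns into one named column per element when used in `j`.
q_stats <- function(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9)) {
  out <- c(quantile(x, probs, names = FALSE), mean(x))
  setNames(as.list(out), c(paste0("q", probs * 100), "mu"))
}

# Sample data from the question.
DT1 <- data.table(
  ID = 1:10,
  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE",
                  "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE",
                  "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12)
)

# Filter in `i`, summarise in `j`; the list names become column names.
res <- DT1[result_2010 == "TRUE", q_stats(years)]
res
```

The same helper can be dropped into the melted version with by = .(year), in place of the inline setNames(as.list(...)) call.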
library(tidyverse)
DT1 <- tibble(ID = 1:10,
result_2010 = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
result_2011 = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
fQuantMean = function(x) t(quantile(x)) %>%
as_tibble() %>% bind_cols(mean = mean(x))
tibble(
year = c(2010, 2011),
data = list(DT1$years[DT1$result_2010],
DT1$years[DT1$result_2011])
) %>% group_by(year) %>%
group_modify(~fQuantMean(.x$data[[1]]))
Output:
# A tibble: 2 x 7
# Groups: year [2]
year `0%` `25%` `50%` `75%` `100%` mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2010 0.3 11.3 17 22 31 16.3
2 2011 12 12 12 14.2 16.5 13.5
An update for anyone interested!
Hello dear colleagues. As you can see, each task can be solved in several different ways, so I decided to compare the methods proposed here. Since @Gabesz mentioned that he has 10000 observations, I decided to check each of the solutions in terms of performance.
n=10000
set.seed(1234)
DT1 <- tibble(ID = 1:n,
result_2010 = sample(c(TRUE, FALSE), n, replace = TRUE),
result_2011 = sample(c(TRUE, FALSE), n, replace = TRUE),
years = rnorm(n, 20, 5))
Then I did a little benchmark:
fQuantMean = function(x) t(quantile(x)) %>%
as_tibble() %>% bind_cols(mean = mean(x))
fFiolka = function(){
tibble(
year = c(2010, 2011),
data = list(DT1$years[DT1$result_2010],
DT1$years[DT1$result_2011])
) %>% group_by(year) %>%
group_modify(~fQuantMean(.x$data[[1]]))
}
fFiolka()
# # A tibble: 2 x 7
# # Groups: year [2]
# year `0%` `25%` `50%` `75%` `100%` mean
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2010 -0.00697 16.4 19.9 23.3 38.6 19.9
# 2 2011 -0.633 16.5 20.0 23.4 38.6 20.0
library(data.table)
fjay_sf = function(){
melt(DT1, c(1, 4), 2:3) |>
transform(variable=substring(variable, 8)) |>
subset(value == TRUE) |>
with(aggregate(list(q=years), list(year=variable), \(x)
c(quantile(x), mean=mean(x))))
}
fjay_sf()
# year q.0% q.25% q.50% q.75% q.100% q.mean
# 1 2010 -0.006968224 16.447077579 19.947385976 23.348571278 38.636456902 19.944574420
# 2 2011 -0.633138113 16.530534403 20.043636844 23.424378551 38.636456902 20.013130400
# Warning message:
# In melt(DT1, c(1, 4), 2:3) :
# The melt generic in data.table has been passed a tbl_df and will attempt to redirect
# to the relevant reshape2 method; please note that reshape2 is deprecated, and this
# redirection is now deprecated as well. To continue using melt methods from reshape2
# while both libraries are attached, e.g. melt.list, you can prepend the namespace
# like reshape2::melt(DT1). In the next version, this warning will become an error.
cols <- grep('result_', names(DT1), value = TRUE)
get_stats_fun <- function(DT, col) {
DT %>%
filter(.data[[col]] == "TRUE") %>%
summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
"median" = round(median(years), 1),
"Mean" = round(mean(years),1)) %>%
unnest_wider(quantile)
}
fShah = function(){
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
mutate(Year = cols)
}
fShah()
# # A tibble: 2 x 8
# Year `10%` `25%` `50%` `75%` `90%` median Mean
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 result_2010 13.5 16.4 19.9 23.3 26.4 19.9 19.9
# 2 result_2011 13.4 16.5 20 23.4 26.6 20 20
library(microbenchmark)
ggplot2::autoplot(microbenchmark(fFiolka(), fjay_sf(), fShah(), times=100))
Hope the chart above explains it all.
@r2evans please don't blame me for skipping your solution, but it caused me some errors.
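One possible cause of those errors (an assumption, it was not verified in this thread): in this benchmark DT1 is a tibble, while the melt(...)[...] chain relies on data.table methods, as the melt warning printed above also hints. A minimal sketch, assuming that is the cause, converts the data with as.data.table() first. The names DF1 and f_r2evans are hypothetical, and a plain data.frame stands in for the tibble so the block needs only data.table; as.data.table() accepts a tibble as well.

```r
library(data.table)

# Same simulated data as in the benchmark, built as a data.frame.
n <- 10000
set.seed(1234)
DF1 <- data.frame(ID = 1:n,
                  result_2010 = sample(c(TRUE, FALSE), n, replace = TRUE),
                  result_2011 = sample(c(TRUE, FALSE), n, replace = TRUE),
                  years = rnorm(n, 20, 5))

# Hypothetical wrapper around the "solitary melt" approach:
# convert to data.table first so melt() and `[` dispatch correctly.
f_r2evans <- function(d) {
  DT <- as.data.table(d)
  long <- melt(DT, id.vars = c("ID", "years"), variable.name = "year")
  long[value == TRUE,
       setNames(as.list(c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)),
                          mean(years))),
                c(paste0("q", c(10, 25, 50, 75, 90)), "mu")),
       by = year]
}

res <- f_r2evans(DF1)
res
```

With a wrapper like this, f_r2evans(DT1) could be added to the microbenchmark call next to the other three functions.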
This will be my first answer, so please forgive me if I do something wrong. Reading your question carefully, I understand that you wanted someone to help you correct your code. Here it is:
library(tidyverse)
library(data.table)
DT1 <- data.table(ID = 1:10,
result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
DT1 %>%
filter(result_2010 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
DT1 %>%
filter(result_2011 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
In the first case, it returns the values 4.7, 11.3, 17, 22, 27.4, 16.3. In the second case, it returns 12, 12, 12, 14.2, 15.6, 13.5. I see so many different answers here, although I honestly admit that I don't understand some of them (yet). I really like the solution with quantile %>% tibble %>% bind_cols. Unfortunately, my reputation is too low to mark it as helpful.