Using cut() and quantile() to bucket continuous columns in R
I have the following generated df with ages and weights:
df = data.frame(
  Age = sample(18:98, 10000, replace = TRUE),
  Weight = sample(80:250, 10000, replace = TRUE)
)
I want to alter the continuous columns by creating buckets based on the quantiles (25%, 50%, 75%). The quantiles themselves can be computed like so:
> quantile(df$Age, probs = c(0.25,0.5,0.75))
25% 50% 75%
39 58 78
However, I want to use the cut function with these quantiles (25%, 50%, 75%). How can I do this? I want the output to look something like this, where every continuous variable is converted to buckets based on the quantiles (25%, 50%, 75%):
Age Weight
(17.9,44.7] (137,193]
(44.7,71.3] (137,193]
(71.3,98.1] (79.8,137]
(44.7,71.3] (193,250]
(17.9,44.7] (79.8,137]
Just pass your quantiles as the second argument to cut, adding the 0 and 1 quantiles so that your cuts have lower and upper bounds (i.e. c(0, 0.25, 0.5, 0.75, 1), which can be concisely written as 0:4 / 4). Because cut() builds intervals that are open on the left by default, also pass include.lowest = TRUE; otherwise the minimum value of each column falls outside the first bucket and becomes NA.
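To see why both the extra 0 and 1 quantiles and the handling of the lowest value matter, here is a minimal sketch on a toy vector. The values are illustrative and not taken from the question.

```r
# Toy vector (illustrative values, not from the question)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Breaks at the 0%, 25%, 50%, 75% and 100% quantiles
breaks <- quantile(x, probs = 0:4 / 4)

# Default: intervals are (lo, hi], so the minimum is not covered
default_bins <- cut(x, breaks)
is.na(default_bins[1])   # TRUE: x[1] equals the lowest break

# include.lowest = TRUE closes the first interval on the left
closed_bins <- cut(x, breaks, include.lowest = TRUE)
anyNA(closed_bins)       # FALSE
table(closed_bins)
```

With the closed first interval, every value lands in one of the four buckets.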
Tidyverse version
library(dplyr)
as_tibble(df) %>%
  mutate(across(everything(), .fns = function(x) cut(x, quantile(x, 0:4/4), include.lowest = TRUE)))
#> # A tibble: 10,000 x 2
#>    Age     Weight
#>    <fct>   <fct>
#>  1 [18,38] [80,121]
#>  2 (78,98] (121,165]
#>  3 [18,38] (121,165]
#>  4 (58,78] (208,250]
#>  5 (58,78] (165,208]
#>  6 (78,98] [80,121]
#>  7 (38,58] (165,208]
#>  8 (58,78] [80,121]
#>  9 (38,58] (165,208]
#> 10 (58,78] (121,165]
#> # ... with 9,990 more rows
Base R version
df$Age <- cut(df$Age, quantile(df$Age, 0:4/4), include.lowest = TRUE)
df$Weight <- cut(df$Weight, quantile(df$Weight, 0:4/4), include.lowest = TRUE)
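Since the question asks for every continuous column to be bucketed, the base R approach could be generalised along these lines. This is only a sketch: bucket_by_quantile is a hypothetical helper name, the seed and the 1000-row sample size are arbitrary, and the unique() call is an added assumption to cope with tied quantiles, which cut() would otherwise reject as non-unique breaks.

```r
# Hypothetical helper (not from the original answer): bucket one numeric
# vector into quantile-based intervals.
bucket_by_quantile <- function(x, probs = 0:4 / 4) {
  # unique() guards against tied quantiles; this may yield fewer than
  # length(probs) - 1 buckets. Assumes at least two distinct break values.
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks, include.lowest = TRUE)
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(
  Age    = sample(18:98, 1000, replace = TRUE),
  Weight = sample(80:250, 1000, replace = TRUE)
)

# Apply the helper to every numeric column; indexing with df[is_num]
# keeps df a data.frame and leaves non-numeric columns untouched
is_num <- vapply(df, is.numeric, logical(1))
df[is_num] <- lapply(df[is_num], bucket_by_quantile)

str(df)
```

Using lapply() rather than a loop over column names keeps the code the same no matter how many numeric columns the data frame has.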
My santoku package has chop_quantiles():
library(santoku)
# lapply() keeps each column a factor; apply() would coerce the result to character
df[] <- lapply(df, chop_quantiles, 0:4/4)
or even simpler:
df[] <- lapply(df, chop_equally, 4)
The empty brackets are a trick which keeps df as a data.frame.
If you want the raw values in your labels, you can do:
df[] <- lapply(df, chop_equally, 4, labels = lbl_intervals(raw = TRUE))