Using cut() and quantile() to bucket continuous columns in R

I have the following generated df with ages and weights:

df = data.frame(
    Age = sample(18:98, 10000, replace = TRUE),
    Weight = sample(80:250, 10000, replace = TRUE)
)

I want to alter the continuous columns by creating buckets based on the quantiles (25%, 50%, 75%). The quantiles themselves can be computed like so:

> quantile(df$Age, probs = c(0.25,0.5,0.75))
25% 50% 75% 
 39  58  78

However, I want to pass these quantiles (25%, 50%, 75%) to the cut() function.

How can I do this? I want the output to look something like this, where every continuous variable is converted to buckets based on the quantiles (25%, 50%, 75%):

Age          Weight
(17.9,44.7]  (137,193]
(44.7,71.3]  (137,193]
(71.3,98.1]  (79.8,137]
(44.7,71.3]  (193,250]
(17.9,44.7]  (79.8,137]

Just pass your quantiles as the second argument to cut(), but add the 0 and 1 quantiles so that your cuts have lower and upper bounds (i.e. c(0, 0.25, 0.5, 0.75, 1), which can be written concisely as 0:4 / 4).
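As a small illustration of the breaks this produces, here is a sketch on a made-up vector x (not the data from the question). It also shows one thing worth checking: cut() intervals are open on the left by default, so the minimum value ends up as NA unless include.lowest = TRUE is passed.

```r
# Toy vector, chosen only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 0:4 / 4 expands to the five probabilities 0, 0.25, 0.5, 0.75, 1
probs <- 0:4 / 4

# Quantiles at those probabilities: minimum, the three quartiles, maximum
breaks <- quantile(x, probs)

# Default intervals are (lower, upper], so the minimum of x matches
# no interval and becomes NA
buckets <- cut(x, breaks)

# include.lowest = TRUE closes the first interval so the minimum is kept
buckets_closed <- cut(x, breaks, include.lowest = TRUE)

print(table(buckets, useNA = "ifany"))
print(table(buckets_closed, useNA = "ifany"))
```

Whether the minimum should be kept depends on your data; if you want every row assigned to a bucket, include.lowest = TRUE is probably what you need.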

Tidyverse version

library(dplyr)

as_tibble(df) %>% 
   mutate(across(everything(), .fns = function(x) cut(x, quantile(x, 0:4/4))))
#> # A tibble: 10,000 x 2
#>    Age     Weight   
#>    <fct>   <fct>    
#>  1 (18,38] (80,121] 
#>  2 (78,98] (121,165]
#>  3 (18,38] (121,165]
#>  4 (58,78] (208,250]
#>  5 (58,78] (165,208]
#>  6 (78,98] (80,121] 
#>  7 (38,58] (165,208]
#>  8 (58,78] (80,121] 
#>  9 (38,58] (165,208]
#> 10 (58,78] (121,165]
#> # ... with 9,990 more rows

Base R version

df$Age <- cut(df$Age, quantile(df$Age, 0:4/4))
df$Weight <- cut(df$Weight, quantile(df$Weight, 0:4/4))
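If there are more than a couple of columns, the same base R idea could be wrapped in a function and applied to every numeric column. The following is only a sketch: bucket_by_quartile is a hypothetical helper name, the data is regenerated with a seed so the example is reproducible, and include.lowest = TRUE is an added assumption so that the minimum of each column is not turned into NA.

```r
# Hypothetical helper: bucket one numeric vector into quartile intervals
bucket_by_quartile <- function(x) {
  cut(x, quantile(x, 0:4 / 4), include.lowest = TRUE)
}

# Example data in the same shape as the question's df
set.seed(1)
df <- data.frame(
  Age = sample(18:98, 1000, replace = TRUE),
  Weight = sample(80:250, 1000, replace = TRUE)
)

# Pick out the numeric columns and replace each with its bucketed factor;
# assigning into df[num_cols] keeps df a data.frame
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], bucket_by_quartile)

print(head(df))
```

Note that cut() raises an error if two quantiles coincide (the breaks must be unique), which can happen with heavily tied data.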

My santoku package has chop_quantiles():

library(santoku)
df[] <- apply(df, 2, chop_quantiles, 0:4/4)

Or even simpler:

df[] <- apply(df, 2, chop_equally, 4)

The empty brackets are a trick which keeps df as a data.frame.
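To see what the empty brackets do, here is a minimal base R sketch on a toy data frame. It uses lapply() rather than apply() and a simple doubling function in place of the santoku call, purely for illustration.

```r
# Toy data frame, for illustration only
df <- data.frame(a = 1:3, b = 4:6)

# Without the brackets, you just get the plain list that lapply() returns;
# assigning it to df would replace the data.frame with a list
as_list <- lapply(df, function(x) x * 2)

# With the brackets, the results are assigned into df's existing columns,
# so df stays a data.frame
df[] <- lapply(df, function(x) x * 2)

print(class(as_list))
print(class(df))
```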

If you want the raw values in your labels, you can do:

df[] <- apply(df, 2, chop_equally, 4, labels = lbl_intervals(raw = TRUE))
