Using cut() and quantile() to bucket continuous columns in R
I have the following generated df with age and weight:
df = data.frame(
Age = sample(18:98, 1000, replace = TRUE),
Weight = sample(80:250, 10000, replace = TRUE)
)
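I want to transform the continuous columns by creating buckets based on quantiles (25%, 50%, 75%). The quantiles themselves can be computed like this: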
> quantile(df$Age, probs = c(0.25,0.5,0.75))
25% 50% 75%
39 58 78
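However, I want to use the cut function with these quantiles (25%, 50%, 75%).
How can I do that? I would like the output to look something like this, where every continuous variable is converted into buckets based on the quantiles (25%, 50%, 75%):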
Age Weight
(17.9,44.7] (137,193]
(44.7,71.3] (137,193]
(71.3,98.1] (79.8,137]
(44.7,71.3] (193,250]
(17.9,44.7] (79.8,137]
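Just pass your quantiles as the second argument of cut, but add the 0 and 1 quantiles so that your cuts have a lower and an upper bound (i.e. c(0, 0.25, 0.5, 0.75, 1), which can be written concisely as 0:4 / 4). Note that cut builds left-open intervals by default, so the smallest value in each column becomes NA unless you also pass include.lowest = TRUE.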
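A minimal sketch of how cut() treats the smallest value when the breaks come from quantile(). The seed and the variable names (age, breaks, default_bins, closed_bins) are assumptions made for illustration and are not part of the original answer.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
age <- sample(18:98, 1000, replace = TRUE)

# Five break points: min, 25%, 50%, 75%, max
breaks <- quantile(age, probs = 0:4 / 4)

# Default intervals are (lo, hi], so values equal to min(age) fall outside
default_bins <- cut(age, breaks)

# include.lowest = TRUE closes the first interval on the left: [lo, hi]
closed_bins <- cut(age, breaks, include.lowest = TRUE)

sum(is.na(default_bins))  # equals the number of values equal to min(age)
sum(is.na(closed_bins))   # 0
```

Whether the minimum should be kept or dropped depends on the analysis; include.lowest = TRUE is the usual choice when every row needs a bucket.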
Tidyverse version
library(dplyr)
as_tibble(df) %>%
mutate(across(everything(), .fns = function(x) cut(x, quantile(x, 0:4/4))))
#> # A tibble: 10,000 x 2
#> Age Weight
#> <fct> <fct>
#> 1 (18,38] (80,121]
#> 2 (78,98] (121,165]
#> 3 (18,38] (121,165]
#> 4 (58,78] (208,250]
#> 5 (58,78] (165,208]
#> 6 (78,98] (80,121]
#> 7 (38,58] (165,208]
#> 8 (58,78] (80,121]
#> 9 (38,58] (165,208]
#> 10 (58,78] (121,165]
#> # ... with 9,990 more rows
Base R version
df$Age <- cut(df$Age, quantile(df$Age, 0:4/4))
df$Weight <- cut(df$Weight, quantile(df$Weight, 0:4/4))
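For data frames with many columns, the two assignments above can be generalized. The following is a rough sketch, not part of the original answer: bucket_by_quartile is a hypothetical helper name, and the use of unique() and include.lowest = TRUE are assumptions about what is usually wanted.

```r
# Hypothetical helper: bucket one numeric vector by its quartiles
bucket_by_quartile <- function(x) {
  # unique() guards against tied quantiles, which would make cut() fail
  # with "'breaks' are not unique"; constant columns are not handled here
  breaks <- unique(quantile(x, probs = 0:4 / 4, na.rm = TRUE))
  cut(x, breaks, include.lowest = TRUE)
}

df <- data.frame(
  Age = sample(18:98, 1000, replace = TRUE),
  Weight = sample(80:250, 1000, replace = TRUE)
)

# Apply the helper to numeric columns only; lapply() keeps factor results
is_num <- vapply(df, is.numeric, logical(1))
df[is_num] <- lapply(df[is_num], bucket_by_quartile)

str(df)
```

Using lapply() on the selected columns keeps df a data.frame and leaves any non-numeric columns untouched.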
My santoku package has chop_quantiles():
library(santoku)
df[] <- apply(df, 2, chop_quantiles, 0:4/4)
Or even simpler:
df[] <- apply(df, 2, chop_equally, 4)
The empty brackets are a trick to keep df as a data.frame.
If you want the raw values in the labels, you can do the following:
df[] <- apply(df, 2, chop_equally, 4, labels = lbl_intervals(raw = TRUE))