Keep the top factor levels based on grouped weights

Question

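Please see the snippet at the end of the post. I am essentially looking for a cleaner way to get the same result. I have a tibble in which column x is a character vector (I have not converted it to a factor, but that is effectively what it is). Each level appears multiple times and always has an associated numeric value (column w in the tibble). I want to keep the top 4 levels based on the sum of their associated w values and change all the others to an "other" level. I implemented it below, but I wonder whether there is a smarter way to do the same thing using, e.g., forcats.

Any suggestions are appreciated.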

library(tidyverse)

df <- tibble(x=rep(letters[1:10], 10), w=seq(100))

df
#> # A tibble: 100 × 2
#>    x         w
#>    <chr> <int>
#>  1 a         1
#>  2 b         2
#>  3 c         3
#>  4 d         4
#>  5 e         5
#>  6 f         6
#>  7 g         7
#>  8 h         8
#>  9 i         9
#> 10 j        10
#> # … with 90 more rows

###detect the first 4 factors based on the w column

ff <- df |>
    group_by(x) |>
    summarise(w_tot=sum(w)) |>
    ungroup() |>
    arrange(desc(w_tot)) |>
    slice(1:4) |>
    pull(x)

ff
#> [1] "j" "i" "h" "g"

## recode the data

df_new <- df |>
    mutate(w=if_else(x %in% ff, x, "other"))

df_new
#> # A tibble: 100 × 2
#>    x     w    
#>    <chr> <chr>
#>  1 a     other
#>  2 b     other
#>  3 c     other
#>  4 d     other
#>  5 e     other
#>  6 f     other
#>  7 g     g    
#>  8 h     h    
#>  9 i     i    
#> 10 j     j    
#> # … with 90 more rows

Created on 2022-09-16 by the reprex package (v2.0.1)
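For comparison, the two manual steps above can be condensed slightly with dplyr alone. This is only a sketch: `top4` and `x_lumped` are illustrative names that do not appear in the original post, and the recoded labels are written to a new column so that the weights in `w` are kept.

```r
library(tidyverse)

df <- tibble(x = rep(letters[1:10], 10), w = seq(100))

# Sum w per level of x, then keep the 4 levels with the largest totals.
top4 <- df |>
    count(x, wt = w, name = "w_tot") |>
    slice_max(w_tot, n = 4, with_ties = FALSE) |>
    pull(x)

# Recode into a new column (hypothetical name) instead of overwriting w.
df_new <- df |>
    mutate(x_lumped = if_else(x %in% top4, x, "other"))
```

`count(x, wt = w)` stands in for the `group_by()` / `summarise()` / `ungroup()` sequence, and `slice_max()` replaces `arrange(desc(...))` followed by `slice(1:4)`.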

Answer 1

It turns out I can pass a weights argument to fct_lump_n(), so this works:

library(tidyverse)
library(forcats)



df <- tibble(x=rep(letters[1:10], 10), w=seq(100))

df
#> # A tibble: 100 × 2
#>    x         w
#>    <chr> <int>
#>  1 a         1
#>  2 b         2
#>  3 c         3
#>  4 d         4
#>  5 e         5
#>  6 f         6
#>  7 g         7
#>  8 h         8
#>  9 i         9
#> 10 j        10
#> # … with 90 more rows

###detect the first 4 factors based on the w column

ff <- df |>
    group_by(x) |>
    summarise(w_tot=sum(w)) |>
    ungroup() |>
    arrange(desc(w_tot)) |>
    slice(1:4) |>
    pull(x)

ff
#> [1] "j" "i" "h" "g"

## recode the data

df_new <- df |>
    mutate(w=if_else(x %in% ff, x, "other"))

df_new
#> # A tibble: 100 × 2
#>    x     w    
#>    <chr> <chr>
#>  1 a     other
#>  2 b     other
#>  3 c     other
#>  4 d     other
#>  5 e     other
#>  6 f     other
#>  7 g     g    
#>  8 h     h    
#>  9 i     i    
#> 10 j     j    
#> # … with 90 more rows

df_new2 <- df |>
    mutate(x2 = fct_lump_n(x, n = 4, w = w))


df_new2
#> # A tibble: 100 × 3
#>    x         w x2   
#>    <chr> <int> <fct>
#>  1 a         1 Other
#>  2 b         2 Other
#>  3 c         3 Other
#>  4 d         4 Other
#>  5 e         5 Other
#>  6 f         6 Other
#>  7 g         7 g    
#>  8 h         8 h    
#>  9 i         9 i    
#> 10 j        10 j    
#> # … with 90 more rows

Created on 2022-09-16 by the reprex package (v2.0.1)
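Two small differences from the manual version are worth noting: `fct_lump_n()` returns a factor, and it labels the lumped level "Other" by default. If the lowercase "other" label and a character column are wanted, a sketch along these lines should reproduce the manual result. The names `df_new3`, `x_lumped`, `top4` and `manual` are illustrative and not from the original post.

```r
library(tidyverse)

df <- tibble(x = rep(letters[1:10], 10), w = seq(100))

df_new3 <- df |>
    mutate(
        # Lump by summed weight; other_level overrides the default "Other".
        x_lumped = fct_lump_n(x, n = 4, w = w, other_level = "other"),
        # Optional: convert the factor back to a character vector.
        x_lumped = as.character(x_lumped)
    )

# Compare against the manual recoding, using the top 4 levels
# reported in the `ff` output above.
top4 <- c("j", "i", "h", "g")
manual <- if_else(df$x %in% top4, df$x, "other")
identical(df_new3$x_lumped, manual)
```

Keeping the result as a factor is usually fine for plotting or modelling; the `as.character()` step is only needed when the column must stay a plain character vector.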
