Editing (recoding, collapsing, ordering) factor levels using regular expression matching

Question

I find manipulating factor variables in R overly complicated. Things I frequently want to do when cleaning factors include:

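Reordering levels - not just setting the reference category, but also putting all levels into a logical (non-alphabetical) order for summary tables. x <- factor(x, levels = new.order)
Recoding/renaming factor levels - to simplify names and/or merge several categories into one group. For one-to-one recoding, use levels(x) <- new.levels or plyr::revalue. car::recode can perform several many-to-one matches in a single statement, but does not support regular expression matching.
Dropping levels - not just removing unused levels, but also setting certain levels to missing (e.g. those holding error codes). x <- factor(as.character(x), exclude = drop.levels)
Adding levels - to show categories with zero counts.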
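
The four operations listed above can be sketched in base R as follows. The example data (including the "999" error code and the "very high" level) are made up for illustration; new.order and drop.levels are the placeholder names used in the list.

```r
# Invented example data: survey-style responses with an error code ("999").
x <- factor(c("low", "high", "medium", "999", "low", "high"))
levels(x)  # levels are sorted alphabetically by default

# 1. Reorder levels into a logical order (the first level is the reference).
new.order <- c("low", "medium", "high", "999")
x1 <- factor(x, levels = new.order)

# 2. Recode/collapse: assigning a named list to levels() maps old levels
#    (the values) to new levels (the names); several old levels can share
#    one new level.
x2 <- x1
levels(x2) <- list(Low = "low", Other = c("medium", "high"), Error = "999")

# 3. Drop levels: values listed in `exclude` become NA and lose their level.
drop.levels <- "999"
x3 <- factor(as.character(x), exclude = drop.levels)

# 4. Add a level with zero count so that it still shows up in tables.
x4 <- factor(x3, levels = c(levels(x3), "very high"))
table(x4)
```

Each step needs a different idiom, and none of them accepts a regular expression, which is the inconsistency the question is about.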

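Ideally there would be a single function that does all of the above at once, allows fuzzy (regular expression) matching when recoding and dropping levels, can be used inside other functions (e.g. lapply or dplyr::mutate), and has a simple, consistent syntax.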
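
To make the requirement concrete, here is a rough base-R sketch of such a function. The name recode_regex and its arguments are hypothetical, invented for this illustration; it is not the function from the answer and covers only recoding, collapsing, ordering and dropping.

```r
# Hypothetical helper (not from any package): recode factor levels by regex.
# `patterns`: named character vector; names are the new levels, values are
#             regular expressions matched against the old levels.
# `exclude`:  optional regular expression; matching levels are set to NA.
recode_regex <- function(x, patterns, exclude = NULL) {
  x <- as.factor(x)
  old <- levels(x)
  new <- old
  matched <- rep(FALSE, length(old))
  for (i in seq_along(patterns)) {
    # An old level is relabelled by the first pattern that matches it.
    hit <- grepl(patterns[[i]], old) & !matched
    new[hit] <- names(patterns)[i]
    matched <- matched | hit
  }
  if (!is.null(exclude)) new[grepl(exclude, old)] <- NA_character_
  # New levels come first, in the order given, then any untouched old levels.
  ord <- unique(c(names(patterns), new))
  ord <- ord[!is.na(ord) & ord %in% new]
  factor(new[as.integer(x)], levels = ord)
}

x <- c("dogfish", "rabbit", "catfish", "mouse", "dirt")
y <- recode_regex(x, c(Sea = "fish", Land = "rab|mou"), exclude = "^di")

# Because it is an ordinary function, it can be passed to lapply().
animals <- list(a = x, b = c("starfish", "mouse"))
res <- lapply(animals, recode_regex, patterns = c(Sea = "fish", Land = "rab|mou"))
```

The patterns are matched against the levels rather than against every element, so the cost depends on the number of distinct levels, not on the length of the vector.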

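I have posted my best attempt at this as an answer below, but please let me know if I have missed a function that already exists or if the code could be improved.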

Edit

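I am already aware of the forcats package, subtitled "Tools for Working with Categorical Variables (Factors)". It has many options for reordering levels (fct_infreq, fct_reorder, fct_relevel, ...), recoding/grouping levels (fct_recode, fct_lump, fct_collapse), dropping levels (fct_recode), and adding levels (fct_expand). However, there are no plans to support regular expression matching (https://github.com/tidyverse/forcats/issues/214).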
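
One possible workaround, assuming forcats is installed, is to resolve the regular expressions against the existing levels with base R's grep and pass the resulting exact level names to forcats. This is only a sketch with made-up example data, not a feature of forcats itself.

```r
library(forcats)

f <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))

# Resolve the regular expressions against the existing levels first ...
sea  <- grep("fish", levels(f), value = TRUE)     # "catfish" "dogfish"
land <- grep("rab|mou", levels(f), value = TRUE)  # "mouse" "rabbit"

# ... then hand the exact level names to forcats.
g <- fct_collapse(f, Sea = sea, Land = land)
g <- fct_expand(g, "Air")  # add a level with zero count
table(g)
```

This works, but the matching step has to be written out separately for every group, which is the repetition a single regex-aware function would avoid.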

Answer 1

Edit: A few years later, I have put an xfactor function on GitHub that does the above. It is still a work in progress, so please let me know about any bugs, etc.

devtools::install_github("jwilliman/xfactor")


library(xfactor)

# Create example factor
x <- xfactor(c("dogfish", "rabbit","catfish", "mouse", "dirt"))
levels(x)
#> [1] "catfish" "dirt"    "dogfish" "mouse"   "rabbit"

# Factor levels can be reordered by passing an unnamed vector to the levels
# argument. Levels not included in that vector are moved to the end, or
# dropped if exclude = TRUE.
xfactor(x, levels = c("mouse", "rabbit"))
#> [1] dogfish rabbit  catfish mouse   dirt   
#> Levels: mouse rabbit catfish dirt dogfish

xfactor(x, levels = c("mouse", "rabbit"), exclude = TRUE)
#> [1] <NA>   rabbit <NA>   mouse  <NA>  
#> Levels: mouse rabbit

# Factor levels can be recoded, collapsed, and ordered by passing a named
# vector to the levels argument. The vector names are the new factor levels
# and the vector values are regular expressions matching the old levels.
# Duplicated new levels will be collapsed.

xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou"))
#> [1] Sea  Land Sea  Land dirt
#> Levels: Sea Land dirt

# Factor levels can be dropped by passing a regular expression (or a vector
# of them) to the exclude argument

xfactor(x, exclude = "fish")
#> [1] <NA>   rabbit <NA>   mouse  dirt  
#> Levels: dirt mouse rabbit

# The function will work within other functions

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(n = 1:5, x)
df %>%
  mutate(y = xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou", "Air"), exclude = "di"))
#>   n       x    y
#> 1 1 dogfish  Sea
#> 2 2  rabbit Land
#> 3 3 catfish  Sea
#> 4 4   mouse Land
#> 5 5    dirt <NA>

Created on 2020-04-16 by the reprex package (v0.3.0)

Answer 1 (score 2, accepted, 2016-06-13 23:59:09)
