R: factor levels, recode the rest to 'other'

Question

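I use factors fairly infrequently and generally find them comprehensible, but I am often fuzzy about the details of specific operations. Currently, I am coding/collapsing categories with few observations into "other" and am looking for a quick way to do this. I have a variable with perhaps 20 levels, but am interested in collapsing a bunch of them into one.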

# 500 rows: one employee count and one NAICS code per row
data <- data.frame(employees = sample.int(1000, 500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340',
                                    '621391','621399','621410','621420','621491','621492','621493',
                                    '621498','621511','621512','621610','621910','621991','621999'),
                                  500, replace = TRUE))

Here are the levels that I am interested in, and their labels in separate vectors.

#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
         '621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')

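I could use a factor() call and enumerate all the levels, classifying a category as "other" each time it has few observations.

Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are correctly coded and everything else gets recoded as other?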

Answer 1

I think the easiest way is to relabel all the naics values that are not in the top 8 to a special value.

data$naics[!(data$naics %in% top8)] = -99

Then you can use the exclude option when turning it into a factor:

factor(data$naics, exclude=-99)
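
As far as I can tell, values removed via exclude end up as NA rather than as a level named "other". If an explicit "other" level with descriptive labels is wanted, one base R possibility is sketched below. The vectors x, top and top_desc are shortened, illustrative stand-ins for data$naics, top8 and top8_desc, not part of the original answer.

```r
# Illustrative subset of the codes and labels from the question
top      <- c("621111", "621210", "621399")
top_desc <- c("Offices of physicians",
              "Offices of dentists",
              "Offices of all other miscellaneous health practitioners")

x <- c("621111", "621999", "621210", "621491", "621111")

# Relabel everything outside the kept set, then declare levels and labels together
x[!(x %in% top)] <- "other"
f <- factor(x, levels = c(top, "other"), labels = c(top_desc, "other"))

table(f)
```

Listing "other" last in levels keeps it at the end of tables and plot legends.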

Answer 2

You can use forcats::fct_other():

library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')

Or use fct_other() inside a dplyr::mutate() call:

library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other')) 

data %>% head(10)
   employees  naics
1        420  other
2        264  other
3        189  other
4        157 621610
5        376 621610
6        236  other
7        658 621320
8        959 621320
9        216  other
10       156  other

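Note that if the argument other_level is not set, the collapsed level defaults to 'Other' (uppercase 'O').

Conversely, if there are only a few levels you want converted to 'other', you can use the argument drop instead: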

data %>%  
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>% 
  head(10)

   employees  naics keep_fct drop_fct
1        474 621491    other   621491
2        805 621111   621111    other
3        434 621910    other   621910
4        845 621111   621111    other
5        243 621340    other   621340
6        466 621493    other   621493
7        369 621111   621111    other
8         57 621493    other   621493
9        144 621491    other   621491
10       786 621910    other   621910

dplyr also has recode_factor(), where you can set the .default argument to other, but with a larger number of levels to recode, as in this example, it can get tedious:

data %>% 
   mutate(naics = recode_factor(naics,
     `621111` = '621111', `621210` = '621210', `621399` = '621399', `621610` = '621610', `621330` = '621330',
     `621310` = '621310', `621511` = '621511', `621420` = '621420', `621320` = '621320', .default = 'other'))
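
The one-by-one listing can be avoided by splicing a named vector into recode_factor() with !!!, a pattern shown in the dplyr documentation. This is a minimal sketch, assuming dplyr is installed. The names codes, level_key and the shortened top vector are illustrative and not part of the original answer.

```r
library(dplyr)

top   <- c("621111", "621210", "621399")   # shortened, illustrative subset of top8
codes <- c("621111", "621999", "621210", "621399", "621491")

# Names are the old values, elements are the new values (identity mapping here;
# descriptive labels could be used as the elements instead)
level_key <- setNames(top, top)

# Anything not named in level_key falls through to .default
recoded <- recode_factor(codes, !!!level_key, .default = "other")
as.character(recoded)
```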

Answer 3

Late entry

Here is a wrapper for plyr::mapvalues that adds a remaining argument (your "other"):

library(plyr)

Mapvalues <- function(x, from, to, warn_missing= TRUE, remaining = NULL){
  if(!is.null(remaining)){
    therest <- setdiff(x, from)
    from <- c(from, therest)
    to <- c(to, rep_len(remaining, length(therest)))
  }
  mapvalues(x, from, to, warn_missing)
}
# replace the remaining values with "other"
Mapvalues(data$naics, top8, top8_desc,remaining = 'other')
# leave the remaining values alone
Mapvalues(data$naics, top8, top8_desc)

Answer 4

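I have written a function that may be useful to others. I first check, in relative terms, whether a level accounts for less than a share mp of the rows (mp = 0.01 means 1%). After that, I limit the maximum number of levels to ml.

ds is the data set, of type data.frame; I do this for all columns listed in cat_var_names.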

cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])

recodeLevels <- function (ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
  # collapse the less frequent levels of each factor into "other"
  n <- nrow(ds)
  # keep levels that account for more than a share mp of the rows
  for (i in var_list){
    keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }

  # keep at most the ml most frequent levels
  for (i in var_list){
    counts <- sort(table(ds[[i]]), decreasing = TRUE)
    keep <- names(counts)[seq_len(min(ml, length(counts)))]
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }
  return(ds)
}
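
The first step of that function can be tried on a single vector. The helper below is a hypothetical, simplified sketch (lump_rare and min_prop are illustrative names, not part of the answer above) that collapses every level whose share of the observations is at or below a threshold:

```r
# Collapse levels whose share of observations is <= min_prop into one level
lump_rare <- function(f, min_prop = 0.01, other = "other") {
  f <- as.factor(f)
  props <- prop.table(table(f))
  rare <- names(props)[props <= min_prop]
  # Assigning the same label to several levels merges them
  levels(f)[levels(f) %in% rare] <- other
  f
}

x <- c(rep("a", 50), rep("b", 45), rep("c", 3), rep("d", 2))
f <- lump_rare(x, min_prop = 0.05)
table(f)
```

Here "c" and "d" each fall below 5% of the 100 observations, so both are merged into "other".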

Answer 1
6 accepted 2013-03-20 20:23:43

Answer 2
4 2018-04-06 21:15:13

Answer 3
3 2013-08-21 01:40:44

Answer 4
0 2013-08-20 13:51:31
