
Using fct_collapse on a subset of data

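I am trying to build a predictive model. One of my features is an identifier for US states and territories. The original list has 62 unique values, which I can reduce to 5 values using fct_collapse.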

library(tidyverse)  # provides tibble() and forcats::fct_collapse()

dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
                        'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
                        'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
                        'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                        'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                        'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
                        'RI', 'SC', 'SD', 'TN', 'TX', 
                        'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
                        'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))
dat$census_region <- fct_collapse(dat$state,
    northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
    midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
    south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
         "AR","LA","OK","TX"),
    west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
    other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
         "UNITED STATES MINOR OUTLYING ISLANDS","VI"))

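tail(dat, 10)

# A tibble: 10 x 2
   state                                census_region
   <chr>                                <fct>
 1 TX                                   south
 2 UNITED STATES MINOR OUTLYING ISLANDS other
 3 UT                                   west
 4 VA                                   south
 5 VI                                   other
 6 VT                                   northeast
 7 WA                                   west
 8 WI                                   midwest
 9 WV                                   south
10 WY                                   west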

I am now trying to validate the model, and the smaller dataset does not contain all 62 unique state identifiers:

dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
                          'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                          'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                          'None', 'NV', 'NY', 'OH', 'OK'))

Now, if I try to use fct_collapse on the smaller dataset:

dat_2$census_region <- fct_collapse(dat_2$state,
    northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
    midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
    south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
        "AR","LA","OK","TX"),
    west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
    other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
        "UNITED STATES MINOR OUTLYING ISLANDS","VI"))

I get:

Warning message:
Unknown levels in `f`: CT, RI, VT, PA, WI, IA, SD, DE, FL, GA, SC, VA, DC, WV, AL, TN, AR, TX, AZ, CO, UT, WY, AK, CA, HI, OR, WA, AA, AE, AP, AS, FM, GU, PR, UNITED STATES MINOR OUTLYING ISLANDS, VI

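I have done something similar to group states and territories by Roman numeral, as defined by the Office of Management and Budget. My goal is to reduce the 62 dummy variables to something more manageable.

Question: is there an option in the forcats package (more specifically in fct_collapse) that assigns only the values that are found and skips the "unknown levels"?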
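One way to get this behaviour from fct_collapse itself, sketched under assumptions: I am not aware of a dedicated argument that skips unknown levels, but each group can be trimmed to the levels actually present before collapsing. The names `regions`, `state_small` and `present` are made up for this illustration, and the sample states are a shortened stand-in for the smaller dataset.

```r
library(forcats)

# Region definitions, copied from the question.
regions <- list(
  northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
  midwest   = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
  south     = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
                "AR","LA","OK","TX"),
  west      = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
  other     = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
                "UNITED STATES MINOR OUTLYING ISLANDS","VI")
)

# Hypothetical small sample standing in for the validation data.
state_small <- c("ID", "IL", "KY", "MA", "MH", "None")
f <- factor(state_small)

# Keep only the states in each group that occur as levels of f,
# so fct_collapse is never handed a level it does not know.
present <- lapply(regions, intersect, levels(f))

# Splice the trimmed groups into fct_collapse's dots with !!!.
region_small <- fct_collapse(f, !!!present)
```

The resulting factor only carries the regions that occur in the sample, so its levels may differ from those of the full dataset.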

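You could consider solving this in a different way, by simply running dat_2 |> left_join(dat) as below.

This takes the census_region from dat that matches each state in the smaller sample, and keeps it as a factor.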

library(tidyverse)

dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
                        'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
                        'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
                        'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                        'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                        'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
                        'RI', 'SC', 'SD', 'TN', 'TX', 
                        'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
                        'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))

dat$census_region <- fct_collapse(dat$state,
                                  northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
                                  midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
                                  south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
                                            "AR","LA","OK","TX"),
                                  west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
                                  other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
                                            "UNITED STATES MINOR OUTLYING ISLANDS","VI"))

dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
                          'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                          'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                          'None', 'NV', 'NY', 'OH', 'OK'))

dat_2 |> left_join(dat)
#> Joining, by = "state"
#> # A tibble: 26 × 2
#>    state census_region
#>    <chr> <fct>        
#>  1 ID    west         
#>  2 IL    midwest      
#>  3 IN    midwest      
#>  4 KS    midwest      
#>  5 KY    south        
#>  6 LA    south        
#>  7 MA    northeast    
#>  8 MD    south        
#>  9 ME    northeast    
#> 10 MH    other        
#> # … with 16 more rows

Created on 2022-05-19 by the reprex package (v2.0.1)
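The same lookup idea can also be sketched in base R with a named vector, which avoids the join. This is only an illustration: `regions`, `lookup`, `to_region` and `state_small` are hypothetical names, and the sample states are a shortened stand-in for dat_2.

```r
# Region definitions, copied from the question.
regions <- list(
  northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
  midwest   = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
  south     = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
                "AR","LA","OK","TX"),
  west      = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
  other     = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
                "UNITED STATES MINOR OUTLYING ISLANDS","VI")
)

# Flatten the list into a named vector: names are states, values are regions.
lookup <- setNames(rep(names(regions), lengths(regions)),
                   unlist(regions, use.names = FALSE))

# Hypothetical helper: map states to regions, always with all five levels.
to_region <- function(state) {
  factor(unname(lookup[as.character(state)]), levels = names(regions))
}

state_small <- c("ID", "IL", "KY", "MA", "MH", "None")
region_small <- to_region(state_small)
```

Because the levels are fixed to all five regions, the full and the smaller dataset end up with identical factor levels even when a region is absent from the sample. A state that is missing from the lookup comes back as NA.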
