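Using fct_collapse on a subset of data
I am trying to build a predictive model. One of my features is an identifier for US states and territories. The original list has 62 unique values, which I was able to reduce to 5 values using fct_collapse.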
library(tidyverse)
dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
'RI', 'SC', 'SD', 'TN', 'TX',
'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))
dat$census_region <- fct_collapse(dat$state,
northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
"AR","LA","OK","TX"),
west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
"UNITED STATES MINOR OUTLYING ISLANDS","VI"))
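tail(dat, 10)
# A tibble: 10 x 2
state | census_region |
---|---|
TX | south |
UNITED STATES MINOR OUTLYING ISLANDS | other |
UT | west |
VA | south |
VI | other |
VT | northeast |
WA | west |
WI | midwest |
WV | south |
WY | west |
I am now trying to validate the model, and the smaller data set does not contain all 62 unique state identifiers: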
dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK'))
Now, if I try to use fct_collapse on the smaller data set:
dat_2$census_region <- fct_collapse(dat_2$state,
northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
"AR","LA","OK","TX"),
west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
"UNITED STATES MINOR OUTLYING ISLANDS","VI"))
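I get:
Warning message:
Unknown levels in `f`: CT, RI, VT, PA, WI, IA, SD, DE, FL, GA, SC, VA, DC, WV, AL, TN, AR, TX, AZ, CO, UT, WY, AK, CA, HI, OR, WA, AA, AE, AP, AS, FM, GU, PR, UNITED STATES MINOR OUTLYING ISLANDS, VI
I have done something similar to group states and territories by Roman numeral, as defined by the Office of Management and Budget. My goal is to reduce the 62 dummy variables to something more manageable.
Question: Is there an option in the forcats package (more specifically fct_collapse) that will assign only the values that are found and skip the "unknown levels"?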
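One way to avoid the warning without relying on any special option is to trim each group down to the codes that are actually present before collapsing. The following is only a minimal sketch under stated assumptions: the `regions` list is a hypothetical, shortened stand-in for the full definitions above, and `state` is a made-up validation sample. It relies on `fct_collapse` accepting a spliced list through `!!!`.

```r
library(forcats)

# Hypothetical, shortened region definitions (not the full list from the question).
regions <- list(
  northeast = c("CT", "MA", "NY"),
  midwest   = c("IL", "OH", "WI"),
  other     = c("GU", "None", "PR")
)

# A small, made-up validation sample that lacks several of the codes listed above.
state <- factor(c("IL", "MA", "None", "NY", "OH"))

# Keep only the codes that are actual levels of `state`.
present <- lapply(regions, intersect, levels(state))

# Splice the filtered groups into fct_collapse(); no unknown codes are left to warn about.
census_region <- fct_collapse(state, !!!present)
```

Note that the warning in the question is only a warning: the levels that are present are still collapsed. Filtering first simply keeps the output clean.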
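You might consider approaching this differently and simply doing dat_2 |> left_join(dat), as below.
This takes the census_region from dat that matches each state in the smaller sample, and keeps it as a factor.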
library(tidyverse)
dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
'RI', 'SC', 'SD', 'TN', 'TX',
'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))
dat$census_region <- fct_collapse(dat$state,
northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
"AR","LA","OK","TX"),
west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
"UNITED STATES MINOR OUTLYING ISLANDS","VI"))
dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK'))
dat_2 |> left_join(dat)
#> Joining, by = "state"
#> # A tibble: 26 × 2
#> state census_region
#> <chr> <fct>
#> 1 ID west
#> 2 IL midwest
#> 3 IN midwest
#> 4 KS midwest
#> 5 KY south
#> 6 LA south
#> 7 MA northeast
#> 8 MD south
#> 9 ME northeast
#> 10 MH other
#> # … with 16 more rows
Created on 2022-05-19 by the reprex package (v2.0.1)
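One thing to watch with the join approach is a state code in the smaller sample that has no row in the lookup table: left_join keeps the row but leaves its census_region as NA. The sketch below is illustrative only; the `lookup` and `validation` tibbles and the code "ZZ" are hypothetical and are not part of the data above.

```r
library(dplyr)

# Hypothetical lookup table: one row per state code, region stored as a factor.
lookup <- tibble(
  state = c("IL", "MA", "NY", "OH"),
  census_region = factor(c("midwest", "northeast", "northeast", "midwest"))
)

# Hypothetical validation sample; "ZZ" is deliberately absent from the lookup.
validation <- tibble(state = c("OH", "ZZ", "MA"))

# Naming the key explicitly avoids the "Joining, by" message.
joined <- left_join(validation, lookup, by = "state")

# Rows with no match get NA in census_region; anti_join() lists them.
unmatched <- anti_join(validation, lookup, by = "state")
```

Checking `unmatched` before fitting or validating the model makes it easier to spot codes that need to be added to the region definitions.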