Factor Levels and Modelling in R
The following code runs a very simple lm() and tries to summarise the results (factor levels, coefficients) in a small data frame:
df <- data.frame(star_sign = c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces"),
                 y = c(1.1, 1.2, 1.4, 1.3, 1.8, 1.6, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3),
                 stringsAsFactors = TRUE) # needed in R >= 4.0, where strings are no longer converted to factors by default
levels(df$star_sign) #alphabetical order
# fit a simple linear model
my_lm <- lm(y ~ 1 + star_sign, data = df)
summary(my_lm) # intercept is based on first level of factor, aquarius
# I want the levels to work properly 1..12 = Aries, Taurus...Pisces so I'm going to redefine the factor levels
df$my_levels <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")
df$star_sign <- factor(df$star_sign, levels = df$my_levels)
my_lm <- lm(y ~ 1 + star_sign, data = df)
summary(my_lm) # intercept is based on first level of factor which is now Aries
# but for my model fit I want the reference level to be Virgo (because reasons)
df$star_sign_2 <- relevel(df$star_sign, ref = "Virgo")
my_lm <- lm(y ~ 1 + star_sign_2, data = df)
summary(my_lm)
df_results <- data.frame(factor_level = names(my_lm$coefficients), coeff = my_lm$coefficients )
# tidy up
rownames(df_results) <- 1:12
df_results$factor_level <- as.factor(gsub("star_sign_2", "", df_results$factor_level))
# change label of "(Intercept)" to "Virgo"
df_results$factor_level <- plyr::revalue(df_results$factor_level, c("(Intercept)" = "Virgo"))
levels(df_results$factor_level) # the levels are alphabetical + Virgo at the front (not same as display order from lm)
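The factor levels are in the wrong order: I want to sort df_results so that the star signs appear in the same order as they did originally (Aries, Taurus ... Pisces), as captured in the df$my_levels column. I don't think I have a good understanding of how to manipulate factors and their labels/levels, so I'm struggling to work out how to do this.
This is also a long and clumsy piece of code. Is there a more concise way to do this sort of thing?
Thank you.
(PS: mathematically the model is obviously trivial, but that's fine for these purposes. I'm only interested in how to manipulate the output.)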
Here's how I would do the model coefficient extraction using the broom package (and dplyr):
library(broom)
library(dplyr)
broom::tidy(my_lm) %>%
mutate(term = sub("star_sign_2", "", term),
term = ifelse(term == "(Intercept)", "Virgo", term),
term = factor(term, levels = unique(term)))
# A tibble: 12 x 5
   term        estimate std.error statistic p.value
   <fct>          <dbl>     <dbl>     <dbl>   <dbl>
 1 Virgo          1.6         NaN       NaN     NaN
 2 Aries         -0.500       NaN       NaN     NaN
 3 Taurus        -0.4         NaN       NaN     NaN
 4 Gemini        -0.2         NaN       NaN     NaN
 5 Cancer        -0.300       NaN       NaN     NaN
 6 Leo            0.20        NaN       NaN     NaN
 7 Libra         -0.2         NaN       NaN     NaN
 8 Scorpio       -0.3         NaN       NaN     NaN
 9 Sagittarius   -0.4         NaN       NaN     NaN
10 Capricorn     -0.500       NaN       NaN     NaN
11 Aquarius      -0.1         NaN       NaN     NaN
12 Pisces        -0.300       NaN       NaN     NaN
Setting levels = unique(term) is a nice trick for putting the levels in the order in which they appear.
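To make the difference concrete, here is a minimal base R sketch. The short `term` vector is a made-up stand-in for the column produced by broom::tidy() above, not the real output.

```r
# Hypothetical toy vector standing in for the `term` column
term <- c("Virgo", "Aries", "Taurus", "Gemini")

# Default behaviour: factor() sorts the levels alphabetically
default_levels <- levels(factor(term))

# With levels = unique(term), the levels follow order of first appearance
appearance_levels <- levels(factor(term, levels = unique(term)))

print(default_levels)
print(appearance_levels)
```

Because the order comes from the data itself, this only gives the order you want if the rows are already arranged that way.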
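My other suggestion would be to keep a vector of the levels, in the order you want, outside the data frame, and then refer to that vector whenever you need to establish the order. For example,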
astro_order = c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")
# messy but effective:
astro_order_virgo1 = factor(astro_order, levels = astro_order) %>%
relevel("Virgo") %>%
levels()
Then you can replace the last step above with term = factor(term, levels = astro_order_virgo1).
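Keeping the level order separate like this is nice because (a) it doesn't change if you reorder the data frame, and (b) it works just as well if your data frame is long and contains repeated factor levels.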
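Both points can be checked with a small base R sketch. The `toy` data frame below is invented for illustration (shuffled rows, repeated signs), and the `setdiff()` line is just one possible pipe-free way to build the Virgo-first vector, not the answer's own code.

```r
astro_order <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
                 "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

# Hypothetical data: rows out of order, some signs repeated
toy <- data.frame(
  star_sign = c("Pisces", "Leo", "Aries", "Leo", "Virgo", "Aries"),
  y         = c(1.3, 1.8, 1.1, 1.7, 1.6, 1.0)
)

# The external vector fixes the level order regardless of row order or repeats
toy$star_sign <- factor(toy$star_sign, levels = astro_order)

# Sorting by the factor now follows astro_order, not the alphabet
toy_sorted <- toy[order(toy$star_sign), ]
print(toy_sorted)

# One way to move Virgo to the front without a pipe:
# setdiff() keeps the remaining signs in their original order
astro_order_virgo1 <- c("Virgo", setdiff(astro_order, "Virgo"))
print(astro_order_virgo1)
```

Note that `levels = unique(...)` on `toy` would have put Pisces first, which is why the external vector is the safer choice once the data are no longer in a known order.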
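If I understand what you need to do, this is quite simple. Just add the following code to the end of your script. I'd also encourage you to dig into dplyr or the tidyverse. Let me know if you have any questions :)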
## ADDED:
# create an id column that records the original row order of df
df$id <- seq_len(nrow(df))
library(dplyr)
# left join (an inner or right join would give the same result in this case)
df_results <- left_join(df_results, df, by = c("factor_level" = "star_sign_2"))
df_results <- df_results %>% arrange(id)
# optionally, keep only the desired columns
df_results <- df_results %>% select(factor_level, coeff)
head(df_results)
  factor_level coeff
1        Aries  -0.5
2       Taurus  -0.4
3       Gemini  -0.2
4       Cancer  -0.3
5          Leo   0.2
6        Virgo   1.6
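For comparison, the same reordering can be sketched in base R without a join, by making `factor_level` a factor whose levels are the desired order and then sorting on it. This is an alternative under the question's own data, not part of the answer above. The standalone `my_levels` vector is an assumption (the question stores it as a column of `df`).

```r
# Desired display order (assumed to live outside the data frame here)
my_levels <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
               "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

df <- data.frame(
  star_sign = factor(my_levels, levels = my_levels),
  y = c(1.1, 1.2, 1.4, 1.3, 1.8, 1.6, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3)
)

# Use Virgo as the reference level for the fit only
df$star_sign_2 <- relevel(df$star_sign, ref = "Virgo")
my_lm <- lm(y ~ star_sign_2, data = df)

# Strip the variable-name prefix and relabel the intercept as the reference level
coefs <- coef(my_lm)
labels <- sub("star_sign_2", "", names(coefs), fixed = TRUE)
labels[labels == "(Intercept)"] <- "Virgo"

# Levels come from my_levels, so ordering by the factor restores Aries..Pisces
df_results <- data.frame(
  factor_level = factor(labels, levels = my_levels),
  coeff = unname(coefs)
)
df_results <- df_results[order(df_results$factor_level), ]
rownames(df_results) <- NULL
print(df_results)
```

As in the output above, the Virgo row holds the intercept (Virgo's own fitted value), while every other row is that sign's difference from Virgo.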