Bind rows of data frames with some factor columns
I would like to create a souped-up version of dplyr::bind_rows that avoids the "Unequal factor levels: coercing to character" warning when the data frames being combined contain factor columns (and possibly non-factor columns as well). Here is an example:
df1 <- dplyr::data_frame(age = 1:3, gender = factor(c("male", "female", "female")), district = factor(c("north", "south", "west")))
df2 <- dplyr::data_frame(age = 4:6, gender = factor(c("male", "neutral", "neutral")), district = factor(c("central", "north", "east")))
Then bind_rows_with_factor_columns(df1, df2) should return (without warnings):
dplyr::data_frame(
  age = 1:6,
  gender = factor(c("male", "female", "female", "male", "neutral", "neutral")),
  district = factor(c("north", "south", "west", "central", "north", "east"))
)
Here is what I have so far:
bind_rows_with_factor_columns <- function(...) {
  # Collect the inputs into a list so purrr::map() iterates over data frames
  dfs <- list(...)
  factor_columns <- purrr::map(dfs, function(df) {
    colnames(dplyr::select_if(df, is.factor))
  })
  if (length(unique(factor_columns)) > 1) {
    stop("All factor columns in dfs must have the same column names")
  }
  # Convert factors to character so bind_rows() has nothing to coerce
  df_list <- purrr::map(dfs, function(df) {
    purrr::map_if(df, is.factor, as.character) %>% dplyr::as_data_frame()
  })
  # Bind, then turn the original factor columns back into factors
  dplyr::bind_rows(df_list) %>%
    purrr::map_at(factor_columns[[1]], as.factor) %>%
    dplyr::as_data_frame()
}
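I'm wondering whether anyone has ideas on how to incorporate the forcats package, possibly to avoid having to coerce factors to characters, or has any suggestions for improving performance in general while keeping the same functionality (I'd like to stick to tidyverse syntax). Thanks!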
Answering my own question, based on a great solution from a friend:
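bind_rows_with_factor_columns <- function(...) {
  # pmap_df() walks the data frames column by column (matched by position)
  purrr::pmap_df(list(...), function(...) {
    cols_to_bind <- list(...)
    if (all(purrr::map_lgl(cols_to_bind, is.factor))) {
      # Recent forcats versions need the list spliced in with !!!
      forcats::fct_c(!!!cols_to_bind)
    } else {
      unlist(cols_to_bind)
    }
  })
}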
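The key step in the function above is forcats::fct_c, which concatenates factors and combines their levels. A minimal sketch of that step on its own, using the gender columns from the question's example (the variable names f1, f2, combined and combined_from_list are only for illustration):

```r
library(forcats)

# The gender columns of df1 and df2 from the question
f1 <- factor(c("male", "female", "female"))
f2 <- factor(c("male", "neutral", "neutral"))

# fct_c() concatenates the factors and takes the union of their levels
combined <- fct_c(f1, f2)

# When the factors are held in a list, splice the list in with !!!
combined_from_list <- fct_c(!!!list(f1, f2))

levels(combined)
```

Because no character conversion is involved, the result is a factor whose levels include every level of both inputs.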
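It may be simpler to use dplyr::bind_rows with the warning suppressed, and then convert all of the resulting character columns back to factors. This has the advantage of binding the data frames by column name (allowing columns in a different order, as well as extra columns), and it still works when a factor variable is sometimes recorded as character.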
library(tidyverse)
bind_rows_keep_factors <- function(...) {
  ## Identify all factors
  factors <- unique(unlist(
    map(list(...), ~ select_if(.x, is.factor) %>% names())
  ))
  ## Bind dataframes, convert characters back to factors
  suppressWarnings(bind_rows(...)) %>%
    mutate_at(vars(one_of(factors)), factor)
}
dat1 <- tibble(
  id = 1:2,
  fruit = factor(c("banana", "apple"))
)
dat2 <- tibble(
  id = 3:4,
  fruit = c("pear", "banana"),
  taste = c("Mmmm", "yum")
)
bind_rows_keep_factors(dat1, dat2)
# A tibble: 4 x 3
     id fruit  taste
  <int> <fct>  <chr>
1     1 banana NA
2     2 apple  NA
3     3 pear   Mmmm
4     4 banana yum
Of course, the original ordering of the factor levels is lost (it reverts to alphabetical order).
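If the level order matters, one possible workaround is to record each factor's levels before binding and reapply them afterwards. The sketch below is only an illustration under a few assumptions: it needs dplyr 1.0 or later (for across()), and bind_rows_keep_levels is a hypothetical name rather than a function from any package.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: bind rows by column name, keeping factor level order
bind_rows_keep_levels <- function(...) {
  dfs <- list(...)
  # Columns that are a factor in at least one of the inputs
  factor_cols <- as.character(unique(unlist(
    map(dfs, ~ names(.x)[map_lgl(.x, is.factor)])
  )))
  # For each such column, collect levels in order of first appearance
  level_sets <- map(set_names(factor_cols), function(col) {
    unique(unlist(map(dfs, function(df) {
      if (!col %in% names(df)) return(character(0))
      x <- df[[col]]
      if (is.factor(x)) levels(x) else unique(as.character(x))
    })))
  })
  # Convert factors to character so bind_rows() has nothing to reconcile
  dfs_chr <- map(dfs, ~ mutate(.x, across(where(is.factor), as.character)))
  out <- bind_rows(dfs_chr)
  # Restore the factor columns with the recorded level order
  for (col in factor_cols) {
    out[[col]] <- factor(out[[col]], levels = level_sets[[col]])
  }
  out
}

dat1 <- tibble(
  id = 1:2,
  fruit = factor(c("banana", "apple"), levels = c("banana", "apple"))
)
dat2 <- tibble(
  id = 3:4,
  fruit = c("pear", "banana"),
  taste = c("Mmmm", "yum")
)
result <- bind_rows_keep_levels(dat1, dat2)
levels(result$fruit)
```

Levels from earlier data frames come first, and values seen only in character columns are appended in order of appearance. Whether that ordering is the right one depends on the data, so treat it as a starting point.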