R: Grouping by variable THEN counting/filtering by occurrences of another
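I have a data frame with a state column and a categorical variable. For each state, I want to find the most common value of the categorical variable and filter out the remaining rows.
For example: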
1 Alabama cat_variable_1
2 Alabama cat_variable_2
3 Alabama cat_variable_2
4 Alabama cat_variable_3
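For Alabama, cat_variable_2 is the most common value, so the rows with cat_variable_2 are all that should remain for Alabama in this data frame. The same should be done for every state.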
1 Alabama cat_variable_2
2 Alabama cat_variable_2
Thank you very much!
Within each state, you can keep only the rows whose variable value occurs most often.
library(dplyr)
df %>% group_by(state) %>% filter(variable == names(which.max(table(variable))))
# state variable
# <chr> <chr>
#1 Alabama cat_variable_2
#2 Alabama cat_variable_2
You can also write this in base R:
subset(df, as.logical(ave(variable, state,
FUN = function(x) x == names(which.max(table(x))))))
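One detail worth knowing about the snippets above: which.max() returns only the first maximum, so when two values tie within a state, only one of them is kept. The following is a minimal base R sketch, not part of the original answer, that keeps all tied values and works on counts directly. The data frame df2 and the tied Texas rows are made up for illustration.

```r
# Hypothetical toy data: Alabama as in the question, plus a made-up tie for Texas.
df2 <- data.frame(
  state    = c("Alabama", "Alabama", "Alabama", "Alabama", "Texas", "Texas"),
  variable = c("cat_variable_1", "cat_variable_2", "cat_variable_2",
               "cat_variable_3", "cat_variable_1", "cat_variable_3")
)

# Number of rows sharing each state/variable pair.
n_pair <- ave(seq_len(nrow(df2)), df2$state, df2$variable, FUN = length)

# Largest pair count within each state.
n_max <- ave(n_pair, df2$state, FUN = max)

# Keep rows whose variable reaches the state's maximum; tied values are all kept.
result <- df2[n_pair == n_max, ]
result
```

Here Alabama keeps its two cat_variable_2 rows, while both Texas rows survive because their values are tied. Whether ties should be kept or broken depends on what you need downstream.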
And with data.table:
library(data.table)
setDT(df)[, .SD[variable == names(which.max(table(variable)))], state]
Data
df <- structure(list(state = c("Alabama", "Alabama", "Alabama", "Alabama"
), variable = c("cat_variable_1", "cat_variable_2", "cat_variable_2",
"cat_variable_3")), row.names = c(NA, -4L), class = "data.frame")
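One approach is to build a new df containing only the state/cat_var combinations you want, then use dplyr::inner_join against the original df to keep just those combinations.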
library(dplyr)
## An example df with two "states" with different most common cat_var.
df <- tibble(
  state = gl(2, 50, labels = c("AL", "NY")),
  cat_var = case_when(
    state == "AL" ~ sample(1:3, 100, TRUE, prob = c(.2, .3, .5)),
    state == "NY" ~ sample(1:3, 100, TRUE, prob = c(.5, .3, .2))
  ),
  y = rnorm(100)
)
## Keeps the cat_var in each state that is most common, giving a df
## with each state--cat_var comb that we can filter against.
state_vars <-
  df %>%
  count(state, cat_var, sort = TRUE) %>%
  group_by(state) %>%
  slice(1) %>%
  ungroup()
## Use `inner_join` to only keep those comb in `state_vars`.
inner_join(df, state_vars, by = c("state", "cat_var"))
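Note that inner_join() also carries the count column n from the lookup table into the result. If you only want the original columns, dplyr::semi_join() may be a better fit, since it filters by matches without adding columns. Below is a small sketch of that variant, not the answerer's code. The names df_small, top_vars and kept, and the y values, are made up for illustration.

```r
library(dplyr)

# Toy data modeled on the question, plus a hypothetical extra column `y`.
df_small <- tibble(
  state    = rep("Alabama", 4),
  variable = c("cat_variable_1", "cat_variable_2",
               "cat_variable_2", "cat_variable_3"),
  y        = c(0.1, 0.2, 0.3, 0.4)
)

# One row per state: its most frequent variable (only the first row is kept on ties).
top_vars <- df_small %>%
  count(state, variable, sort = TRUE) %>%
  group_by(state) %>%
  slice(1) %>%
  ungroup()

# semi_join() keeps the rows of `df_small` that have a match in `top_vars`
# and returns only the columns of `df_small`.
kept <- semi_join(df_small, top_vars, by = c("state", "variable"))
kept
```

The result has the same columns as df_small (state, variable, y) and contains only the cat_variable_2 rows.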