Run chi-square test in all columns for a data_frame using dplyr
There are several similar questions about extracting chi-square results, but none of them solves my problem. I'd like to calculate p-values from chi-square tests for all columns in a data_frame and store them in a column within the original data_frame. There will be duplicate values, which I'm fine with. Ultimately, I'd like to select all columns in a data_frame that have a p-value lower than x when tested against my variable of choice.
require(dplyr)
my_df <- data_frame(
one_f = sample(LETTERS[1:5],100,T),
two_f = sample(LETTERS[4:5],100,T),
three_f = sample(LETTERS[5],100,T)
)
my_df %>%
head()
my_df %>%
summarise_all(funs(chisq.test(.,my_df$two_f)$p.value))
This gives me the following error:
Error in summarise_impl(.data, dots) :
Evaluation error: 'x' and 'y' must have at least 2 levels.
my_df %>%
mutate_if(n_distinct>1,fun(chisq.test(.,my_df$two_f)$p.value))
This gives me the following error:
Error in n_distinct > 1 :
comparison (6) is possible only for atomic and list types
I'm looking for something like this:
my_df %>%
mutate(p.value = sample(c(0.043,0.87,0.00),nrow(.),T)) %>%
head()
Then I plan to use gather and filter, then spread, to get the variables that are significantly associated according to my chi-square test.
I suppose
my_df %>% filter(foo,bar >= 0.05)#function that finds p.values and filters by
# alpha level
would be my ultimate goal.
require(dplyr)
require(tidyr)
my_df <- data_frame(
one_f = sample(LETTERS[1:5],100,T),
two_f = sample(LETTERS[4:5],100,T),
three_f = sample(LETTERS[5],100,T)
)
# select all column names where the column has more than 1 distinct values
my_df %>%
summarise_all(function(x) length(unique(x))) %>%
gather() %>%
filter(value > 1) %>%
pull(key) -> list_cols
# apply function only to those columns
my_df %>%
select(list_cols) %>%
summarise_all(funs(chisq.test(.,my_df$two_f)$p.value))
# # A tibble: 1 x 2
# one_f two_f
# <dbl> <dbl>
# 1 0.880 0.000000000000000000000120
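The approach above stops at the table of p-values. One possible way to get from there to the stated goal (keeping only the columns associated with the chosen variable) is sketched below. This is an illustration under assumptions, not the original answer's method: the names alpha, target, testable, p_values and significant are made up for the example, the seed is added only for reproducibility, and tibble() and all_of() assume a reasonably recent dplyr.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

my_df <- tibble(
  one_f   = sample(LETTERS[1:5], 100, TRUE),
  two_f   = sample(LETTERS[4:5], 100, TRUE),
  three_f = sample(LETTERS[5],   100, TRUE)
)

alpha  <- 0.05     # hypothetical significance cutoff
target <- "two_f"  # the variable every other column is tested against

# Columns with at least two distinct values; chisq.test() fails otherwise.
n_levels <- vapply(my_df, function(x) length(unique(x)), integer(1))
testable <- setdiff(names(my_df)[n_levels > 1], target)

# One p-value per testable column, returned as a named numeric vector.
# suppressWarnings() hides the small-expected-count warning, if any.
p_values <- vapply(
  my_df[testable],
  function(x) suppressWarnings(chisq.test(x, my_df[[target]])$p.value),
  numeric(1)
)

# Names of the columns whose p-value falls below the cutoff.
significant <- names(p_values)[p_values < alpha]

# Keep the target plus the significantly associated columns.
my_df %>% select(all_of(c(target, significant)))
```

Unlike the answer above, this sketch leaves the target column out of the tests, since testing two_f against itself always yields a tiny p-value. Because the data are random, which columns end up in significant depends on the seed.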