[英]For loop in R to run Chi_square test on all columns
head(Data)
Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8
SAMPLE1 0 1 0 0 0 1 1 1
SAMPLE2 0 1 0 1 0 1 0 1
SAMPLE3 0 0 0 1 0 0 1 1
SAMPLE4 1 0 0 0 1 0 0 0
I am running fisher test and chi square test like following我正在运行 Fisher 测试和卡方测试,如下所示
Gene7_Gene8<- table (Data[, c ("Gene7", "Gene8")])
chisq.test( Gene7_Gene8) and
fisher.test(Gene7_Gene8)
How can i run the chisq.test by for loop
in all the data sets at once.我怎样才能一次在所有数据集中运行 chisq.test by
for loop
。 I am looking for p.value against each possible pair of gene in my data set.我正在寻找针对我的数据集中每对可能的基因的 p.value。 (my original data has around 60 columns)
(我的原始数据大约有 60 列)
We may use combn
我们可以使用
combn
lst1 <- combn(seq_len(ncol(Data)), 2, function(i)
FUN = chisq.test(table(Data[, i])), simplify = FALSE)
names(lst1) <- combn(colnames(Data), 2, FUN = paste, collapse = "_")
I hope this works out for you.我希望这对你有用。
# This is a fake dataset similar to the one in the Opening post:
df <- data.frame(
gene1 = sample(c(0,1),10,T),
gene2 = sample(c(0,1),10,T),
gene3 = sample(c(0,1),10,T),
gene4 = sample(c(0,1),10,T),
gene5 = sample(c(0,1),10,T)
)
rownames(df) <- paste0("sample",1:nrow(df))
# ldf stands for long df, and is a longer version, which will make it easier to analysie.
ldf <- df %>% pivot_longer(cols = everything(),names_to = "gene_type",
values_to = "true_false") %>%
arrange(gene_type)
# ndf stands for nested df, and is the already nested-by-genes data frame we will analyse
ndf <- ldf %>% nest_by(gene_type)
# Here we create a new colum named p_value and in the map() function
# we extract the p values from a chiq.test erformed on each gene
ndf$p_value <- map(ndf$data,~chisq.test(table(.x))$p.value) %>% unlist()
# Now we have our final data frame and can further anlyse and create plots.
final_df <- ndf %>% unnest(cols = data)
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>%
mutate(is_significant=p.value<0.05)
# In my personal opinion, leaving the data frame in a long version is best suited for further anlysis.
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>%
mutate(is_significant=p.value<0.05) %>%
ggplot(aes(x=gene_type,y=p.value,color=is_significant))+
geom_text(aes(label=gene_type),show.legend = F)+
labs(title = "P values of different genes")+
theme_minimal()
No for
loop is required, we can do them as such:不需要
for
循环,我们可以这样做:
com <- data.frame(t(combn(colnames(Data), 2)))
com
# X1 X2
# 1 Gene1 Gene2
# 2 Gene1 Gene3
# 3 Gene1 Gene4
# 4 Gene1 Gene5
# 5 Gene1 Gene6
# 6 Gene1 Gene7
# 7 Gene1 Gene8
# 8 Gene2 Gene3
# ...
chisq <- apply(com, 1, function(i) chisq.test(table(Data[,i]))$p.value)
# Warning in chisq.test(table(Data[, i])) :
# Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
# Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
# Chi-squared approximation may be incorrect
# ...
chisq
# [1] 1.0000000 0.3173105 1.0000000 0.5049851 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000 0.3173105
# [12] 1.0000000 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000
For the fisher test, we run into some problems:对于 fisher 测试,我们遇到了一些问题:
fisher <- apply(com, 1, function(i) fisher.test(table(Data[,i]))$p.value)
# Error in fisher.test(table(Data[, i])) :
# 'x' must have at least 2 rows and columns
This is because not all tables are 2x2, so we need to catch that error and return something meaningful.这是因为并非所有表格都是 2x2,所以我们需要捕获该错误并返回一些有意义的内容。 I suggest
NA
:我建议
NA
:
fisher <- apply(com, 1, function(i) tryCatch(fisher.test(table(Data[,i]))$p.value, error = function(e) NA_real_))
fisher
# [1] 1.0000000 NA 1.0000000 0.2500000 1.0000000 1.0000000 0.2500000 NA 1.0000000 1.0000000 0.3333333
# [12] 1.0000000 1.0000000 NA NA NA NA NA 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.2500000 1.0000000 1.0000000 1.0000000
From here, we can easily combine them:从这里,我们可以轻松地将它们组合起来:
out <- cbind(com, chisq=chisq, fisher=fisher)
out
# X1 X2 chisq fisher
# 1 Gene1 Gene2 1.0000000 1.0000000
# 2 Gene1 Gene3 0.3173105 NA
# 3 Gene1 Gene4 1.0000000 1.0000000
# 4 Gene1 Gene5 0.5049851 0.2500000
# 5 Gene1 Gene6 1.0000000 1.0000000
# 6 Gene1 Gene7 1.0000000 1.0000000
# 7 Gene1 Gene8 0.5049851 0.2500000
# 8 Gene2 Gene3 1.0000000 NA
# 9 Gene2 Gene4 1.0000000 1.0000000
# ...
While combn
does have the FUN=
argument, I found it easier to reuse the results for the fisher.test
automation as well.虽然
combn
确实有FUN=
参数,但我发现将结果重用于fisher.test
自动化也更容易。 It also clears up the output (in my mind) since it clearly shows which genes are considered in each comparison.它还清除了 output(在我看来),因为它清楚地显示了每次比较中考虑了哪些基因。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.