简体   繁体   English

For 循环 R 对所有列运行卡方检验

[英]For loop in R to run Chi_square test on all columns

head(Data)
                       
          Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 
  SAMPLE1   0    1      0      0     0     1   1     1          
  SAMPLE2   0    1      0      1     0     1   0     1        
  SAMPLE3   0    0      0      1     0     0   1     1          
  SAMPLE4   1    0      0      0     1     0   0     0    

I am running fisher test and chi square test like following我正在运行 Fisher 测试和卡方测试,如下所示

Gene7_Gene8<- table (Data[, c ("Gene7", "Gene8")])
chisq.test( Gene7_Gene8) and 
fisher.test(Gene7_Gene8)

How can i run the chisq.test by for loop in all the data sets at once.我怎样才能一次在所有数据集中运行 chisq.test by for loop I am looking for p.value against each possible pair of gene in my data set.我正在寻找针对我的数据集中每对可能的基因的 p.value。 (my original data has around 60 columns) (我的原始数据大约有 60 列)

We may use combn我们可以使用combn

lst1 <- combn(seq_len(ncol(Data)), 2, function(i) 
   FUN = chisq.test(table(Data[, i])), simplify = FALSE)
names(lst1) <- combn(colnames(Data), 2, FUN = paste, collapse = "_")

I hope this works out for you.我希望这对你有用。 在此处输入图像描述

# This is a fake dataset similar to the one in the Opening post:
df <- data.frame(
  
  gene1 = sample(c(0,1),10,T),
  gene2 = sample(c(0,1),10,T),
  gene3 = sample(c(0,1),10,T),
  gene4 = sample(c(0,1),10,T),
  gene5 = sample(c(0,1),10,T)
)
rownames(df) <- paste0("sample",1:nrow(df))



# ldf stands for long df, and is a longer version, which will make it easier to analysie.
ldf <- df %>% pivot_longer(cols = everything(),names_to = "gene_type",
                           values_to = "true_false") %>%
  arrange(gene_type)
# ndf stands for nested df, and is the already nested-by-genes data frame we will analyse
ndf <- ldf %>% nest_by(gene_type)

# Here we create a new colum named p_value and in the map() function
# we extract the p values from a chiq.test erformed on each gene
ndf$p_value <- map(ndf$data,~chisq.test(table(.x))$p.value) %>% unlist()


# Now we have our final data frame and can further anlyse and create plots.
final_df <- ndf %>% unnest(cols = data)
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05)

# In my personal opinion, leaving the data frame in a long version is best suited for further anlysis.

final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05) %>% 
  ggplot(aes(x=gene_type,y=p.value,color=is_significant))+
  geom_text(aes(label=gene_type),show.legend = F)+
  labs(title = "P values of different genes")+
  theme_minimal()

No for loop is required, we can do them as such:不需要for循环,我们可以这样做:

com <- data.frame(t(combn(colnames(Data), 2)))
com
#       X1    X2
# 1  Gene1 Gene2
# 2  Gene1 Gene3
# 3  Gene1 Gene4
# 4  Gene1 Gene5
# 5  Gene1 Gene6
# 6  Gene1 Gene7
# 7  Gene1 Gene8
# 8  Gene2 Gene3
# ...

chisq <- apply(com, 1, function(i) chisq.test(table(Data[,i]))$p.value)
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# ...

chisq
#  [1] 1.0000000 0.3173105 1.0000000 0.5049851 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000 0.3173105
# [12] 1.0000000 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000

For the fisher test, we run into some problems:对于 fisher 测试,我们遇到了一些问题:

fisher <- apply(com, 1, function(i) fisher.test(table(Data[,i]))$p.value)
# Error in fisher.test(table(Data[, i])) : 
#   'x' must have at least 2 rows and columns

This is because not all tables are 2x2, so we need to catch that error and return something meaningful.这是因为并非所有表格都是 2x2,所以我们需要捕获该错误并返回一些有意义的内容。 I suggest NA :我建议NA

fisher <- apply(com, 1, function(i) tryCatch(fisher.test(table(Data[,i]))$p.value, error = function(e) NA_real_))
fisher
#  [1] 1.0000000        NA 1.0000000 0.2500000 1.0000000 1.0000000 0.2500000        NA 1.0000000 1.0000000 0.3333333
# [12] 1.0000000 1.0000000        NA        NA        NA        NA        NA 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.2500000 1.0000000 1.0000000 1.0000000

From here, we can easily combine them:从这里,我们可以轻松地将它们组合起来:

out <- cbind(com, chisq=chisq, fisher=fisher)
out
#       X1    X2     chisq    fisher
# 1  Gene1 Gene2 1.0000000 1.0000000
# 2  Gene1 Gene3 0.3173105        NA
# 3  Gene1 Gene4 1.0000000 1.0000000
# 4  Gene1 Gene5 0.5049851 0.2500000
# 5  Gene1 Gene6 1.0000000 1.0000000
# 6  Gene1 Gene7 1.0000000 1.0000000
# 7  Gene1 Gene8 0.5049851 0.2500000
# 8  Gene2 Gene3 1.0000000        NA
# 9  Gene2 Gene4 1.0000000 1.0000000
# ...

While combn does have the FUN= argument, I found it easier to reuse the results for the fisher.test automation as well.虽然combn确实有FUN=参数,但我发现将结果重用于fisher.test自动化也更容易。 It also clears up the output (in my mind) since it clearly shows which genes are considered in each comparison.它还清除了 output(在我看来),因为它清楚地显示了每次比较中考虑了哪些基因。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM