
Variable length error in R ANOVA loop

I am trying to run an ANOVA on my dataframe, which is formatted as follows:

ethnicity sampleID batch gender gene1 gene2 gene3 ...

...and so on, up to a couple of thousand genes, with the table filled with gene expression values.

Below is the code I am using to try and run an anova for each gene to find differences between ethnicity:

# here, 'merge' is the dataframe as described above
# set ethnicity to categorical
merge$ethnicity <- factor(merge$ethnicity, levels=c("Chinese","Malay","Indian"))

# parametric anova for each gene
baseformula <- " ~ ethnicity"
for (i in 5:ncol(merge))
{
  p <- anova(lm(colnames(merge)[i] ~ ethnicity, data=merge))  # variable lengths differ??
}

When I try running this code, I am getting the following error:

Error in model.frame.default(formula = colnames(merge)[i] ~ ethnicity, : variable lengths differ (found for 'ethnicity')

I have checked the length of my ethnicity column, and it is the same as the length of my gene1 column. I have also tried applying na.omit() to merge$ethnicity, but I still get the same error.

Does anyone have any suggestions as to what the problem is?

Thanks!


EDIT: Here are the first five rows and first five columns of my dataframe:

    ethnicity sample.id Batch Gender X7896759  
1           1 H60903    B6      1  6.19649  
2           1 H61603    B2      1  6.74464  
3           1 H61608    B7      2  6.20268  
4           1 H62204    B4      1  6.71395  
5           1 H62901    B7      2  6.59963

Using the code:

for (i in 5:ncol(merge))
{
  print(colnames(merge)[i])
  print(summary(aov(merge[,i] ~ merge$ethnicity)))

}

appears to be giving me the following error:

Error in levels(x)[x] : only 0's may be mixed with negative subscripts
In addition: Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : '-' not meaningful for factors

I generated an example. df contains a variable etnicity, which has 3 groups, and there are two genes. etnicity is your predictor variable. The loop prints the aov summary for each gene against etnicity.

set.seed(1)
df <- data.frame(etnicity=c('A', 'B', 'C','A', 'B', 'C','A', 'B', 'C'),
                 gene1=rnorm(9),
                 gene2=rnorm(9))

for(i in 2:ncol(df)){
  print(colnames(df)[i])
  print( summary( aov(df[,i] ~ df$etnicity) ) )
  }

[1] "gene1"
            Df Sum Sq Mean Sq F value Pr(>F)
df$etnicity  2  1.324  0.6619   1.006   0.42
Residuals    6  3.947  0.6579               
[1] "gene2"
            Df Sum Sq Mean Sq F value Pr(>F)
df$etnicity  2  2.436   1.218   0.977  0.429
Residuals    6  7.478   1.246 
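
The "variable lengths differ" error in the question most likely comes from writing colnames(merge)[i] directly inside the formula: R then evaluates it as a length-one character vector instead of the column it names. If you want to keep the data= interface of lm(), one option is to build the formula from the column name with reformulate(). The following is a minimal sketch on the toy df above; pvals and gene_cols are illustrative names that do not appear in the question.

```r
set.seed(1)
df <- data.frame(etnicity = factor(rep(c("A", "B", "C"), times = 3)),
                 gene1 = rnorm(9),
                 gene2 = rnorm(9))

# Names of the response columns (everything after the predictor)
gene_cols <- colnames(df)[2:ncol(df)]

# pvals is an illustrative name: one ANOVA p-value per gene
pvals <- sapply(gene_cols, function(g) {
  f <- reformulate("etnicity", response = g)  # builds e.g. gene1 ~ etnicity
  fit <- lm(f, data = df)
  anova(fit)[["Pr(>F)"]][1]                   # p-value for the etnicity term
})

print(pvals)
```

Collecting the p-values in a named vector, rather than overwriting p on every iteration as in the question, keeps the result for each gene available after the loop.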

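Here it is applied to data more similar to the OP's. Note that ethnicity is read as an integer here, so aov() fits it as a numeric covariate (hence Df = 1 in the output below).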

df <- read.table(text="ethnicity sample.id Batch Gender X7896759  
1           1 H60903    B6      1  6.19649  
2           1 H61603    B2      1  6.74464  
3           2 H61608    B7      2  6.20268  
4           2 H62204    B4      1  6.71395  
5           3 H62901    B7      2  6.59963", header=T, stringsAsFactors=F)  


for(i in 5:ncol(df)){
  print(colnames(df)[i])
  print(summary(aov(df[,i]~df$ethnicity)))
}

[1] "X7896759"
             Df  Sum Sq Mean Sq F value Pr(>F)
df$ethnicity  1 0.00803 0.00803   0.084  0.791
Residuals     3 0.28767 0.09589  
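
The warning about "a factor response" in the question suggests that at least some gene columns were imported as factors rather than numbers. The sketch below assumes that is the cause and that the factor labels are valid numbers. The factor() call on the gene column only simulates the problem on the five rows above, and gene_cols is an illustrative name.

```r
df <- read.table(text = "ethnicity sample.id Batch Gender X7896759
1           1 H60903    B6      1  6.19649
2           1 H61603    B2      1  6.74464
3           2 H61608    B7      2  6.20268
4           2 H62204    B4      1  6.71395
5           3 H62901    B7      2  6.59963", header = TRUE, stringsAsFactors = FALSE)

# Simulate a gene column that was read in as a factor (assumption about the OP's data)
df$X7896759 <- factor(df$X7896759)

# Treat the 1/2/3 codes as groups, not as a numeric covariate
df$ethnicity <- factor(df$ethnicity)

gene_cols <- 5:ncol(df)

# Convert factor columns back to numbers via their labels, not their integer codes
df[gene_cols] <- lapply(df[gene_cols], function(x) {
  if (is.factor(x)) as.numeric(as.character(x)) else x
})

for (i in gene_cols) {
  print(colnames(df)[i])
  print(summary(aov(df[[i]] ~ df$ethnicity)))
}
```

Going through as.character() matters: calling as.numeric() directly on a factor returns the internal level codes, not the expression values.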
