I am trying to run an ANOVA on my dataframe, which has the following format:
ethnicity sampleID batch gender gene1 gene2 gene3 ...
... up to a couple of thousand genes, with the table filled with gene expression values.
Below is the code I am using to run an ANOVA for each gene to find differences between ethnicities:
# here, 'merge' is the dataframe as described above
# set ethnicity to categorical
merge$ethnicity <- factor(merge$ethnicity, levels=c("Chinese","Malay","Indian"))
# parametric anova for each gene
baseformula <- " ~ ethnicity"
for (i in 5:ncol(merge))
{
p <- anova(lm(colnames(merge)[i] ~ ethnicity, data=merge)) # variable lengths differ??
}
When I try running this code, I am getting the following error:
Error in model.frame.default(formula = colnames(merge)[i] ~ ethnicity, : variable lengths differ (found for 'ethnicity')
I have checked the length of my ethnicity column, which is the same as the length of my gene1 column. I have also tried using na.omit() on merge$ethnicity, but it still gives the same error.
Does anyone have any suggestions as to what the problem is?
Thanks!
EDIT: Here are the first five rows and first five columns of my dataframe:
  ethnicity sample.id Batch Gender X7896759
1         1    H60903    B6      1  6.19649
2         1    H61603    B2      1  6.74464
3         1    H61608    B7      2  6.20268
4         1    H62204    B4      1  6.71395
5         1    H62901    B7      2  6.59963
Using the code:
for (i in 5:ncol(merge))
{
print(colnames(merge)[i])
print(summary(aov(merge[,i] ~ merge$ethnicity)))
}
appears to be giving me the following error:
Error in levels(x)[x] : only 0's may be mixed with negative subscripts
In addition: Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : '-' not meaningful for factors
I generated an example. df contains a variable etnicity, which has 3 groups, and there are two genes. etnicity is your predictor variable. The loop prints the aov summary for each gene in relation to etnicity.
set.seed(1); df <- data.frame(etnicity=c('A', 'B', 'C','A', 'B', 'C','A', 'B', 'C'), gene1=rnorm(9), gene2=rnorm(9))
for(i in 2:ncol(df)){
print(colnames(df)[i])
print( summary( aov(df[,i] ~ df$etnicity) ) )
}
[1] "gene1"
            Df Sum Sq Mean Sq F value Pr(>F)
df$etnicity  2  1.324  0.6619   1.006   0.42
Residuals    6  3.947  0.6579
[1] "gene2"
            Df Sum Sq Mean Sq F value Pr(>F)
df$etnicity  2  2.436   1.218   0.977  0.429
Residuals    6  7.478   1.246
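As for why the original lm() call fails: inside a formula, colnames(merge)[i] is evaluated as a character vector of length one, so the "response" has length 1 while ethnicity has one value per sample, hence "variable lengths differ". A minimal sketch of one way around this, assuming the same toy data as above, is to build the formula from the column name with reformulate() and collect the p-values instead of printing each table (the names genes and pvals are only illustrative):

```r
set.seed(1)
df <- data.frame(etnicity = factor(rep(c("A", "B", "C"), 3)),
                 gene1 = rnorm(9), gene2 = rnorm(9))

# every column except the predictor is treated as a gene here
genes <- colnames(df)[2:ncol(df)]

pvals <- sapply(genes, function(g) {
  # builds e.g. gene1 ~ etnicity from the column name
  f <- reformulate("etnicity", response = g)
  # first row of the anova table is the etnicity term
  anova(lm(f, data = df))[["Pr(>F)"]][1]
})
print(pvals)
```

With a couple of thousand genes, a named vector like pvals is easier to sort or pass to p.adjust() than printed summaries.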
The same loop applied to data more similar to the OP's:
df <- read.table(text="ethnicity sample.id Batch Gender X7896759
1 1 H60903 B6 1 6.19649
2 1 H61603 B2 1 6.74464
3 2 H61608 B7 2 6.20268
4 2 H62204 B4 1 6.71395
5 3 H62901 B7 2 6.59963", header=T, stringsAsFactors=F)
for(i in 5:ncol(df)){
print(colnames(df)[i])
print(summary(aov(df[,i]~df$ethnicity)))
}
[1] "X7896759"
             Df  Sum Sq Mean Sq F value Pr(>F)
df$ethnicity  1 0.00803 0.00803   0.084  0.791
Residuals     3 0.28767 0.09589
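Two caveats, offered as a hedged sketch rather than a diagnosis of the OP's actual data. First, the output above has Df = 1 for ethnicity because the column was read as an integer and is treated as a continuous predictor; wrapping it in factor() gives a one-way ANOVA across groups. Note that the levels passed to factor() must match the stored values, so levels = c("Chinese", "Malay", "Indian") on a column coded 1/2/3 would yield only NA. Second, the "factor response" warning in the question suggests that at least one gene column is stored as a factor; converting via as.character() and then as.numeric() is one common fix. The variable gene_cols is only illustrative:

```r
df <- read.table(text = "ethnicity sample.id Batch Gender X7896759
1 1 H60903 B6 1 6.19649
2 1 H61603 B2 1 6.74464
3 2 H61608 B7 2 6.20268
4 2 H62204 B4 1 6.71395
5 3 H62901 B7 2 6.59963", header = TRUE, stringsAsFactors = FALSE)

# treat ethnicity as a grouping variable, not a number
df$ethnicity <- factor(df$ethnicity)

# assumption: gene columns start at column 5, as in the question
gene_cols <- 5:ncol(df)

# make sure every gene column is numeric (harmless if it already is);
# as.character() first avoids getting the internal factor codes
df[gene_cols] <- lapply(df[gene_cols], function(x) as.numeric(as.character(x)))

fit <- aov(X7896759 ~ ethnicity, data = df)
tab <- summary(fit)[[1]]  # the ANOVA table as a data frame
print(tab)
```

Values that cannot be parsed as numbers become NA with a warning, which is a useful hint about which cells caused the column to be read as non-numeric in the first place.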