For loop with if in R, running over a data frame with multiple columns and multiple rows
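I have a file with 36 columns. Every second column contains a gene symbol, every first column contains the TPM value for that symbol, calculated per transcript, and the transcript is in every third column.
This means a gene symbol in the second column may repeat in the next cell, and different gene symbols may occur a different number of times, depending on how many transcripts the gene has. I want to run a for loop in R that sums all TPM values for the same gene symbol and moves the result into a new data frame.
My code is: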
for (i in 1:12)
{
for (j in 2:length(df$ref_gene_name.i))
{for (k in 2:length(df$ref_gene_name.i))
{ if (df$ref_gene_name.i[k] == df$ref_gene_name.i[k+1])
{df1$ref_gene_name.i[j] <- df$ref_gene_name.i[k]}
df1$TPM.i[j] <- df$TPM.i[k] + df$TPM.i[k+1]
}
}
}
When I run it, I get the error: Error in if (df$ref_gene_name.i[k] == df$ref_gene_name.i[k + 1]) { : argument is of length zero. Checking the individual steps for errors:
k=5
df$ref_gene_name.0[k]
df$ref_gene_name.0[k] == df$ref_gene_name.0[k+2]
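seems to work and returns the correct value: FALSE if it is not the same symbol, TRUE if it is.
I am not sure where my mistake is; any help would be appreciated.
The data looks like this:
How about this: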
library(dplyr)
# Example Data (NA to simulate a partial line)
df <- data.frame("TPM"=c(0.005,0.0008,0.075),"GeneName"=c("OCT4","TERT","TERT"),"Transcript"=c("a","a","b"),
"TPM2"=c(0.005,0.0008,NA),"GeneName2"=c("OCT4","TERT",NA),"Transcript2"=c("a","a",NA))
# New data frame: one column per data type, filled block by block in the loop below
df2 <- data.frame()
for (i in 1:(ncol(df)/3)){
e <- i*3
s <- e-2
dfn <- df[,s:e]
colnames(dfn) <- c("TPM","GeneName","Transcript")
df2 <- rbind(df2,dfn)
}
# Group by gene name, sum the TPM values per gene, and omit rows with missing values from incomplete lines.
df2 %>% group_by(GeneName) %>% summarise("sumTPM"=sum(TPM)) %>% na.omit()
This may need some tweaking, but it should be along these lines.
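One possible tweak, sketched here under assumptions: the loop above pools every three-column block, so the totals are per gene across all blocks. If per-sample totals are wanted instead, a sample index can be carried along while stacking. The `Sample` column and the helper name `stack_blocks` are hypothetical, and base R's `aggregate()` stands in for the dplyr pipeline.

```r
# Same example data as above (NA simulates a partial line)
df <- data.frame(TPM = c(0.005, 0.0008, 0.075),
                 GeneName = c("OCT4", "TERT", "TERT"),
                 Transcript = c("a", "a", "b"),
                 TPM2 = c(0.005, 0.0008, NA),
                 GeneName2 = c("OCT4", "TERT", NA),
                 Transcript2 = c("a", "a", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: stack each 3-column block and record which block it came from
stack_blocks <- function(d) {
  blocks <- lapply(seq_len(ncol(d) / 3), function(i) {
    block <- d[, (3 * i - 2):(3 * i)]
    colnames(block) <- c("TPM", "GeneName", "Transcript")
    block$Sample <- i
    block
  })
  do.call(rbind, blocks)
}

long <- stack_blocks(df)

# The formula interface of aggregate() drops rows with NA by default (na.action = na.omit)
per_sample <- aggregate(TPM ~ GeneName + Sample, data = long, FUN = sum)
per_sample
```

Dropping `Sample` from the formula (`TPM ~ GeneName`) gives the pooled totals again, so the same stacked table serves both views.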
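# Column names carry the suffixes .0 to .11, one per three-column block
for (i in 0:11)
{
  gene_col <- paste0("ref_gene_name.", i)
  tpm_col <- paste0("TPM.", i)
  for (j in unique(df[, gene_col]))
  {
    # Sum the TPM of every transcript of gene j within block i
    print(sum(df[df[, gene_col] == j, tpm_col], na.rm = TRUE))
  }
}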
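Building the column name with paste0(), as above, also points at the cause of the error in the question. A minimal sketch, assuming a toy data frame (the values are made up; only the column naming follows the question): `df$ref_gene_name.i` looks for a column literally named `ref_gene_name.i`, because `$` does not substitute the loop variable `i`. The result is NULL, so the comparison inside `if` has length zero.

```r
# Toy data; values are hypothetical, only the column naming follows the question
df <- data.frame(TPM.0 = c(1, 2, 3),
                 ref_gene_name.0 = c("A", "A", "B"),
                 stringsAsFactors = FALSE)
i <- 0

# `$` takes the name literally, so this looks for a column called "ref_gene_name.i"
literal <- df$ref_gene_name.i
is.null(literal)                    # TRUE: no such column
length(literal[1] == literal[2])    # 0: this is what makes if() fail

# Building the name as a string and indexing with [[ ]] does use the value of i
col <- df[[paste0("ref_gene_name.", i)]]
col[1] == col[2]                    # TRUE: rows 1 and 2 are the same gene
```

The manual check in the question worked because it spelled out `ref_gene_name.0` by hand. Note also that a loop comparing element `k` with `k+1` up to `length(...)` reads one position past the end on its last iteration, which yields NA and a different `if` error.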
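Assuming your data has the structure of the random data below (seeded for reproducibility), consider the following: sum within each set of columns, then sum across them:
Data (gene names are stats/numeric, closed/open-source, programs/languages)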
gene_name <- c("SAS", "Stata", "SPSS", "Julia", "R", "Pandas")
set.seed(41918)
df <- data.frame(
TPM.0 = abs(rnorm(50))*100,
transcript_id.0 = replicate(50, paste(replicate(10, sample(LETTERS , 1, replace=TRUE)), collapse="")),
ref_gene_name.0 = replicate(50, sample(gene_name , 1, replace=TRUE)),
TPM.1 = abs(rnorm(50))*100,
transcript_id.1 = replicate(50, paste(replicate(10, sample(LETTERS , 1, replace=TRUE)), collapse="")),
ref_gene_name.1 = replicate(50, sample(gene_name , 1, replace=TRUE)),
TPM.2 = abs(rnorm(50))*100,
transcript_id.2 = replicate(50, paste(replicate(10, sample(LETTERS , 1, replace=TRUE)), collapse="")),
ref_gene_name.2 = replicate(50, sample(gene_name , 1, replace=TRUE))
)
head(df)
# TPM.0 transcript_id.0 ref_gene_name.0 TPM.1 transcript_id.1 ref_gene_name.1 TPM.2 transcript_id.2 ref_gene_name.2
# 1 86.142687 YVXKYYGWBA Stata 139.16500 IYIJLZITLR SPSS 42.39001 LFCAKYBJKI SPSS
# 2 133.150120 YZGWGGFKXG SPSS 19.46897 TULSBXMZPE SAS 88.39766 AUSWZRNRNZ Stata
# 3 139.804035 ZHPLNRNYWN Pandas 166.69469 WLUNYEPGAQ R 103.52094 CRERVAUSDU SPSS
# 4 146.847943 OTKELYDWDC SPSS 66.93809 LLOCPRBUZS R 62.43820 QZYZINREYO SAS
# 5 89.437472 NMAHZLRXJX SPSS 49.17413 VCEDDIBJHA Julia 148.03048 LTHJEDOPDB Julia
# 6 5.584601 WJLKHEBYYB Stata 88.22947 RERMEUCXGL SPSS 61.42689 HHGRPSVALV SAS
Process
df$X <- NULL # BE SURE TO REMOVE ANYTHING BEFORE FIRST TPM
# LIST OF DATAFRAMES (EVERY 3 COLUMNS)
df_list <- lapply(seq(1, ncol(df), 3), function(i) {
tmp <- df[, c(i,(i+2))]
# NORMALIZE GENE INDICATOR COLUMN NAME
colnames(tmp)[2] <- "ref_gene_name"
# WITHIN SUM
aggregate(.~ref_gene_name, tmp, FUN=sum)
})
# CHAIN MERGE ACROSS ALL DATAFRAMES
wide_df <- Reduce(function(x, y) merge(x, y, by="ref_gene_name", all.x=TRUE), df_list)
# ACROSS SUM: ALL TPM COLUMNS
wide_df$TPM_All <- Reduce(`+`, wide_df[grep("TPM", names(wide_df))])
wide_df
# ref_gene_name TPM.0 TPM.1 TPM.2 TPM_All
# 1 Julia 1284.8478 649.3629 1250.2410 3184.452
# 2 Pandas 530.0559 590.9631 873.6411 1994.660
# 3 R 538.8770 509.3850 254.7034 1302.965
# 4 SAS 287.0210 645.4013 587.1971 1519.619
# 5 SPSS 659.0406 1008.8625 902.4517 2570.355
# 6 Stata 1095.2571 925.9412 781.9734 2803.172
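One caveat with the chain merge above: all.x=TRUE keeps only the genes present in the first data frame, and any NA produced by the merge propagates through `+`. A minimal sketch of a variant, assuming a full outer join is what is wanted; the two small per-sample tables are made up for illustration.

```r
# Hypothetical per-sample sums; gene "C" is missing from the first sample
s0 <- data.frame(ref_gene_name = c("A", "B"), TPM.0 = c(1, 2))
s1 <- data.frame(ref_gene_name = c("A", "C"), TPM.1 = c(10, 30))
df_list <- list(s0, s1)

# all = TRUE keeps genes that appear in any sample (full outer join)
wide_df <- Reduce(function(x, y) merge(x, y, by = "ref_gene_name", all = TRUE), df_list)

# rowSums() with na.rm = TRUE counts a gene absent from a sample as contributing 0
tpm_cols <- grep("^TPM", names(wide_df))
wide_df$TPM_All <- rowSums(wide_df[tpm_cols], na.rm = TRUE)
wide_df
```

Whether a missing gene should count as 0 or stay NA depends on the analysis; with the random data above every gene occurs in every sample, so both variants give the same table.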