R loop between multiple data.frames and assign values to them

I'm using R to make some alterations to cnvkit output (for my own purposes). The thing is: run sample by sample, the script works like a charm, but when I put it into a for loop, it breaks!

I tried a lot of answers posted on Stack Overflow, but none of them helped me.

# Clear workspace
rm(list=(ls()))

ref <- read.csv("/path/to/reference.cnn", header=T, sep="\t")
path <- "/path/to/call_files/"
files = list.files(path = path, pattern = "*.final.call.cnr", full.names=FALSE)
for(file in files) {
    perpos <- which(strsplit(file, "")[[1]]==".")
    assign(
    gsub(" ","",substr(file, 1, perpos-1)), 
    read.csv(paste(path,file,sep=""), header=T, sep="\t"))

}


mod_CNV = function(x) {

    # Merge both files by "start" position
    merged <- merge(files[i], ref, by="start", suffixes=c(".files[i]", ".ref"))

    # Round "log2" column
    merged$log2.D00893 <- round(merged$log2.files[i], digits=1)

    # re-calculate "cn" based on log2 correction
    merged$cn <- round(2*(2^(merged$log2.files[i])))

    # Subset file with all "cn" values that are not 2
    alt.cn <- subset(merged, merged$cn !=2)

    # Create new data with columns of interest
    alt.cns <- as.data.frame(alt.cn[, c(1:8,13)])

    # Re-order columns for better view
    alt.cns <- alt.cns[c(2,1,3,4,6,5,8,7,9)]

    # Calculate ratio between coverages
    alt.cns$depth.ratio <- round(alt.cns$depth.files[i] / alt.cns$depth.ref, digits=2)
    alt.cns$depth.ratio.1 <- round(alt.cns$depth.files[i] / alt.cns$depth.ref, digits=2)

    ## Function to call for DUP or DEL.  
    alt.cns$SV_type <- ifelse(alt.cns$cn < 2, "DEL", "DUP")

    # Convert "alt.cns" to .bed file
    full <- alt.cns[c(1,2,3,12,5,4,6,7,8,9,10)]
    names(full)[1] <- "#Chrom"
    names(full)[2] <- "Start"
    names(full)[3] <- "End"
    names(full)[4] <- "SV_type"
    names(full)[6] <- "gene"
    names(full)[7] <- "log2"

    # Save "alt.cns" as .bed file
    write.table(full, file="/path/to/output/files[i].bed", row.names=F, col.names=T, sep="\t")

    # Filter "alt.cns" file
    filtered <- subset(alt.cns, alt.cns$depth.ratio < 0.70 | alt.cns$depth.ratio > 1.40 & alt.cns$weight > 0.3)
    filtered <- filtered[c(1,2,3,12,5,4,6,7,8,9,10)]
    names(filtered)[1] <- "#Chrom"
    names(filtered)[2] <- "Start"
    names(filtered)[3] <- "End"
    names(filtered)[4] <- "SV_type"
    names(filtered)[6] <- "gene"
    names(filtered)[7] <- "log2"

    #Save file
    write.table(filtered, file="/path/to/output/files[i].bed", row.names=F, col.names=T, sep="\t")

}


for ( i in seq_along(files)) {
        mod_CNV(files[i])
    }

What I expect is that the loop reads the files one by one, uses each individual file name in place of files[i], and saves the result as a .bed file. But I'm getting an error right at the beginning of the code:

"Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column".

For some reason, the loop isn't recognizing my files[i] variable, which is causing this error. Can someone help me with this problem? To be clear, this error doesn't occur when running sample by sample, outside the loop.

Welcome to Stack Overflow!

You've declared a function:

mod_CNV = function(x) {

    # Merge both files by "start" position
    merged <- merge(files[i], ref, by="start", suffixes=c(".files[i]", ".ref"))
    .
    .
    .
}

From what I can tell, there is no reason that this function should know what i is; this is probably why files[i] fails.

Here is where i is defined:

for ( i in seq_along(files)) {
    mod_CNV(files[i])
}

i is the loop variable of the for loop. In R a for loop does not create its own scope, so mod_CNV can technically see i as a global, but relying on that is fragile. If you want a value to be available inside mod_CNV, pass it in as a parameter.

What you are passing in to mod_CNV is the file name. Inside mod_CNV, this file name is referred to as x, yet I don't see anywhere inside mod_CNV where you use x. Note also that files[i] is a character string, not a data frame, so merge() has no "start" column to merge on, which is what the 'by' error is reporting.
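To see why the error mentions 'by', here is a minimal sketch with made-up data. The `ref` and `sample_df` tables and the file name below are hypothetical stand-ins, not the real cnvkit files. Handing merge() a file name gives it a one-column table with no "start" column, whereas handing it a data frame works:

```r
# Hypothetical stand-ins for the reference and one sample table
ref <- data.frame(start = c(100, 200, 300), depth = c(10, 20, 30))
sample_df <- data.frame(start = c(100, 200, 300), depth = c(12, 18, 33))

# A file name is just a character string; merge() coerces it to a
# one-column data frame, so there is no "start" column to merge on
err <- tryCatch(
    merge("D00893.final.call.cnr", ref, by = "start"),
    error = function(e) conditionMessage(e)
)
print(err)

# Merging the data frame itself succeeds
merged <- merge(sample_df, ref, by = "start", suffixes = c(".sample", ".ref"))
print(names(merged))
```

So the file has to be read into a data frame before it reaches merge().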

This is how you should declare your function and make use of the filename you are passing in:

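mod_CNV = function(filename) {

    # Read the sample's table first: merge() needs data frames, not file names
    # (`path` is the directory defined in the question's code)
    sample_df <- read.csv(paste0(path, filename), header=T, sep="\t")

    # Merge both files by "start" position
    merged <- merge(sample_df, ref, by="start", suffixes=c(filename, ".ref"))
    .
    .
    .
    # replace all other occurrences of `files[i]` with `filename`
}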

And you can loop through the list of files and call mod_CNV like this, without using i:

for (file in files) {
    mod_CNV(file)
}

Also, I haven't used merge before and I don't know exactly what you are trying to do, but I find it odd to use an entire filename as a suffix. It may be what you intended, though.
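If a shorter suffix is wanted, one possible sketch is to derive a sample ID from the file name and use that instead. The tables below are made up, and the rule "everything before the first dot is the sample ID" is an assumption based on the file pattern in the question:

```r
# Hypothetical file name following the pattern used in the question
file <- "D00893.final.call.cnr"

# Drop everything from the first dot onward to get a short sample ID
sample_id <- sub("\\..*$", "", file)

# Made-up tables standing in for one sample and the reference
sample_df <- data.frame(start = c(100, 200), log2 = c(0.52, -1.04))
ref <- data.frame(start = c(100, 200), log2 = c(0.01, -0.02))

# Columns that exist in both tables get the two suffixes appended
merged <- merge(sample_df, ref, by = "start",
                suffixes = c(paste0(".", sample_id), ".ref"))

# Column names built at run time are reached with [[ ]], not with $
log2_col <- paste0("log2.", sample_id)
merged[[log2_col]] <- round(merged[[log2_col]], digits = 1)
print(merged)
```

Because the column name depends on the sample, `merged$log2.files[i]` cannot work; building the name as a string and indexing with `[[ ]]` does.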

Anyway, this should be enough information for you to resolve your issue.

For anyone who runs into the same problem as I did, here is the corrected code:

path <- "/path/to/files/"
files = list.files(path = path, pattern = "*.file.ext", full.names=FALSE)
for(file in files) {
    perpos <- which(strsplit(file, "")[[1]]==".")
    assign(
    gsub(" ","",substr(file, 1, perpos-1)), 
    read.csv(paste(path,file,sep=""), header=T, sep="\t"))

}

s_ref <- read.csv("/read/ref/file", header=T, sep="\t")
s_ref["depth.ref.norm"] <- round(s_ref["depth"]/mean(s_ref[["depth"]]), digits=2)

mod_CNV = function(file) {
    # `file` is a bare file name (full.names=FALSE), so prepend `path` when reading
    file_df <- read.csv(paste0(path, file), header=T, sep="\t")

    # Normalize $depth by mean
    file_df[sprintf("depth.%s.norm", file)] <- round(file_df[["depth"]]/mean(file_df[["depth"]]), digits=2)

    # Merge both files by "start" position
    merged <- merge(file_df, s_ref, by="start", suffixes=c(sprintf(".%s", file), ".ref"), all=TRUE)

    # Round "log2" column
    log2_col_name = sprintf("log2.%s", file)
    merged[log2_col_name] <- round(merged[[log2_col_name]], digits=1)

    # re-calculate "cn" based on log2 correction
    merged["cn"] <- round(2*(2^(merged[[log2_col_name]])))

    # Subset file with all "cn" values that are not 2
    alt_cn <- subset(merged, merged[["cn"]] != 2)

    # Create new data with columns of interest
    alt_cns <- as.data.frame(alt_cn[, c(1:9,14,18)])

    # Re-order columns for better view
    alt_cns <- alt_cns[c(2,1,3,4,6,5,8,7,9,10,11)]

    # Calculate ratio between coverages
    alt_cns["depth.ratio.norm"] <- round(alt_cns[[sprintf("depth.%s.norm", file)]] / alt_cns[["depth.ref.norm"]], digits=2)

    alt_cns["depth.ratio"] <- round(alt_cns[[sprintf("depth.%s", file)]] / alt_cns[["depth.ref"]], digits=2)

    # Label each region as a deletion (cn < 2) or an amplification (cn > 2)
    alt_cns["SV_type"] <- ifelse(alt_cns$cn < 2, "DEL", "AMP")

    # Convert "alt.cns" to .bed file
    full <- alt_cns[c(1,2,3,14,5,4,6,7,8,9,10,11,12,13)]
    names(full)[1] <- "#Chrom"
    names(full)[2] <- "Start"
    names(full)[3] <- "End"
    names(full)[4] <- "SV_type"
    names(full)[6] <- "gene"
    names(full)[7] <- "log2"

    full["weight"] <- round(full[["weight"]], digits = 2)
    full <- full[order(full$"#Chrom"),]

    # Save "full" as .bed file
    output_file = sprintf("/path/%s.bed", file)
    write.table(full, file=output_file, row.names=F, col.names=T, sep="\t", dec=",")

}

print(files)
for (file in files) {
    mod_CNV(file)
}
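As a side note, the assign() loop at the top creates one global variable per file, which the rest of the script never uses. A possible alternative, sketched here with made-up files in a temporary directory (the `S1`/`S2` names and contents are hypothetical), is to read everything into one named list:

```r
# Create two small tab-separated files in a temporary directory so the
# sketch runs on its own (the names and contents are made up)
tmp <- file.path(tempdir(), "call_files_demo")
dir.create(tmp, showWarnings = FALSE)
for (id in c("S1", "S2")) {
    write.table(data.frame(start = c(100, 200), depth = c(10, 20)),
                file = file.path(tmp, paste0(id, ".final.call.cnr")),
                sep = "\t", row.names = FALSE, quote = FALSE)
}

# full.names = TRUE returns paths that can be passed straight to read.csv()
files <- list.files(tmp, pattern = "\\.final\\.call\\.cnr$", full.names = TRUE)

# Read every file into one list, named by the part before the first dot
samples <- lapply(files, read.csv, header = TRUE, sep = "\t")
names(samples) <- sub("\\..*$", "", basename(files))

# Each element is an ordinary data frame, looked up by sample name
for (id in names(samples)) {
    cat(id, ":", nrow(samples[[id]]), "rows\n")
}
```

A list keeps the samples together, so the same function can be applied to each element without building variable names by hand.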
