Combine and transpose many fixed-format dataset files quickly

What I have: ~100 txt files, each with 9 columns and >100,000 rows. What I want: a combined file containing only 2 of the columns but all of the rows, which should then be transposed for an output of >100,000 columns and 2 rows.

I've created the function below to go systematically through the files in a folder, pull the data I want, and then, after each file, join the result to the original template.

Problem: this works fine on my small test files, but when I try it on large files I run into a memory-allocation issue. My 8 GB of RAM just isn't enough, and I assume part of that is down to how I wrote my code.

My question: is there a way to loop through the files and then join them all at once at the end, to save processing time?

Also, if this is the wrong place for this kind of thing, what is a better forum to get input on work-in-progress (WIP) code?

## Script to pull in genotype txt files, transpose them, delete commented rows
## and header rows, and then put the files together.

library(plyr)

## Define function
Process_Combine_Genotype_Files <- function(
        inputdirectory = "Rdocs/test", outputdirectory = "Rdocs/test", 
        template = "Rdocs/test/template.txt",
        filetype = ".txt", vars = ""
        ){

## List the files in the directory & put together their path
        filenames <- list.files(path = inputdirectory, pattern = "\\.txt$")  # pattern is a regex, not a glob
        path <- paste(inputdirectory,filenames, sep="/")


        combined_data <- read.table(template,header=TRUE, sep="\t")

## for-loop: for every file in directory, do the following
        for (file in path){

## Read genotype txt file as a data.frame
                ## Extract just the file name (used below as the column label)
                currentfilename <- basename(file)

                data  <- read.table(file, header=TRUE, sep="\t", fill=TRUE)

                #subset just the first two columns (Probe ID & Call Codes)
                #will need to modify this for Genotype calls....
                data.calls  <- data[,1:2]

                #Change column names & row names
                colnames(data.calls)  <- c("Probe.ID", currentfilename)
                row.names(data.calls) <- data[,1]


## Join file to previous data.frame
                combined_data <- join(combined_data,data.calls,type="full")


## End for loop
        }
## Transpose the combined data & write it out
        combined_transposed_data <- t(combined_data)
        print(combined_transposed_data[-1, -1])
        outputfile <- paste(outputdirectory, "Genotypes_combined.txt", sep = "/")
        write.table(combined_transposed_data[-1, -1], outputfile, sep = "\t")

## End function
}

Thanks in advance.

Try:

filenames <- list.files(path = inputdirectory, pattern = "\\.txt$", full.names = TRUE)
require(data.table)
data_list <- lapply(filenames, fread, select = c(1, 2))  # select only the columns you want to keep, e.g. the first two

Now you have a list of all your data. Assuming all the txt files have the same column structure, you can combine them via:

data <- rbindlist(data_list)

Transposing the data:

t(data)

(Thanks to @Jakob H for the select argument in fread.)
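Putting the pieces together, a minimal end-to-end sketch (assuming the files sit in Rdocs/test as in the question, that the first two columns are the ones to keep, and adding a hypothetical source column to record which file each row came from):

library(data.table)

## full.names = TRUE so fread() receives paths it can open from any working directory
filenames <- list.files(path = "Rdocs/test", pattern = "\\.txt$", full.names = TRUE)

## read only the first two columns (Probe ID & Call Codes) of each file
data_list <- lapply(filenames, fread, select = c(1, 2))

## stack all files into one long table, tagging each row with the file it came from
data <- rbindlist(data_list, idcol = "source")
data[, source := basename(filenames)[source]]

## transpose and write out
write.table(t(data), "Rdocs/test/Genotypes_combined.txt", sep = "\t", col.names = FALSE)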

If speed/working memory is the concern, then I would recommend using Unix tools to do the merging. In general, Unix is faster than R. Further, Unix does not require that all of the information be loaded into RAM; rather, it reads the information in chunks. Consequently, this approach is never memory bound. If you don't know Unix but plan to manipulate large files frequently in the future, then learn Unix: it is simple to learn and very powerful. I will do an example with csv files.

Generating CSV files in R

## write ten test files, each 100,000 rows x 10 columns of Poisson counts
for (i in 1:10){
  write.csv(matrix(rpois(1e5*10,1),1e5,10), paste0('test',i,'.csv'))
}

In the Shell (i.e. on a Mac) / Terminal (i.e. on a Linux box) / Cygwin (i.e. on Windows):

cut -f 2,3 -d , test1.csv > final.csv          # keep columns 2 and 3 from test1.csv (with its header)
for f in test[2-9].csv test10.csv; do          # [2-9] is a range; [2,9] would match only test2 and test9
  cut -f 2,3 -d , "$f" | sed 1d >> final.csv   # strip each file's header row before appending
done

Notice that if you have installed Rtools, you can run all these Unix commands from R with the system() function.
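For example (a sketch only; it assumes cut and sed are on the PATH, and on Windows you may need shell() rather than system()):

## first file keeps its header row
system("cut -f 2,3 -d , test1.csv > final.csv")

## remaining files: same columns, header stripped before appending
for (i in 2:10) {
  system(paste0("cut -f 2,3 -d , test", i, ".csv | sed 1d >> final.csv"))
}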

To transpose, read final.csv back into R and transpose it there.
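A minimal sketch of that final step (the output file name final_transposed.csv is just an example):

library(data.table)
final <- fread("final.csv")       # fast read of the merged file
final_t <- t(final)               # rows become columns
write.table(final_t, "final_transposed.csv", sep = ",", col.names = FALSE)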

UPDATE:

I timed the above code. It took 0.4 seconds to run. Consequently, doing this for 100 files rather than just 10 will likely take about 4 seconds. I have not timed the R code; it may be that the Unix and R programs have similar performance when there are only 10 files, but with 100+ files your computer will likely become memory bound and R will likely crash.
