
Read large CSV files in R inside a for loop

To speed things up I'm setting colClasses; my readfile function looks like the following:

readfile = function(name, save = 0, rand = 1)
{
        # Infer column classes from the first few rows, then reuse them
        # so the full read is faster
        tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
        classes  <- sapply(tab5rows, class)

        # Pipe the file through the sampling script before reading it
        data <- read.table(pipe(paste("cat", name, "| ./myscript.sh", rand)),
                           header = TRUE, colClasses = classes, sep = ",")

        if (save == 1)
        {
                out <- paste(name, "Rdata", sep = ".")   # was paste(file, ...), which errors since 'file' is not defined here
                save(data, file = out)
        }
        else
        {
                data
        }
}
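For reference, a call might look like this (the file name and sampling probability are made-up values):

# Read roughly 10% of the rows of a.csv (plus the header) into a data frame
df <- readfile("a.csv", save = 0, rand = 0.1)

# Or save the sampled data to a.csv.Rdata instead of returning it
readfile("a.csv", save = 1, rand = 0.1)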

Contents of myscript.sh:

#!/bin/sh
# Always print the header line (NR==1); print every other line with probability "prob"
awk -v prob="$1" 'BEGIN {srand()} {if(NR==1) print $0; else if(rand() < prob) print $0;}'

As an extension to this, I needed to read the file incrementally. Say the file had 10 lines at 10 am and 100 lines at 11 am; I needed those newly added 90 lines plus the header (without which I would not be able to do any further processing in R). I changed the readfile function to use the command data <- read.table(pipe(paste("(head -n1 && tail -n",skip,")<",name,"| ./myscript.sh",rand)), header = TRUE, colClasses = classes,sep=","), where skip gives the number of lines to be tailed (calculated by some other script; let's say I already have these values). I call this function readfileIncrementally.
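For concreteness, here is a minimal sketch of what readfileIncrementally looks like, assuming the same argument names as readfile and that skip is supplied by the caller:

readfileIncrementally = function(name, skip, save = 0, rand = 1)
{
        # Infer column classes from the first few rows, as in readfile
        tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
        classes  <- sapply(tab5rows, class)

        # Header plus the last 'skip' lines, piped through the sampling script
        cmd  <- paste("(head -n1 && tail -n", skip, ") <", name, "| ./myscript.sh", rand)
        data <- read.table(pipe(cmd), header = TRUE, colClasses = classes, sep = ",")

        if (save == 1)
        {
                save(data, file = paste(name, "Rdata", sep = "."))
        }
        else
        {
                data
        }
}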

a, b, c, d are CSV files, each with 18 columns. Now I run this inside a for loop, say for i in a, b, c, d.

a, b, c, d are 4 files with different values of skip; let's say skip = 10,000 for a and 20,000 for b. If I run these individually (not in the for loop), they run fine. But inside the loop I get the error in scan: line "n" does not have 18 columns. Usually this happens when the skip value is greater than roughly 3,000.
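For illustration, the loop has roughly this shape (file names and skip values are placeholders; the real skip values come from the other script, and readfileIncrementally is as sketched above):

files <- c("a.csv", "b.csv", "c.csv", "d.csv")
skips <- c(10000, 20000, 15000, 5000)      # hypothetical values

results <- list()
for (i in seq_along(files))
{
        results[[files[i]]] <- readfileIncrementally(files[i], skip = skips[i], rand = 1)
}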

However, I cross-checked the number of columns using the command awk -F "," 'NF != 18' ./a.csv, and the file definitely has 18 columns on every line.
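The same check can also be done from the R side with base R's count.fields, which may help show whether the file looks different at the moment R reads it (the file name is a placeholder):

# Distribution of fields per line as R sees the file; anything other than 18 is suspect
table(count.fields("a.csv", sep = ","))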

It looks like a timing issue to me. Is there any way to give R the required amount of time before moving on to the next file? Or is there something I'm missing? When run individually, each file reads fine (though it takes a few seconds).

data <- read.table(pipe(paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1, ") <", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",") worked for me. Basically, the last line was not completely written by the time R was reading the file, hence the error that line n didn't have 18 columns. Making it read 1 line less works fine for me.
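Put more readably, the only change relative to the command built in readfileIncrementally (as sketched above) is the extra head that drops the last, possibly half-written line of the tail output; with the same variable names:

# Header, then the last 'skip' lines minus the final (possibly incomplete) one,
# still piped through the sampling script
cmd  <- paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1, ") <",
              name, "| ./myscript.sh", rand)
data <- read.table(pipe(cmd), header = TRUE, colClasses = classes, sep = ",")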

Apart from this, I didn't find any R feature for handling such scenarios.
