
read large csv files in R inside for loop

To speed it up I'm setting colClasses; my readfile function looks like the following:

readfile = function(name, save = 0, rand = 1)
{
        # Read a few rows first to infer column classes, then pass them
        # to the full read for speed.
        tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
        classes <- sapply(tab5rows, class)
        data <- read.table(pipe(paste("cat", name, "| ./myscript.sh", rand)),
                           header = TRUE, colClasses = classes, sep = ",")
        if (save == 1)
        {
                out <- paste(name, "Rdata", sep = ".")  # was paste(file, ...); 'file' is not defined here
                save(data, file = out)
        }
        else
        {
                data
        }
}

contents of myscript.sh:

#!/bin/sh
awk -v prob="$1" 'BEGIN {srand()} {if(NR==1)print $0; else if(rand() < prob) print $0;}'
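The sampling logic can be sanity-checked at the shell with a throwaway file (sample.csv and its contents are made up here): prob=1 keeps every row, prob=0 keeps only the header.

```shell
# Build a tiny CSV: one header row and two data rows.
printf 'h1,h2\n1,2\n3,4\n' > sample.csv

# rand() is always < 1, so every data row passes: prints all three lines.
awk -v prob=1 'BEGIN {srand()} NR==1 || rand() < prob' sample.csv

# rand() is never < 0, so only the header survives: prints "h1,h2".
awk -v prob=0 'BEGIN {srand()} NR==1 || rand() < prob' sample.csv
```

The condensed pattern `NR==1 || rand() < prob` is equivalent to the if/else in myscript.sh, since awk's default action for a true pattern is to print the record.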

As an extension to this, I needed to read the file incrementally. Say the file had 10 lines at 10 am and 100 lines at 11 am; I needed those newly added 90 lines plus the header (without which further R processing would not be possible). I changed the readfile function to use the command:

data <- read.table(pipe(paste("(head -n1 && tail -n", skip, ") <", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",")

Here skip gives the number of lines to be tailed (calculated by some other script; let's say I have these already). I call this function readfileIncrementally.
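The header-plus-tail trick relies on head leaving the file offset just past line 1 when its input is a seekable file (true for GNU coreutils). It can be checked at the shell; growing.csv and skip=3 are made up for illustration:

```shell
# Simulate a file that grew by 3 rows since the last read.
printf 'id,val\n1,a\n2,b\n3,c\n4,d\n5,e\n' > growing.csv
skip=3   # number of newly appended lines, assumed computed elsewhere

# Header plus the last $skip lines:
(head -n1 && tail -n "$skip") < growing.csv
# prints:
# id,val
# 3,c
# 4,d
# 5,e
```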

a, b, c, d are CSV files, each with 18 columns. Now I run this inside a for loop, say for i in a b c d.

The four files have different values of skip: say skip = 10,000 for a, 20,000 for b. If I run these individually (not in the for loop), it runs fine. But inside the loop it gives me the error in scan: line "n" does not have 18 columns. Usually this happens when the skip value is greater than about 3,000.

However, I cross-checked the number of columns using the command awk -F "," 'NF != 18' ./a.csv; the file surely has 18 columns.
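For reference, that check prints only the rows whose field count differs from the expected number. On a toy file with 3 expected columns (bad.csv, fabricated here) it flags the short row:

```shell
# Row 3 is missing its third field; awk prints only that row.
printf 'c1,c2,c3\n1,2,3\n4,5\n' > bad.csv
awk -F "," 'NF != 3' bad.csv
# prints: 4,5
```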

It looks like a timing issue to me. Is there any way to give R the required amount of time before going on to the next file? Or is there anything I'm missing? When run individually it works fine (though it takes a few seconds).

data <- read.table(pipe(paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1, ") <", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",")

worked for me. Basically, the last line was not getting written completely by the time R read the file, hence the error that line number n didn't have 18 columns. Making it read one line less works fine for me.
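The race can be reproduced at the shell: partial.csv (made up here) ends in a row the writer hasn't finished, and the extra head -n $((skip - 1)) drops it:

```shell
# The last row "3," is incomplete: the writer was caught mid-line.
printf 'id,val\n1,a\n2,b\n3,' > partial.csv
skip=3

# Original pipeline forwards the truncated row "3,":
(head -n1 && tail -n "$skip") < partial.csv
echo   # newline after the unterminated row, for readability only

# Fixed pipeline reads one line less, skipping the possibly partial row:
(head -n1 && tail -n "$skip" | head -n $((skip - 1))) < partial.csv
# prints:
# id,val
# 1,a
# 2,b
```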

Apart from this, I didn't find any R feature to overcome such scenarios.
