
read large csv files in R inside for loop

To speed it up I'm setting colClasses; my readfile function looks like the following:

readfile = function(name, save = 0, rand = 1)
{
        # Read a few rows first to infer column classes, then pass them
        # to the full read for speed.
        tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
        classes <- sapply(tab5rows, class)
        data <- read.table(pipe(paste("cat", name, "| ./myscript.sh", rand)),
                           header = TRUE, colClasses = classes, sep = ",")
        if (save == 1)
        {
                out <- paste(name, "Rdata", sep = ".")  # was paste(file, ...); 'file' is not defined here
                save(data, file = out)
        }
        else
        {
                data
        }
}

contents of myscript.sh:

#!/bin/sh
awk -v prob="$1" 'BEGIN {srand()} {if(NR==1)print $0; else if(rand() < prob) print $0;}'
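The sampling logic can be sanity-checked at the shell with a throwaway file (sample.csv and its contents are made up here): prob=1 keeps every row, prob=0 keeps only the header.

```shell
# Build a tiny CSV: one header row and two data rows.
printf 'h1,h2\n1,2\n3,4\n' > sample.csv

# rand() is always < 1, so every data row passes: prints all three lines.
awk -v prob=1 'BEGIN {srand()} NR==1 || rand() < prob' sample.csv

# rand() is never < 0, so only the header survives: prints "h1,h2".
awk -v prob=0 'BEGIN {srand()} NR==1 || rand() < prob' sample.csv
```

The condensed pattern `NR==1 || rand() < prob` is equivalent to the if/else in myscript.sh, since awk's default action for a true pattern is to print the record.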

As an extension to this, I needed to read the file incrementally. Say the file had 10 lines at 10 am and 100 lines at 11 am; I needed those newly added 90 lines plus the header (without which further R processing would not be possible). I changed the readfile function to use the command:

data <- read.table(pipe(paste("(head -n1 && tail -n", skip, ") <", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",")

Here skip gives the number of lines to be tailed (calculated by some other script; let's say I have these already). I call this function readfileIncrementally.
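The header-plus-tail trick relies on head leaving the file offset just past line 1 when its input is a seekable file (true for GNU coreutils). It can be checked at the shell; growing.csv and skip=3 are made up for illustration:

```shell
# Simulate a file that grew by 3 rows since the last read.
printf 'id,val\n1,a\n2,b\n3,c\n4,d\n5,e\n' > growing.csv
skip=3   # number of newly appended lines, assumed computed elsewhere

# Header plus the last $skip lines:
(head -n1 && tail -n "$skip") < growing.csv
# prints:
# id,val
# 3,c
# 4,d
# 5,e
```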

a, b, c, d are CSV files, each with 18 columns. Now I run this inside a for loop, say for i in a b c d.

The four files have different values of skip: say skip = 10,000 for a, 20,000 for b. If I run these individually (not in the for loop), it runs fine. But inside the loop it gives me the error in scan: line "n" does not have 18 columns. Usually this happens when the skip value is greater than about 3,000.

However, I cross-checked the number of columns using the command awk -F "," 'NF != 18' ./a.csv; the file surely has 18 columns.
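For reference, that check prints only the rows whose field count differs from the expected number. On a toy file with 3 expected columns (bad.csv, fabricated here) it flags the short row:

```shell
# Row 3 is missing its third field; awk prints only that row.
printf 'c1,c2,c3\n1,2,3\n4,5\n' > bad.csv
awk -F "," 'NF != 3' bad.csv
# prints: 4,5
```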

It looks like a timing issue to me. Is there any way to give R the required amount of time before going on to the next file? Or is there anything I'm missing? When run individually it works fine (though it takes a few seconds).

data <- read.table(pipe(paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1, ") <", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",")

worked for me. Basically, the last line was not getting written completely by the time R read the file, hence the error that line number n didn't have 18 columns. Making it read one line less works fine for me.
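The race can be reproduced at the shell: partial.csv (made up here) ends in a row the writer hasn't finished, and the extra head -n $((skip - 1)) drops it:

```shell
# The last row "3," is incomplete: the writer was caught mid-line.
printf 'id,val\n1,a\n2,b\n3,' > partial.csv
skip=3

# Original pipeline forwards the truncated row "3,":
(head -n1 && tail -n "$skip") < partial.csv
echo   # newline after the unterminated row, for readability only

# Fixed pipeline reads one line less, skipping the possibly partial row:
(head -n1 && tail -n "$skip" | head -n $((skip - 1))) < partial.csv
# prints:
# id,val
# 1,a
# 2,b
```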

Apart from this, I didn't find any R feature to overcome such scenarios.
