简体   繁体   中英

How can I cut large csv files using any R packages like ff or data.table?

I want to cut large csv files (file size more than RAM size) and use them or save each in disk for later usage. Which R package is best for doing this for large files?

I haven't tried but using skip and nrows parameters in read.table or read.csv is worth a try. These are from ?read.table

skip integer: the number of lines of the data file to skip before beginning to read data.

nrows integer: the maximum number of rows to read in. Negative and other invalid values are ignored.

To avoid some troublesome issues at the end you need to do some error handling. In other words I don't know what happpens when skip value is greater than the number of rows in your big csv.

ps I also don't know whether header=TRUE is affecting skip or not, you also have to check that.

The answer given bu @berkorbay is OK and I can confirm that header can be used with skip . However, if your file is really large it gets painfully slow, as each subsequent reading after the first must skip over all previously read lines.

I had to do something similar and, after wasting quite a bit of time, I wrote a short script in PERL which fragments the original file in chuncks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:

#!/usr/bin/perl
system("cls");
print("Fragment .csv file keeping header in each chunk\n") ;

print("\nEnter input file name  = ") ;
$entrada = <STDIN> ;
print("\nEnter maximum number of lines in each fragment = ") ;
$nlineas = <STDIN> ;
print("\nEnter output file name stem   = ") ;
$salida = <STDIN> ;
chop($salida) ;
open(IN,$entrada)    || die "Cannot open input file: $!\n" ;

$cabecera  = <IN> ;
$leidas    = 0  ;
$fragmento = 1  ;
$fichero   = $salida.$fragmento ;
open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
while(<IN>) {
    if ($leidas > $nlineas) {
    close(OUT) ;
    $fragmento++ ;
    $fichero   = $salida.$fragmento ;
    open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
    print OUT $cabecera ;
    $leidas = 0;
    }
    $leidas++ ;
    print OUT $_ ;
}
close(OUT) ;

Just save with whatever name and execute. The first line might have to be changed if you have PERL in a diferent place (an, if you are on Windows, you migh have to invoke the script as "perl name-of-script").

One should have used read.csv.ffdf of ff package with specific parameters like this to read big file:

library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)

Once big file is read into a ff object, Subsetting ffobject into data frames can be done using: a[1000:1000000,]

Rest of the code for subsetting and saving broken dataframes totalrows = dim(a)[1] row.size = as.integer(object.size(a[1:10000,])) / 10000 #in bytes

block.size = 200000000  #in bytes .IN Mbs 200 Mb

#rows.block is rows per block
rows.block = ceiling(block.size/row.size)

#nmaps is the number of chunks/maps of big dataframe(ff), nmaps = number of maps - 1
nmaps = floor(totalrows/rows.block)


for(i in (0:nmaps)){
  if(i==nmaps){
    df = a[(i*rows.block+1) : totalrows,]
  }
  else{
    df = a[(i*rows.block+1) : ((i+1)*rows.block),]
  }
  #process df or save it
  write.csv(df,paste0("M",i+1,".csv"))
  #remove df
  rm(df)
}

Alternatively you can first read the files into mysql using dbWriteTable and then use read.dbi.ffdf function from the ETLUtils package to read it back to R. Consider the function below;

 read.csv.sql.ffdf <- function(file, name,overwrite = TRUE, header = TRUE, drv = MySQL(), dbname = "new", username = "root",host='localhost', password = "1234"){ conn = dbConnect(drv, user = username, password = password, host = host, dbname = dbname) dbWriteTable(conn, name, file, header = header, overwrite = overwrite) on.exit(dbRemoveTable(conn, name)) command = paste0("select * from ", name) ret = read.dbi.ffdf(command, dbConnect.args = list(drv =drv, dbname = dbname, username = username, password = password)) return(ret) } 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM