How can I cut large csv files using any R packages like ff or data.table?

I want to cut large csv files (file size larger than RAM) into pieces and either use them directly or save each piece to disk for later use. Which R package is best for doing this with large files?

I haven't tried it, but using the skip and nrows parameters in read.table or read.csv is worth a try. These are from ?read.table:

skip integer: the number of lines of the data file to skip before beginning to read data.

nrows integer: the maximum number of rows to read in. Negative and other invalid values are ignored.

To avoid some troublesome issues at the end you need to do some error handling. In other words, I don't know what happens when the skip value is greater than the number of rows in your big csv.

PS: I also don't know whether header=TRUE affects skip or not; you have to check that as well.
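A minimal sketch of that idea (untested; the file name big.csv and the chunk size of one million rows are assumptions): read the header once, then loop with skip/nrows and stop when nothing is left.

chunk.size <- 1e6                                  # rows per chunk (assumption)
header <- names(read.csv("big.csv", nrows = 1))    # read the column names once
skip <- 1                                          # then always skip the header line
repeat {
  chunk <- tryCatch(
    read.csv("big.csv", header = FALSE, skip = skip,
             nrows = chunk.size, col.names = header),
    error = function(e) NULL)                      # skipping past EOF raises an error
  if (is.null(chunk) || nrow(chunk) == 0) break
  # process or save the chunk here, e.g. write.csv(chunk, paste0("chunk_", skip, ".csv"))
  skip <- skip + nrow(chunk)
  if (nrow(chunk) < chunk.size) break              # last, partial chunk
}

Keep in mind that column classes may be guessed differently from chunk to chunk unless colClasses is passed explicitly.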

The answer given by @berkorbay is OK and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent read after the first must skip over all previously read lines.

I had to do something similar and, after wasting quite a bit of time, I wrote a short Perl script which fragments the original file into chunks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:

#!/usr/bin/perl
# Fragment a .csv file into chunks, repeating the header line in each chunk.
system("cls");
print("Fragment .csv file keeping header in each chunk\n");

print("\nEnter input file name  = ");
chomp($entrada = <STDIN>);            # input file name
print("\nEnter maximum number of lines in each fragment = ");
chomp($nlineas = <STDIN>);            # data lines per fragment
print("\nEnter output file name stem   = ");
chomp($salida = <STDIN>);             # output file name stem
open(IN, $entrada)    || die "Cannot open input file: $!\n";

$cabecera  = <IN>;                    # header line, repeated in every fragment
$leidas    = 0;                       # data lines written to the current fragment
$fragmento = 1;                       # fragment counter
$fichero   = $salida.$fragmento;      # current output file name
open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
print OUT $cabecera;
while (<IN>) {
    if ($leidas >= $nlineas) {        # current fragment is full: start a new one
        close(OUT);
        $fragmento++;
        $fichero = $salida.$fragmento;
        open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
        print OUT $cabecera;
        $leidas = 0;
    }
    $leidas++;
    print OUT $_;
}
close(OUT);

Just save it under whatever name and execute it. The first line might have to be changed if you have Perl in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
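For example, if the script is run with an output stem of part (a name chosen here just for illustration), it produces part1, part2, ... and the fragments can then be read back one after the other in R roughly like this:

i <- 1
repeat {
  fragment.file <- paste0("part", i)            # part1, part2, ... from the script above
  if (!file.exists(fragment.file)) break
  df <- read.csv(fragment.file, header = TRUE)  # each fragment carries the header line
  # process or save df here
  rm(df)
  i <- i + 1
}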

One can use read.csv.ffdf from the ff package with specific parameters like this to read a big file:

library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)

Once the big file is read into an ff object, subsetting the ff object into data frames can be done using a[1000:1000000,].

The rest of the code, for subsetting the ff object and saving the pieces as data frames:

totalrows = dim(a)[1]
row.size  = as.integer(object.size(a[1:10000,])) / 10000   #size of one row in bytes

block.size = 200000000  #target chunk size in bytes (200 MB)

#rows.block is rows per block
rows.block = ceiling(block.size/row.size)

#the big ff data frame is split into nmaps + 1 chunks/maps; the loop below runs from 0 to nmaps
#(ceiling(...) - 1 rather than floor(...) avoids an invalid last chunk when totalrows is an exact multiple of rows.block)
nmaps = ceiling(totalrows/rows.block) - 1


for(i in (0:nmaps)){
  if(i==nmaps){
    df = a[(i*rows.block+1) : totalrows,]
  }
  else{
    df = a[(i*rows.block+1) : ((i+1)*rows.block),]
  }
  #process df or save it
  write.csv(df,paste0("M",i+1,".csv"))
  #remove df
  rm(df)
}

Alternatively, you can first read the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:

read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE,
                              drv = MySQL(), dbname = "new", username = "root",
                              host = 'localhost', password = "1234") {
  conn = dbConnect(drv, user = username, password = password,
                   host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command = paste0("select * from ", name)
  ret = read.dbi.ffdf(command,
                      dbConnect.args = list(drv = drv, dbname = dbname,
                                            username = username, password = password))
  return(ret)
}
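A hypothetical usage sketch, assuming the DBI, RMySQL, ETLUtils and ff packages are installed and a local MySQL server with a database called "new" accepts the credentials hard-coded above:

library(DBI)
library(RMySQL)
library(ETLUtils)
library(ff)

# Loads big.csv into a temporary MySQL table, reads it back as an ffdf
# object, and drops the table when the function exits.
big <- read.csv.sql.ffdf(file = "big.csv", name = "big_table")
dim(big)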
