How can I cut large csv files using any R packages like ff or data.table?

I want to cut large csv files (file size larger than RAM) into pieces and either use them directly or save each piece to disk for later use. Which R package is best for doing this with large files?

I haven't tried it, but using the skip and nrows parameters in read.table or read.csv is worth a try. These are from ?read.table:

skip integer: the number of lines of the data file to skip before beginning to read data.

nrows integer: the maximum number of rows to read in. Negative and other invalid values are ignored.

To avoid some troublesome issues at the end you need to do some error handling. In other words, I don't know what happens when the skip value is greater than the number of rows in your big csv.

PS: I also don't know whether header=TRUE affects skip or not; you have to check that as well.
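A minimal sketch of that idea (untested; the file name big.csv and the chunk size of one million rows are assumptions): read the header once, then loop with skip/nrows and stop when nothing is left.

chunk.size <- 1e6                                  # rows per chunk (assumption)
header <- names(read.csv("big.csv", nrows = 1))    # read the column names once
skip <- 1                                          # then always skip the header line
repeat {
  chunk <- tryCatch(
    read.csv("big.csv", header = FALSE, skip = skip,
             nrows = chunk.size, col.names = header),
    error = function(e) NULL)                      # skipping past EOF raises an error
  if (is.null(chunk) || nrow(chunk) == 0) break
  # process or save the chunk here, e.g. write.csv(chunk, paste0("chunk_", skip, ".csv"))
  skip <- skip + nrow(chunk)
  if (nrow(chunk) < chunk.size) break              # last, partial chunk
}

Keep in mind that column classes may be guessed differently from chunk to chunk unless colClasses is passed explicitly.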

The answer given by @berkorbay is OK and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent read after the first must skip over all previously read lines.

I had to do something similar and, after wasting quite a bit of time, I wrote a short Perl script which fragments the original file into chunks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:

#!/usr/bin/perl
# Fragment a .csv file into chunks, repeating the header line in each chunk.
system("cls");
print("Fragment .csv file keeping header in each chunk\n");

print("\nEnter input file name  = ");
chomp($entrada = <STDIN>);            # input file name
print("\nEnter maximum number of lines in each fragment = ");
chomp($nlineas = <STDIN>);            # data lines per fragment
print("\nEnter output file name stem   = ");
chomp($salida = <STDIN>);             # output file name stem
open(IN, $entrada)    || die "Cannot open input file: $!\n";

$cabecera  = <IN>;                    # header line, repeated in every fragment
$leidas    = 0;                       # data lines written to the current fragment
$fragmento = 1;                       # fragment counter
$fichero   = $salida.$fragmento;      # current output file name
open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
print OUT $cabecera;
while (<IN>) {
    if ($leidas >= $nlineas) {        # current fragment is full: start a new one
        close(OUT);
        $fragmento++;
        $fichero = $salida.$fragmento;
        open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
        print OUT $cabecera;
        $leidas = 0;
    }
    $leidas++;
    print OUT $_;
}
close(OUT);

Just save it under whatever name and execute it. The first line might have to be changed if you have Perl in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
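For example, if the script is run with an output stem of part (a name chosen here just for illustration), it produces part1, part2, ... and the fragments can then be read back one after the other in R roughly like this:

i <- 1
repeat {
  fragment.file <- paste0("part", i)            # part1, part2, ... from the script above
  if (!file.exists(fragment.file)) break
  df <- read.csv(fragment.file, header = TRUE)  # each fragment carries the header line
  # process or save df here
  rm(df)
  i <- i + 1
}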

One can use read.csv.ffdf from the ff package with specific parameters like this to read a big file:

library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)

Once the big file is read into an ff object, subsetting the ff object into data frames can be done using a[1000:1000000,].

The rest of the code, for subsetting the ff object and saving the pieces as data frames:

totalrows = dim(a)[1]
row.size  = as.integer(object.size(a[1:10000,])) / 10000   #size of one row in bytes

block.size = 200000000  #target chunk size in bytes (200 MB)

#rows.block is rows per block
rows.block = ceiling(block.size/row.size)

#the big ff data frame is split into nmaps + 1 chunks/maps; the loop below runs from 0 to nmaps
#(ceiling(...) - 1 rather than floor(...) avoids an invalid last chunk when totalrows is an exact multiple of rows.block)
nmaps = ceiling(totalrows/rows.block) - 1


for(i in (0:nmaps)){
  if(i==nmaps){
    df = a[(i*rows.block+1) : totalrows,]
  }
  else{
    df = a[(i*rows.block+1) : ((i+1)*rows.block),]
  }
  #process df or save it
  write.csv(df,paste0("M",i+1,".csv"))
  #remove df
  rm(df)
}

Alternatively, you can first read the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:

read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE,
                              drv = MySQL(), dbname = "new", username = "root",
                              host = 'localhost', password = "1234") {
  conn = dbConnect(drv, user = username, password = password,
                   host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command = paste0("select * from ", name)
  ret = read.dbi.ffdf(command,
                      dbConnect.args = list(drv = drv, dbname = dbname,
                                            username = username, password = password))
  return(ret)
}
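A hypothetical usage sketch, assuming the DBI, RMySQL, ETLUtils and ff packages are installed and a local MySQL server with a database called "new" accepts the credentials hard-coded above:

library(DBI)
library(RMySQL)
library(ETLUtils)
library(ff)

# Loads big.csv into a temporary MySQL table, reads it back as an ffdf
# object, and drops the table when the function exits.
big <- read.csv.sql.ffdf(file = "big.csv", name = "big_table")
dim(big)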
