Reading multiple csv files faster into data.table R

I have 900000 csv files which I want to combine into one big data.table. For this I created a for loop which reads every file one by one and adds it to the data.table. The problem is that it runs too slowly and the time taken grows exponentially. It would be great if someone could help me make the code run faster. Each of the csv files has 300 rows and 15 columns. The code I am using so far:

library(data.table)
setwd("~/My/Folder")

WD="~/My/Folder"
data<-data.table(read.csv(text="X,Field1,PostId,ThreadId,UserId,Timestamp,Upvotes,Downvotes,Flagged,Approved,Deleted,Replies,ReplyTo,Content,Sentiment"))

csv.list<- list.files(WD)
k=1

for (i in csv.list){
  temp.data<-read.csv(i)
  data<-data.table(rbind(data,temp.data))

  if (k %% 100 == 0)
    print(k/length(csv.list))

  k<-k+1
}

Presuming your files are conventional csv, I'd use data.table::fread since it's faster. If you're on a Linux-like OS, I would take advantage of the fact that it accepts shell commands. Presuming your input files are the only csv files in the folder, I'd do:

dt <- fread("tail -n-1 -q ~/My/Folder/*.csv")

You'll need to set the column names manually afterwards.
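
For example, with the 15 column names from the question's header, a setnames() call along these lines should do it (a sketch, assuming every file shares that header):

setnames(dt, c("X", "Field1", "PostId", "ThreadId", "UserId", "Timestamp",
               "Upvotes", "Downvotes", "Flagged", "Approved", "Deleted",
               "Replies", "ReplyTo", "Content", "Sentiment"))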

If you wanted to keep things in R, I'd use lapply and rbindlist:

lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)

You could also use plyr::ldply:

dt <- setDT(ldply(csv.list, fread))

This has the advantage that you can use .progress = "text" to get a readout of reading progress.

All of the above assume that the files all have the same format and have a header row.

Building on Nick Kennedy's answer using plyr::ldply, there is roughly a 50% speed increase from enabling the .parallel option when reading 400 csv files of roughly 30-40 MB each.

Original answer with progress bar

dt <- setDT(ldply(csv.list, fread, .progress="text"))

Enabling .parallel, also with a text progress bar

library(plyr)
library(data.table)
library(doSNOW)

cl <- makeCluster(4)
registerDoSNOW(cl)

pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))

stopCluster(cl)

As suggested by @Repmat, use rbind.fill. As suggested by @Christian Borck, use fread for faster reads.

require(data.table)
require(plyr)

files <- list.files("dir/name")
df <- rbind.fill(lapply(files, fread, header=TRUE))

Alternatively you could use do.call, but rbind.fill is faster (http://www.r-bloggers.com/the-rbinding-race-for-vs-do-call-vs-rbind-fill/):

df <- do.call(rbind, lapply(files, fread, header=TRUE))

Or you could use the data.table package, see this.

You are growing your data table in a for loop - this is why it takes forever. If you want to keep the for loop as is, first create an empty data frame (before the loop) with the dimensions you need (rows x columns), and place it in RAM.

Then write to this empty frame in each iteration.
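
A minimal sketch of that pre-allocation idea, assuming each file really has 300 rows and identical columns (the variable names here are illustrative, not from the original answer):

files  <- list.files("~/My/Folder", pattern = "\\.csv$", full.names = TRUE)
n_rows <- 300                                   # rows per file, as stated in the question

# use the first file to get column names and types, keeping characters as characters
first <- read.csv(files[1], stringsAsFactors = FALSE)
data  <- first[rep(NA_integer_, n_rows * length(files)), ]   # pre-sized frame of NAs
rownames(data) <- NULL

for (k in seq_along(files)) {
  rows <- ((k - 1) * n_rows + 1):(k * n_rows)   # block of rows belonging to file k
  data[rows, ] <- read.csv(files[k], stringsAsFactors = FALSE)
}

Each iteration now fills an existing block of rows instead of growing the table.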

Otherwise use rbind.fill from package plyr - and avoid the loop altogether. To use rbind.fill:

require(plyr)
data <- rbind.fill(df1, df2, df3, ... , dfN)

To pass the names of the df's, you could/should use an apply function.

I go with @Repmat, as your current solution using rbind() copies the whole data.table in memory every time it is called (this is why the time is growing exponentially). Another way would be to create an empty csv file with only the headers first, and then simply append the data of all your files to this csv file.

# inside the loop over csv.list, append each file's rows to one combined csv
write.table(fread(i), file = "your_final_csv_file", sep = ";",
            col.names = FALSE, row.names = FALSE, append = TRUE, quote = FALSE)

This way you don't have to worry about putting the data at the right indexes in your data.table. Also, as a hint: fread() is the data.table file reader, which is much faster than read.csv.
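
Put together, the whole approach might look roughly like this (a sketch; the output file name and the ";" separator are placeholders):

library(data.table)

files <- list.files("~/My/Folder", pattern = "\\.csv$", full.names = TRUE)
out   <- "your_final_csv_file"

# write the header line once, taken from the first file (nrows = 0 reads only the column names)
write.table(fread(files[1], nrows = 0), file = out, sep = ";",
            col.names = TRUE, row.names = FALSE, quote = FALSE)

# append the data rows of every file
for (f in files) {
  write.table(fread(f), file = out, sep = ";",
              col.names = FALSE, row.names = FALSE, append = TRUE, quote = FALSE)
}

# read the combined file back in as one data.table
dt <- fread(out, sep = ";")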

In general, R wouldn't be my first choice for data munging tasks like this.

One suggestion would be to merge them first in groups of 10 or so, then merge those groups, and so on. That has the advantage that if an individual merge fails, you don't lose all the work. The way you are doing it now not only leads to exponentially slowing execution, but exposes you to having to start over from the very beginning every time it fails.

This also decreases the average size of the data frames involved in the rbind calls, since the majority of them will be appended to small data frames, with only a few large ones at the end. This should eliminate the majority of the execution time that is growing exponentially.
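
A minimal sketch of that batched merging with data.table, assuming a group size of 10 as suggested:

library(data.table)

files  <- list.files("~/My/Folder", pattern = "\\.csv$", full.names = TRUE)
groups <- split(files, ceiling(seq_along(files) / 10))    # batches of 10 files each

# merge each batch on its own, then merge the batch results
batches <- lapply(groups, function(g) rbindlist(lapply(g, fread)))
dt <- rbindlist(batches)

If a batch fails, only that batch needs to be re-read.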

I think no matter what you do it is going to be a lot of work.

Some things to consider, under the assumption that you can trust all the input data and that each record is sure to be unique:

  • Consider creating the table being imported into without indexes. As indexes get huge, the time involved in managing them during imports grows -- so it sounds like this may be what is happening. If this is your issue, it would still take a long time to create the indexes later.

  • Alternately, with the amount of data you are discussing, you may want to consider a method of partitioning the data (often done via date ranges). Depending on your database, you may then have individually indexed partitions -- easing index efforts.

  • If your demonstration code doesn't resolve down to a database file import utility, then use such a utility.

  • It may be worth processing the files into larger data sets prior to importing them. You could experiment with this by, for example, combining 100 files into one larger file before loading and comparing the times (see the sketch after this list).
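
For instance, a rough R sketch of pre-combining the files in groups of 100 before a bulk import (the output file names and the group size are placeholders):

library(data.table)

files  <- list.files("~/My/Folder", pattern = "\\.csv$", full.names = TRUE)
groups <- split(files, ceiling(seq_along(files) / 100))   # 100 source files per output

for (g in seq_along(groups)) {
  combined <- rbindlist(lapply(groups[[g]], fread))
  fwrite(combined, sprintf("combined_%04d.csv", g))       # one larger file per group
}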

In the event you can't use partitions (depending on the environment and the experience of the database personnel), you can use a home-brewed method of separating data into various tables. For example, data201401 to data201412. However, you'd have to roll your own utilities to query across boundaries.

While decidedly not a better option, it is something you could do in a pinch -- and it would allow you to retire/expire aged records easily, without having to adjust the related indexes. It would also let you load pre-processed incoming data by "partition" if desired.
