R looping through 20 million rows

Question

I have a .txt file called Sales_2015 that holds almost 1 GB of data. The file has the following columns:

AREA|WEEKNUMBER|ITEM|STORE_NO|SALES|UNITS_SOLD
10GUD| W01_2015 |0345| 023234 |1200 | 12

The file's colClasses are: c(rep("character",4),rep("numeric",2))

What I want to do is split the 1 GB file into pieces so that each piece is faster to read. The number of .txt files I want to end up with is defined by the number of AREAs I have (AREA is the first column).

So I have the following variables:

Sales <- read.table(paste(RUTAC,"/Sales_2015.txt",sep=""),sep="|",header=TRUE, quote="",comment.char="",colClasses=c(rep("character",4),rep("numeric",2)))

Areas <- c("10GUD","10CLJ","10DZV",..................) #There are 52 elements

I want to end up with 52 .txt files named, for instance:

2015_10GUD.txt (which will include only the complete rows from the 1 GB file that have 10GUD in the AREA column)

2015_10CLJ.txt (which will include only the complete rows from the 1 GB file that have 10CLJ in the AREA column)

I know this question is very similar to others, but the difference is that I am working with up to 20 million rows. Can anybody help me get this done with some sort of loop, such as repeat, or with something else?

Answer 1

No need to use a loop. The simplest and fastest way to do this is probably with data.table. I strongly recommend the development version, data.table 1.9.7, so you can use the super fast fwrite function to write .csv files (fwrite has been part of the CRAN release since 1.9.8). See the data.table installation instructions for how to get it.

library(data.table)
# Write one .csv per AREA; .BY$AREA is the current group's AREA value.
setDT(Sales_2015)[, fwrite(.SD, paste0("Sales_2015_", .BY$AREA, ".csv")),
                  by = AREA, .SDcols = names(Sales_2015)]
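
If installing data.table is not an option, the same split can be sketched in base R with split() and write.table(). This is only a minimal illustration under assumptions: the toy rows and the out_dir folder are made up, and the file names follow the 2015_<AREA>.txt pattern from the question. On 20 million rows it will likely be noticeably slower than fwrite.

```r
# Toy stand-in for the real Sales table (values are made up for illustration).
Sales <- data.frame(
  AREA       = c("10GUD", "10CLJ", "10GUD"),
  WEEKNUMBER = c("W01_2015", "W01_2015", "W02_2015"),
  ITEM       = c("0345", "0345", "0012"),
  STORE_NO   = c("023234", "023111", "023234"),
  SALES      = c(1200, 800, 450),
  UNITS_SOLD = c(12, 8, 5),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()  # hypothetical output folder; point this at your own path

# split() returns one data frame per AREA, named by the AREA value.
pieces <- split(Sales, Sales$AREA)

# Write each piece to "2015_<AREA>.txt", pipe-delimited like the source file.
for (area in names(pieces)) {
  write.table(pieces[[area]],
              file = file.path(out_dir, paste0("2015_", area, ".txt")),
              sep = "|", quote = FALSE, row.names = FALSE)
}
```

The loop runs once per area (52 times for the real data), not once per row, so the looping itself costs very little; most of the time goes into writing the files.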

Also, I would recommend reading your data with fread from data.table, which is way faster than read.table:

Sales_2015 <- fread("C:/address to your file/Sales_2015.txt")
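
One thing to watch with fread: ITEM and STORE_NO in the sample row have leading zeros ("0345", "023234"), and automatic type detection may read such columns as numbers and drop the zeros. A minimal sketch, assuming the column classes given in the question and a small made-up file in place of the real one, passes sep and colClasses explicitly:

```r
library(data.table)

# Toy pipe-delimited file standing in for Sales_2015.txt (made-up rows).
path <- tempfile(fileext = ".txt")
writeLines(c("AREA|WEEKNUMBER|ITEM|STORE_NO|SALES|UNITS_SOLD",
             "10GUD|W01_2015|0345|023234|1200|12",
             "10CLJ|W01_2015|0012|023111|800|8"),
           path)

# Same column classes as stated in the question: four character, two numeric.
Sales_2015 <- fread(path, sep = "|",
                    colClasses = c(rep("character", 4), rep("numeric", 2)))

# Inspect the column types: ITEM and STORE_NO are read as character.
str(Sales_2015)
```

With the real file, replace path with the location of Sales_2015.txt.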

solution1
5 ACCEPTED 2016-06-19 00:39:38
