
Handling large datasets in R

I'm working with some relatively large datasets (5 files of 2 GB each; to give you an order of magnitude, one of the tables is 1.5M rows x 270 columns), where I use dplyr's left_join function (between these datasets and other small tables). The tables contain string data that I don't want to lose. However, the packages that handle large datasets (like bigmemory or ff) convert the strings to factors and then to numbers, which means that information is lost. Is there a way to manipulate those files (with my 8 GB of RAM) without losing information?
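To make the setup concrete, here is a stripped-down sketch of what I am doing (the file and column names below are placeholders, not my real ones):

library(readr)
library(dplyr)

big   <- read_csv("big_file1.csv")      # one of the large files (~1.5M rows x 270 columns)
small <- read_csv("small_lookup.csv")   # a small reference table

# this is the step that becomes a problem with only 8 GB of RAM on the real files
joined <- left_join(big, small, by = "key")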

I don't understand what you mean when you say that the information is lost when using factors. For example, say that str is one of your string columns; you can do

str <- sample(sample(letters, replace = TRUE),   # 1.5M random single-letter strings
              size = 1.5e6, replace = TRUE)
tab.str <- sort(unique(str))     # lookup table of distinct strings (could use `letters`)
str.int <- match(str, tab.str)   # integer codes: index of each string in the lookup table
all.equal(tab.str[str.int], str) # TRUE: the original strings are fully recoverable

So, basically you have integers that are the indices of a lookup table to get back your strings.

However, if you use the big.matrix format, you won't be able to use dplyr, but I think it would be relatively easy to reimplement a left join for your particular case.
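To give an idea (this is only a rough sketch, not a full implementation), a left join on a single key column boils down to a match() against the small table, and it works just as well when the key column of the big table is stored as integer codes. The names big.key, small and "key" are placeholders:

# `big.key` stands for the key column of the big table (possibly integer-encoded);
# `small` is an ordinary data.frame lookup table.
manual_left_join <- function(big.key, small, key.col = "key") {
  idx <- match(big.key, small[[key.col]])  # NA where the big table has no match
  small[idx, , drop = FALSE]               # rows of `small` aligned with the big table
}

# toy example
small <- data.frame(key = 1:3, label = c("a", "b", "c"))
manual_left_join(c(2L, 3L, 3L, 5L, 1L), small)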

Explore data.table for any kind of processing of large datasets in R. Its speed and efficiency are unparalleled compared to any other data-handling package in R.
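For example (a minimal sketch; the file and column names are placeholders), fread() keeps character columns as plain strings by default, so nothing is lost, and the X[Y, on = ...] idiom performs the left join:

library(data.table)

big   <- fread("big_file1.csv")      # fast reader; character columns stay as strings
small <- fread("small_lookup.csv")   # small reference table

# all rows of `big` are kept, matching columns of `small` are added
result <- small[big, on = "key"]
# equivalently: merge(big, small, by = "key", all.x = TRUE)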
