
Handling large datasets in R

I'm working with some relatively large datasets (5 files of 2 GB each; to give you an order of magnitude, one of the tables is 1.5M rows x 270 columns), where I use dplyr's left_join function (between these datasets and other small tables). The tables contain string data that I don't want to lose. However, the packages that handle large datasets (like bigmemory or ff) convert the strings to factors and then to numbers, which means that information is lost. Is there a way to manipulate those files (with my 8 GB of RAM) without losing information?
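To make the setup concrete, here is a stripped-down sketch of what I am doing (the file and column names below are placeholders, not my real ones):

library(readr)
library(dplyr)

big   <- read_csv("big_file1.csv")      # one of the large files (~1.5M rows x 270 columns)
small <- read_csv("small_lookup.csv")   # a small reference table

# this is the step that becomes a problem with only 8 GB of RAM on the real files
joined <- left_join(big, small, by = "key")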

I don't understand what you mean when you say that the information is lost when using factors. For example, say that str is one of your string columns; you can do

str <- sample(sample(letters, replace = TRUE),   # 1.5M random single-letter strings
              size = 1.5e6, replace = TRUE)
tab.str <- sort(unique(str))     # lookup table of distinct strings (could use `letters`)
str.int <- match(str, tab.str)   # integer codes: index of each string in the lookup table
all.equal(tab.str[str.int], str) # TRUE: the original strings are fully recoverable

So, basically you have integers that are the indices of a lookup table to get back your strings.

However, if you use the big.matrix format, you won't be able to use dplyr, but I think it would be relatively easy to reimplement a left join for your particular case.
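To give an idea (this is only a rough sketch, not a full implementation), a left join on a single key column boils down to a match() against the small table, and it works just as well when the key column of the big table is stored as integer codes. The names big.key, small and "key" are placeholders:

# `big.key` stands for the key column of the big table (possibly integer-encoded);
# `small` is an ordinary data.frame lookup table.
manual_left_join <- function(big.key, small, key.col = "key") {
  idx <- match(big.key, small[[key.col]])  # NA where the big table has no match
  small[idx, , drop = FALSE]               # rows of `small` aligned with the big table
}

# toy example
small <- data.frame(key = 1:3, label = c("a", "b", "c"))
manual_left_join(c(2L, 3L, 3L, 5L, 1L), small)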

Explore data.table for any kind of processing of large datasets in R. Its speed and efficiency are unparalleled compared to any other data-handling package in R.
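For example (a minimal sketch; the file and column names are placeholders), fread() keeps character columns as plain strings by default, so nothing is lost, and the X[Y, on = ...] idiom performs the left join:

library(data.table)

big   <- fread("big_file1.csv")      # fast reader; character columns stay as strings
small <- fread("small_lookup.csv")   # small reference table

# all rows of `big` are kept, matching columns of `small` are added
result <- small[big, on = "key"]
# equivalently: merge(big, small, by = "key", all.x = TRUE)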
