Importing very large dataset into h2o from sqlite

I have a database of about 500G. It comprises 16 tables, each containing 2 or 3 columns (the first column can be discarded) and 1,375,328,760 rows. I need all the tables joined into one data frame in h2o, because they are needed to run a prediction with an XGB model. I have tried converting the individual SQL tables into the h2o environment using as.h2o and combining them with h2o.cbind, 2 or 3 tables at a time, until they form one dataset. However, I get "GC overhead limit exceeded: java.lang.OutOfMemoryError" after converting 4 tables. Is there a way around this? My machine specs are 124G RAM, OS (RHEL 7.8), root (1TB), home (600G) and a 2TB external HDD. The model is run on this local machine and max_mem_size is set to 100G. The code is below.

library(data.table)
library(h2o)          
h2o.init(
  nthreads=14,          
  max_mem_size = "100G")    
h2o.removeAll() 

setwd("/home/stan/Documents/LUR/era_aq")

l1.hex <- as.h2o(d2)
l2.hex <- as.h2o(lai)
test_l1.hex <-h2o.cbind(l1.hex,l2.hex[,-1])
h2o.rm (l1.hex,l2.hex)
l3.hex <- as.h2o(lu100)
l4.hex <- as.h2o(lu1000)
test_l2.hex <-h2o.cbind(l3.hex,l4.hex[,-1])
h2o.rm(l3.hex,l4.hex)
l5.hex <- as.h2o(lu1250)
l6.hex <- as.h2o(lu250)
test_l3.hex <-h2o.cbind(l5.hex,l6.hex[,-1])
h2o.rm(l5.hex,l6.hex)
l7.hex <- as.h2o(pbl)
l8.hex <- as.h2o(msl)
test_l4.hex <-h2o.cbind(l7.hex,l8.hex[,-1])
h2o.rm(l7.hex,l8.hex)

test.hex <-h2o.cbind(test_l1.hex,test_l2.hex[,-1],test_l3.hex[,-1],test_l4.hex[,-1])
test <- test.hex[,-1]
test[1:3,]

First, as Tom says in the comments, you're gonna need a bigger boat. H2O holds all data in memory, and generally you need 3 to 4x the data size to be able to do anything useful with it. A dataset of 500GB means you need the total memory of your cluster to be 1.5-2TB.

(H2O stores the data compressed, and I don't think sqlite does, in which case you might get away with only needing 1TB.)
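As a rough sanity check on those figures, the arithmetic can be sketched in R. It assumes that every retained value is an uncompressed 8-byte double and that each of the 16 tables contributes one or two columns once the first is discarded. Both are assumptions, and H2O's compression may bring the real footprint down.

```r
# Back-of-envelope memory estimate.
# Assumption: 8 bytes per value, no compression.
n_rows          <- 1375328760   # row count given in the question
bytes_per_value <- 8            # one double-precision number

gib <- function(bytes) bytes / 1024^3

# Each of the 16 tables keeps 1 or 2 columns after dropping the first
cols_low  <- 16 * 1
cols_high <- 16 * 2

per_column_gib <- gib(n_rows * bytes_per_value)
raw_low_gib    <- per_column_gib * cols_low
raw_high_gib   <- per_column_gib * cols_high

# Apply the 3 to 4x working-memory rule of thumb quoted above
needed_low_gib  <- 3 * raw_low_gib
needed_high_gib <- 4 * raw_high_gib

cat(sprintf("per column:     %.1f GiB\n", per_column_gib))
cat(sprintf("raw data:       %.0f to %.0f GiB\n", raw_low_gib, raw_high_gib))
cat(sprintf("working memory: %.0f to %.0f GiB\n", needed_low_gib, needed_high_gib))
```

Under these assumptions a single column is roughly 10 GiB, so the raw values alone come to roughly 164 to 328 GiB, which is already more than a 100G heap before any working copies are made.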

Second, as.h2o() is an inefficient way to load big datasets. What will happen is your dataset is loaded into R's memory space, then it is saved to a csv file, then that csv file is streamed over TCP/IP to the H2O process.

So, the better way is to export directly from sqlite to a csv file. And then use h2o.importFile() to load that csv file into H2O.
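One way to do the export from within R, without ever holding a whole table in memory, is to fetch the query result in chunks and append each chunk to the csv file. The sketch below assumes the DBI, RSQLite and data.table packages are installed. The function name `export_table_csv`, the chunk size and the file names in the usage lines are hypothetical and not from the question.

```r
library(DBI)         # generic database interface
library(data.table)  # fwrite() for fast csv output

# Stream one SQLite table to a csv file in chunks, so that at most
# `chunk_rows` rows are held in R's memory at any time.
export_table_csv <- function(db_path, table, csv_path, chunk_rows = 1e6) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)

  sql <- paste0("SELECT * FROM ", dbQuoteIdentifier(con, table))
  res <- dbSendQuery(con, sql)
  # after = FALSE: clear the result set before the connection is closed
  on.exit(dbClearResult(res), add = TRUE, after = FALSE)

  first <- TRUE
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = chunk_rows)
    if (nrow(chunk) == 0) break
    # The first chunk creates the file with a header; later chunks append
    fwrite(chunk, csv_path, append = !first, col.names = first)
    first <- FALSE
  }
  invisible(csv_path)
}

# Usage (hypothetical database file and table name):
# export_table_csv("era_aq.sqlite", "lai", "lai.csv")
# lai.hex <- h2o::h2o.importFile(normalizePath("lai.csv"))
```

Replacing `SELECT *` with an explicit column list would let you leave out the unwanted first column at export time rather than dropping it inside H2O. The csv path also has to be readable by the H2O process itself, which is the case when H2O runs on the same machine.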

h2o.cbind() is also going to involve a lot of copying. If you can find a tool or script to column-bind the csv files before import, it might be more efficient. A quick search found csvkit, but I'm not sure whether it needs to load the files into memory or can work with the files entirely on disk.
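If no ready-made tool fits, the column-binding can be done on disk with plain text connections, reading a fixed number of lines from each file and gluing them together with commas. This is a minimal sketch, not a tested pipeline for 1.4 billion rows. It assumes every file has the same number of lines, the rows are already in the same order, and no field contains an embedded newline. The function name `cbind_csv_on_disk` is hypothetical.

```r
# Column-bind csv files chunk by chunk without loading any file fully.
# Assumptions: equal line counts, identical row order, no embedded newlines.
cbind_csv_on_disk <- function(files, out, chunk_lines = 100000L) {
  inputs <- lapply(files, file, open = "r")
  on.exit(lapply(inputs, close), add = TRUE)
  output <- file(out, open = "w")
  on.exit(close(output), add = TRUE)

  repeat {
    # Read the next block of lines from every input file
    chunks <- lapply(inputs, readLines, n = chunk_lines)
    lens <- lengths(chunks)
    if (all(lens == 0L)) break
    if (length(unique(lens)) != 1L) stop("input files differ in line count")
    # Join line i of every file with commas; header lines are joined too
    writeLines(do.call(paste, c(chunks, sep = ",")), output)
  }
  invisible(out)
}
```

Because whole lines are pasted together, every column of every input ends up in the output, so it is simplest to export only the columns you need in the first place. On a Unix system the standard `paste -d,` utility performs the same line-wise join and is likely to be faster than doing it in R.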

Since memory is at a premium and R holds everything in RAM, avoid storing large helper data.table and h2o objects in your global environment. Consider building the list inside a function, so that temporary objects are removed once the function goes out of scope. Ideally, you build your h2o objects directly from the file source:

# BUILD LIST OF H2O OBJECTS WITHOUT HELPER COPIES (FIRST COLUMN DROPPED)
h2o_list <- lapply(list_of_files, function(f) as.h2o(data.table::fread(f))[, -1])
# h2o_list <- lapply(list_of_files, function(f) h2o.importFile(f)[, -1])

# CBIND ALL H2O OBJECTS
test.h2o <- do.call(h2o.cbind, h2o_list)

Or even combine both lines, using a named function instead of an anonymous one. Then only the final object remains after processing.

build_h2o <- function(f) as.h2o(data.table::fread(f))[, -1]
# build_h2o <- function(f) h2o.importFile(f)[, -1]

test.h2o <- do.call(h2o.cbind, lapply(list_of_files, build_h2o))

Extend the function with an if statement when only some datasets need their first column dropped.

build_h2o <- function(f) {
   if (grepl("lai|lu1000|lu250|msl", f)) {
      # drop = 1 skips the first column while reading
      tmp <- fread(f, drop = 1)
   } else {
      tmp <- fread(f)
   }

   return(as.h2o(tmp))
}

Finally, if possible, leverage data.table methods like cbindlist (where your data.table version provides it):

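# drop = 1 skips the first column of each file while reading
final_dt <- cbindlist(lapply(list_of_files, function(f) fread(f, drop = 1)))
# final_dt <- do.call(cbind, lapply(list_of_files, function(f) fread(f, drop = 1)))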

test.h2o <- as.h2o(final_dt)

rm(final_dt)
gc()
