R neuralnet package too slow for millions of records
I am trying to train a neural network for churn prediction with the R package neuralnet. Here is the code:
data <- read.csv('C:/PredictChurn.csv')
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
scaled <- data
scaled[, -c(1)] <- scaled_temp[, -c(1)]
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
library(neuralnet)
n <- names(train_[, -c(1)])
f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F)
It works as it should. However, when training with the full data set (in the range of millions of rows), it just takes too long. I know R is single-threaded by default, so I have tried researching how to parallelize the work across all the cores. Is it even possible to run this function in parallel? I have tried various packages with no success.
Has anyone been able to do this? It doesn't have to be the neuralnet package; any solution that lets me train a neural network would work.
Thank you
I have had good experiences with the package Rmpi, and it may be applicable in your case too.
library(Rmpi)
Briefly, its usage is as follows:
nproc <- 4  # could be determined automatically
# Spawn nproc - 1 workers; the current R session acts as the master
Rmpi::mpi.spawn.Rslaves(nslaves = nproc - 1)
# Run "func_to_be_parallelized" on the workers: the first argument is the
# list/vector to iterate over, the third is passed on to every call
my_fast_results <- Rmpi::mpi.parLapply(var1_passed_to_func,
                                       func_to_be_parallelized,
                                       var2_passed_to_func)
# Shut the workers down
Rmpi::mpi.close.Rslaves(dellog = TRUE)
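If an MPI installation is not available, the same apply-style pattern can be sketched with the parallel package that ships with R. This is only an illustration under assumptions: fit_chunk is a hypothetical stand-in for whatever function does the real work (it fits a plain lm here so the example runs without extra packages), and the data are simulated. Note that this pattern distributes independent tasks, such as fits on separate chunks or with different settings; it does not make a single neuralnet() call use several cores.

```r
library(parallel)

# Simulated data standing in for the real table (assumption)
set.seed(42)
toy <- data.frame(x = runif(1000))
toy$y <- 2 * toy$x + rnorm(1000, sd = 0.1)

# Split the row indices into 4 chunks of equal size
chunks <- split(seq_len(nrow(toy)), rep(1:4, length.out = nrow(toy)))

# Hypothetical worker function: fits a model on one chunk of rows
fit_chunk <- function(rows, data) {
  coef(lm(y ~ x, data = data[rows, ]))
}

cl <- makeCluster(2)  # two local worker processes
# Arguments after the function (here `data`) are passed on to every call
results <- parLapply(cl, chunks, fit_chunk, data = toy)
stopCluster(cl)

length(results)  # one set of coefficients per chunk
```

The cluster is stopped only after parLapply has returned, so the workers stay alive for the whole computation.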
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some of the model types but may not work for all.
library(caret)
library(doParallel)  # also attaches parallel (makeCluster, detectCores)
library(tictoc)      # tic() / toc()
layer1 <- c(6, 12, 18, 24, 30)
layer2 <- c(6, 12, 18, 24, 30)
layer3 <- c(6, 12, 18, 24, 30)
cv.folds <- 5
# To make models fully reproducible when using parallel processing, pass seeds as a parameter
# https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
total.param.permutations <- length(layer1) * length(layer2) * length(layer3)
seeds <- vector(mode = "list", length = cv.folds + 1)
set.seed(1)
# One seed per tuning-grid row for every fold; n = 1000 is an arbitrary upper bound
# (the original n = 1 made every seed equal to 1)
for (i in 1:cv.folds) seeds[[i]] <- sample.int(n = 1000, size = total.param.permutations, replace = TRUE)
seeds[[cv.folds + 1]] <- sample.int(n = 1000, size = 1)  # for the final model
nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
cl <- makeCluster(max(1, floor(detectCores() * 0.5)))  # use about half of the cores, leave the rest for other tasks
registerDoParallel(cl)
train_control <- caret::trainControl(method = "cv",
                                     number = cv.folds,
                                     seeds = seeds,  # user-defined seeds for parallel processing
                                     verboseIter = TRUE,
                                     allowParallel = TRUE)
tic("Total Time to NN Training: ")
set.seed(1)
# `formula` and `scaled.train.data` must already hold the model formula and the scaled training data
model.nn.caret <- caret::train(form = formula,
                               data = scaled.train.data,
                               method = 'neuralnet',
                               tuneGrid = nn.grid,
                               trControl = train_control)
toc()
# Release the workers only after training has finished
stopCluster(cl)
registerDoSEQ()
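To get a feel for how much work the grid above implies, the number of model fits can be counted before starting. This is a rough sketch, assuming caret fits every grid row once per fold and then one final model on the full training set, which is how k-fold "cv" resampling behaves:

```r
layer1 <- c(6, 12, 18, 24, 30)
layer2 <- c(6, 12, 18, 24, 30)
layer3 <- c(6, 12, 18, 24, 30)
cv.folds <- 5

nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
n.combinations <- nrow(nn.grid)          # 5 * 5 * 5 = 125 grid rows
n.fits <- n.combinations * cv.folds + 1  # 125 * 5 + 1 = 626 fits in total
n.fits
```

With millions of rows each of these fits is expensive, so it may be worth shrinking the grid or tuning on a subsample first. The parallel workers divide the fits among themselves but do not speed up any single one.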