Calling objective function in parallel when running optimization in R

I'm doing optimization in R. My problem involves running nlm on an objective function which loops over a large list of data. I'd like to speed up the optimization by running the objective function in parallel. How should I go about doing that?

In the example below I set up a toy problem in which the parallelized solution is slower than the original. How do I modify the code to reduce overhead and speed up the parallelized version of my nlm call?

library(parallel)

## What is the right way to do optimization when the objective function is run in parallel?
## Don't want very_big_list to be copied more than necessary

set.seed(952)

my_objfn <- function(list_element, parameter) {
    return(sum((list_element - parameter) ^ 2))  # Simple example
}

apply_my_objfn_in_parallel <- function(parameter, very_big_list, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    objfn_values <- parLapply(cluster, very_big_list, my_objfn, parameter=parameter)
    stopCluster(cluster)
    return(Reduce("+", objfn_values))
}

apply_my_objfn <- function(parameter, very_big_list) {
    objfn_values <- lapply(very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

my_big_list <- replicate(2 * 10^6, sample(seq_len(100), size=5), simplify=FALSE)
parameter_guess <- 20
mean(c(my_big_list, recursive=TRUE))  # Should be close to 50
system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 84.2 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 63.6 elapsed

I ran this on my laptop (4 CPUs, so the cluster returned by makeCluster(min(max_cores, detectCores() - 1)) has 3 cores). In the last lines above, apply_my_objfn_in_parallel takes longer than apply_my_objfn. I think this is because (1) I only have 3 cores and (2) each time nlm calls the parallelized objective function, it sets up a new cluster and breaks up and copies all of my_big_list. That seems wasteful -- would I get better results if I somehow set up the cluster and copied the list only once per nlm call? If so, how do I do that?


Edit after Erwin's answer ("consider creating and stopping the cluster once instead of in each evaluation"):

## Modify function to use single cluster per nlm call
apply_my_objfn_in_parallel_single_cluster <- function(parameter, very_big_list, my_cluster) {
    objfn_values <- parLapply(my_cluster, very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

run_nlm_single_cluster <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    nlm_result <- nlm(apply_my_objfn_in_parallel_single_cluster, parameter_guess,
                      very_big_list=very_big_list, my_cluster=cluster, print.level=0)
    stopCluster(cluster)
    return(nlm_result)
}

system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 49.0 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 36.8 elapsed
system.time(test_single_cluster <- run_nlm_single_cluster(my_big_list,
                                                          parameter_guess))  # 38.4 elapsed

In addition to my laptop (elapsed times in comments above), I ran the code on a server with 30 cores. There my elapsed times were 107 for apply_my_objfn and 74 for run_nlm_single_cluster. I'm surprised that the times were longer than on my puny little laptop, but it makes sense that the single cluster parallel optimization beats the regular non-parallel version when you have more cores.
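
Note that even with a single long-lived cluster, parLapply still splits very_big_list and ships the chunks to the workers on every evaluation. A further refinement, sketched below, would be to copy the list to each worker only once with clusterExport and then pass just the index chunks and the current parameter on each call. (The names chunk_objfn, apply_my_objfn_exported and run_nlm_exported are illustrative, and I haven't benchmarked this version.)

## Worker-side helper. Defined at top level so that serializing it does not
## drag along a local environment containing the big list; it finds
## very_big_list and my_objfn in the worker's global environment.
chunk_objfn <- function(idx, parameter) {
    sum(vapply(very_big_list[idx], my_objfn, numeric(1), parameter=parameter))
}

apply_my_objfn_exported <- function(parameter, cluster, chunks) {
    ## Each evaluation ships only the current parameter and the index chunks
    chunk_sums <- parLapply(cluster, chunks, chunk_objfn, parameter=parameter)
    return(Reduce("+", chunk_sums))
}

run_nlm_exported <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    on.exit(stopCluster(cluster))
    ## Copy the data and helper functions to each worker once, up front
    clusterExport(cluster, c("my_objfn", "chunk_objfn"))
    clusterExport(cluster, "very_big_list", envir=environment())
    chunks <- splitIndices(length(very_big_list), length(cluster))
    return(nlm(apply_my_objfn_exported, parameter_guess,
               cluster=cluster, chunks=chunks, print.level=0))
}

## system.time(test_exported <- run_nlm_exported(my_big_list, parameter_guess))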


Another edit, for completeness (see comments under Erwin's answer): here is a non-parallel solution using analytical gradients. Surprisingly, it is slower than with numerical gradients.

## Add gradients
my_objfn_value_and_gradient <- function(list_element, parameter) {
    return(c(sum((list_element - parameter) ^ 2), -2*sum(list_element - parameter)))
}

apply_my_objfn_with_gradient <- function(parameter, very_big_list) {
    ## Returns objfn value with gradient attribute, see ?nlm
    objfn_values_and_grads <- lapply(very_big_list, my_objfn_value_and_gradient, parameter=parameter)
    objfn_value_and_grad <- Reduce("+", objfn_values_and_grads)
    stopifnot(length(objfn_value_and_grad) == 2)  # First is objfn value, second is gradient
    objfn_value <- objfn_value_and_grad[1]
    attr(objfn_value, "gradient") <- objfn_value_and_grad[2]
    return(objfn_value)
}

system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 37.4 elapsed
system.time(test_regular_grad <- nlm(apply_my_objfn_with_gradient, parameter_guess,
                                     very_big_list=my_big_list, print.level=0,
                                     check.analyticals=FALSE))  # 45.0 elapsed

I'd be curious to know what's going on here; perhaps it is because the combined function computes the gradient on every call, including steps where nlm needs only the function value, while finite differences on a single parameter require only a couple of extra cheap evaluations. That said, my question is still: how can I speed up this sort of optimization problem using parallelization?

Erwin's answer:

Looks to me there is too much overhead in the parallel function evaluation to make it worthwhile. Consider creating and stopping the cluster once instead of in each evaluation. Also I believe you don't provide gradients, so the solver will likely do finite differences, which can lead to a large number of function evaluation calls. You may want to consider providing gradients.
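
Putting both suggestions together, a sketch of a parallel version that also supplies the analytic gradient might look like the following (again with illustrative names and unbenchmarked; it reuses my_objfn_value_and_gradient and the export-once pattern from above):

chunk_value_and_gradient <- function(idx, parameter) {
    ## Sums c(value, gradient) over one chunk; finds very_big_list and
    ## my_objfn_value_and_gradient in the worker's global environment
    Reduce("+", lapply(very_big_list[idx], my_objfn_value_and_gradient,
                       parameter=parameter))
}

apply_my_objfn_parallel_gradient <- function(parameter, cluster, chunks) {
    total <- Reduce("+", parLapply(cluster, chunks, chunk_value_and_gradient,
                                   parameter=parameter))
    objfn_value <- total[1]
    attr(objfn_value, "gradient") <- total[2]
    return(objfn_value)
}

run_nlm_parallel_gradient <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    on.exit(stopCluster(cluster))
    clusterExport(cluster, c("my_objfn_value_and_gradient", "chunk_value_and_gradient"))
    clusterExport(cluster, "very_big_list", envir=environment())
    chunks <- splitIndices(length(very_big_list), length(cluster))
    return(nlm(apply_my_objfn_parallel_gradient, parameter_guess,
               cluster=cluster, chunks=chunks, print.level=0,
               check.analyticals=FALSE))
}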
