How can I improve the performance of Linux-based Docker Desktop containers running R scripts on Windows 10?

I'd like to be able to get the same performance with Docker as I get in RStudio. I have Docker Desktop installed on Windows 10 and am using Linux containers. The goal is to containerize R scripts for general use. An R script, dtbenchmark.R (adapted from the data.table benchmark script by Matt Dowle), that encapsulates the problem I'm having is:

library(data.table)
K <- 100L
rows <- c(1e7L, 1:7*1e8L)
for (i in 1:length(rows)) {
  tme <- proc.time()
  N <- rows[i]
  set.seed(1)
  DT <- data.table(
    id1 = sample(sprintf("id%03d",1:K), N, TRUE),       # large groups (char)
    id2 = sample(sprintf("id%03d",1:K), N, TRUE),       # large groups (char)
    id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),  # small groups (char)
    id4 = sample(K, N, TRUE),                           # large groups (int)
    id5 = sample(K, N, TRUE),                           # large groups (int)
    id6 = sample(N/K, N, TRUE),                         # small groups (int)
    v1 =  sample(5, N, TRUE),                           # int in range [1,5]
    v2 =  sample(5, N, TRUE),                           # int in range [1,5]
    v3 =  sample(round(runif(100,max=100),4), N, TRUE)) # numeric e.g. 23.5749
  GB <- round(sum(gc()[,2])/1024, 3)
  rt <- round(proc.time() - tme, 2)
  print(paste0('i = ', i, ' N = ', N, ' K = ', K, ' GB = ', GB, ' seconds = ', rt[3]), quote = FALSE)
  rm(N, DT, GB, rt)
}

The Dockerfile is:

FROM rocker/r-ver:3.4.3
RUN Rscript -e "install.packages('https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.0.tar.gz', repos = NULL, type = 'source')"
COPY . /root
WORKDIR /root
CMD ["Rscript", "dtbenchmark.R"]

In RStudio, the script dtbenchmark.R gets through five loops before exiting with an error message:

[1] i = 1 N = 10000000 K = 100 GB = 0.532 seconds = 2.64
[1] i = 2 N = 100000000 K = 100 GB = 4.954 seconds = 44.58
[1] i = 3 N = 200000000 K = 100 GB = 9.868 seconds = 170.53
[1] i = 4 N = 300000000 K = 100 GB = 14.778 seconds = 426.42
[1] i = 5 N = 400000000 K = 100 GB = 19.688 seconds = 1013.77
Error: cannot allocate vector of size 3.7 Gb
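As a rough cross-check (my own estimate, not from the original post), the printed GB figures track a simple per-row calculation: the three character columns are stored as 8-byte pointers, the five integer columns take 4 bytes each, and the one numeric column takes 8 bytes, for about 52 bytes per row:

```python
# Back-of-envelope estimate of each iteration's data.table size in memory.
# Assumptions (mine): 3 character columns ~ 8-byte pointers, 5 integer
# columns ~ 4 bytes each, 1 numeric column ~ 8 bytes => 52 bytes per row.
# String cache and grouping temporaries add real overhead on top of this.
BYTES_PER_ROW = 3 * 8 + 5 * 4 + 1 * 8  # = 52

rows = [10_000_000, 100_000_000, 200_000_000, 300_000_000, 400_000_000]
for n in rows:
    gib = n * BYTES_PER_ROW / 1024**3
    print(f"N = {n:>11,}  ~{gib:6.2f} GiB")
```

The estimates (~0.48, 4.84, 9.69, 14.53, 19.37 GiB) sit just under the GB values the script prints, which is consistent since `gc()` also counts other live objects. The point is that by i = 4 the table alone needs roughly 15 GiB, which the 48 GB host handles in RStudio but a memory-capped container VM does not.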

With the Dockerfile and dtbenchmark.R in the same folder, the docker command to build the image from that folder in Windows PowerShell is:

docker build -t dtbenchmark .

Then the docker command in Windows PowerShell to run the container is:

docker run --rm dtbenchmark:latest

In PowerShell, the container only gets through three loops before exiting with no message:

[1] i = 1 N = 10000000 K = 100 GB = 0.515 seconds = 2.08
[1] i = 2 N = 100000000 K = 100 GB = 4.937 seconds = 41.3
[1] i = 3 N = 200000000 K = 100 GB = 9.851 seconds = 91.81

My laptop has Windows 10 Enterprise, 48 GB of RAM, and a 64-bit OS. I'm not able to run as administrator.
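A likely explanation (my inference, not stated in the post): Docker Desktop runs Linux containers inside a VM whose memory is capped independently of host RAM, and a container that exits silently mid-run has typically been killed by the kernel OOM killer (exit code 137). If Docker Desktop is using the WSL 2 backend, the VM's cap can be raised in a `%UserProfile%\.wslconfig` file, which does not require administrator rights to edit; the values below are examples to adjust for a 48 GB machine:

```ini
# %UserProfile%\.wslconfig  (WSL 2 backend only; example values, not defaults)
[wsl2]
memory=40GB   # cap for the WSL 2 VM that hosts Docker Desktop's containers
swap=8GB
```

After saving, run `wsl --shutdown` and restart Docker Desktop. With the Hyper-V backend, the equivalent setting is the memory slider under Docker Desktop Settings → Resources. A per-container cap can also be set explicitly, e.g. `docker run --rm --memory=40g dtbenchmark:latest`.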

So I'm completely unfamiliar with this process, but from a PowerShell standpoint, when I need a process to complete quickly I always run a foreach loop and process in parallel. By default PowerShell will process five loops in parallel at a time, but you could experiment with raising that number.

Possibly (this is `ForEach-Object -Parallel`, available in PowerShell 7+; the `foreach -parallel` keyword only works inside a PowerShell workflow):

    $containers | ForEach-Object -Parallel {
        # do something with $_
    } -ThrottleLimit 5
