How can I improve the performance of Linux-based Docker Desktop containers running R scripts on Windows 10?

I'd like to get the same performance with Docker as I get in RStudio. I have Docker Desktop installed on Windows 10 and am using Linux containers. The goal is to containerize R scripts for general use. An R script dtbenchmark.R (adapted from the data.table benchmark script by Matt Dowle) that encapsulates the problem I'm having is

library(data.table)
K <- 100L
rows <- c(1e7L, 1:7*1e8L)
for (i in 1:length(rows)) {
  tme <- proc.time()
  N <- rows[i]
  set.seed(1)
  DT <- data.table(
    id1 = sample(sprintf("id%03d",1:K), N, TRUE),       # large groups (char)
    id2 = sample(sprintf("id%03d",1:K), N, TRUE),       # large groups (char)
    id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),  # small groups (char)
    id4 = sample(K, N, TRUE),                           # large groups (int)
    id5 = sample(K, N, TRUE),                           # large groups (int)
    id6 = sample(N/K, N, TRUE),                         # small groups (int)
    v1 =  sample(5, N, TRUE),                           # int in range [1,5]
    v2 =  sample(5, N, TRUE),                           # int in range [1,5]
    v3 =  sample(round(runif(100,max=100),4), N, TRUE)) # numeric e.g. 23.5749
  GB <- round(sum(gc()[,2])/1024, 3)
  rt <- round(proc.time() - tme, 2)
  print(paste0('i = ', i, ' N = ', N, ' K = ', K, ' GB = ', GB, ' seconds = ', rt[3]), quote = FALSE)
  rm(N, DT, GB, rt)
}
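
For scale, a back-of-the-envelope estimate of the table's footprint (a sketch, assuming 64-bit R, counting 8-byte pointers for the character columns, and ignoring R's shared string cache) roughly matches the GB values the script prints:

# 3 character columns x 8 bytes (pointers) + 5 integer columns x 4 bytes + 1 double column x 8 bytes
bytes_per_row <- 3*8 + 5*4 + 8        # 52 bytes per row
bytes_per_row * 1e8 / 1024^3          # ~4.8 GB at N = 1e8, close to the GB = 4.954 reported below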

The Dockerfile is

FROM rocker/r-ver:3.4.3
RUN Rscript -e "install.packages('https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.0.tar.gz', repos = NULL, type = 'source')"
COPY . /root
WORKDIR /root
CMD ["Rscript", "dtbenchmark.R"]

In RStudio, the script dtbenchmark.R gets through five loops before exiting with an error message, as in

[1] i = 1 N = 10000000 K = 100 GB = 0.532 seconds = 2.64
[1] i = 2 N = 100000000 K = 100 GB = 4.954 seconds = 44.58
[1] i = 3 N = 200000000 K = 100 GB = 9.868 seconds = 170.53
[1] i = 4 N = 300000000 K = 100 GB = 14.778 seconds = 426.42
[1] i = 5 N = 400000000 K = 100 GB = 19.688 seconds = 1013.77
Error: cannot allocate vector of size 3.7 Gb
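
The size in the message is consistent with a single 8-byte-per-element column of length N = 5e8 failing to allocate at i = 6:

5e8 * 8 / 1024^3   # = 3.73, i.e. the "3.7 Gb" in the error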

With the Dockerfile and dtbenchmark.R in the same folder, the Docker command to build the image from that folder in Windows PowerShell is

docker build -t dtbenchmark .

Then the docker command in Windows PowerShell to run the container is

docker run --rm dtbenchmark:latest

In PowerShell, the container gets through only three loops before exiting with no message, as in

[1] i = 1 N = 10000000 K = 100 GB = 0.515 seconds = 2.08
[1] i = 2 N = 100000000 K = 100 GB = 4.937 seconds = 41.3
[1] i = 3 N = 200000000 K = 100 GB = 9.851 seconds = 91.81
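
A way to confirm that the silent exit is the Linux VM's out-of-memory killer rather than an R error (a sketch; the container name dtbench is made up here) is to run without --rm and inspect the stopped container:

docker run --name dtbench dtbenchmark:latest
$LASTEXITCODE                                     # 137 = 128 + SIGKILL, the usual OOM-kill signature
docker inspect --format '{{.State.OOMKilled}}' dtbench
docker rm dtbench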

My laptop has Windows 10 Enterprise, 48 GB of RAM and a 64-bit OS. I'm not able to run as administrator.
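
Docker Desktop runs Linux containers inside a VM whose memory is capped independently of the host's 48 GB. A sketch of checking that cap from PowerShell (docker info reports the VM's total memory in bytes):

docker info --format '{{.MemTotal}}'
docker run --rm dtbenchmark:latest cat /proc/meminfo

If the cap is below the roughly 15 GB the fourth loop (N = 3e8) needs, the VM's kernel kills the R process with no R-level message, which would match the output above. The cap is raised with the Memory slider under Settings > Resources on the Hyper-V backend, or with a memory= line in the [wsl2] section of %UserProfile%\.wslconfig on the WSL 2 backend.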

I'm completely unfamiliar with Docker, but from a PowerShell standpoint, when I need a process to complete quickly I run a foreach loop and process in parallel. By default PowerShell will process 5 loop iterations in parallel at a time, but you could experiment with upping that number.

Possibly:

workflow Invoke-Containers {
    # workflow name is arbitrary; foreach -parallel is only valid inside a workflow
    param([string[]]$containers)
    foreach -parallel -throttlelimit 5 ($container in $containers) {
        # do something with $container
    }
}
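
The foreach -parallel syntax above is PowerShell Workflow syntax, which only runs in Windows PowerShell 5.1 and earlier. In PowerShell 7+ the equivalent is ForEach-Object -Parallel, whose throttle limit also defaults to 5; a sketch, assuming $containers holds image names:

$containers | ForEach-Object -Parallel {
    # $_ is the current pipeline item; e.g. run each image
    docker run --rm $_
} -ThrottleLimit 5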
