
Why do pdf files that are printed using R's foreach() %dopar% construct turn out corrupted and unreadable?

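Below is a minimal reproducible example script that writes the same plot to two pdf files, first serially, using a standard for loop, and then in parallel, using R's foreach() %dopar% construct: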

library(ggplot2)
library(parallel)
library(doParallel)
library(foreach)

# Print an arbitrary dummy plot (from the standard "cars" data set) to a
# specific integer graphical device number.
makeplot <- function(graph_dev) {
  dev.set(graph_dev)
  plt <- ggplot(cars) + geom_point(aes(x=speed, y=dist))
  # Print the same plot repeatedly 10 times, on 10 sequential pages, in
  # order to purposefully bloat up the file size a bit and convince
  # ourselves that actual plot content is really being saved to the file.
  for(ii in seq(10)) {print(plt)}
}

# A pair of pdf files that we will write serially, on a single processor
fser <- c('test_serial_one.pdf', 'test_serial_two.pdf')

# A pair of pdf files that we will write in parallel, on two processors
fpar <- c('test_parallel_one.pdf', 'test_parallel_two.pdf')

# Open all four pdf files, and generate a key-value pair assigning each
# file name to an integer graphical device number
fnmap <- list()
for(f in c(fser, fpar)) {
  pdf(f)
  fnmap[[f]] <- dev.cur()
}

# Loop over the first two pdf files using a basic serial "for" loop
for(f in fser) {makeplot(fnmap[[f]])}

# Do the same identical loop content as above, but this time using R's
# parallelization framework, and writing to the second pair of pdf files
registerDoParallel(cl=makeCluster(2, type='FORK'))
foreach(f=fpar) %dopar% {makeplot(fnmap[[f]])}

# Close all four of the pdf files
for(f in names(fnmap)) {
    dev.off(fnmap[[f]])
}

The first two output files, test_serial_one.pdf and test_serial_two.pdf, each have a final file size of 38660 bytes and can be opened and displayed correctly using a standard pdf reader such as Adobe Acrobat Reader or similar.

The second two output files, test_parallel_one.pdf and test_parallel_two.pdf, each have a final file size of 34745 bytes, but they return a file corruption error when I attempt to read them with standard tools: e.g., "There was an error opening this document. This file cannot be opened because it has no pages."

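The fact that the file sizes are roughly equal between the serial and parallel versions suggests to me that the error message from the pdf reader is probably incorrect: the parallel loop is in fact successfully dumping page contents to the files, just as the serial loop does, and instead some kind of file footer information may be missing at the end of the page contents of the parallelized output files, possibly because those two files are not being closed successfully.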
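One way to test part of this hypothesis is to look at the last few bytes of each file, since a complete pdf file ends with a `%%EOF` marker. The helper below is a hypothetical sketch (the name `has_pdf_eof` and the 64-byte window are assumptions, not anything from the script above). It only shows whether the end-of-file marker is present; it cannot tell whether the page tree written before that marker actually lists any pages.

```r
# Hypothetical helper: TRUE if the last `n` bytes of the file at `path`
# contain the "%%EOF" marker that terminates a complete pdf file.
has_pdf_eof <- function(path, n = 64L) {
  size <- file.size(path)
  # Missing or empty files cannot contain the marker
  if (is.na(size) || size == 0) return(FALSE)
  # The files in question are small, so read them whole and keep the tail
  bytes <- readBin(path, what = "raw", n = size)
  tail_bytes <- utils::tail(bytes, n)
  # rawToChar() cannot convert NUL bytes, so drop any that appear
  tail_bytes <- tail_bytes[tail_bytes != as.raw(0)]
  grepl("%%EOF", rawToChar(tail_bytes), fixed = TRUE, useBytes = TRUE)
}
```

After running the script above, `sapply(c(fser, fpar), has_pdf_eof)` would report the result for all four files side by side.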

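For various technical reasons, I would like to be able to open and close multiple pdf files outside of the foreach() %dopar% construct, while using dev.set() inside the parallel loop to choose which file is written on each loop iteration.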

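What is the root cause of the file corruption that occurs in the parallel loop in this example? And how can I correct it: i.e., how can I modify my code to close the files properly and append the necessary pdf file footer information after the parallel loop has finished?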

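Although they are assigned different files, the forked processes share some of the graphics device plumbing. Using an MPI backend, or writing the code as SPMD for an HPC cluster, would give you as many R sessions (and graphics pipelines) as ranks.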
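If an MPI or SPMD setup is more than the job needs, a simpler way to give each worker its own graphics device is to open and close the pdf file inside the loop body. The sketch below assumes that is acceptable, which the question explicitly hoped to avoid, so treat it as a workaround rather than a fix for the original design. `makeplot_file` is a hypothetical variant of `makeplot` that takes a file name instead of a device number, and the output goes to `tempdir()` only to keep the example tidy.

```r
library(ggplot2)
library(doParallel)  # also attaches foreach and parallel

# Hypothetical variant of makeplot(): opens its own pdf device for `path`,
# prints the plot `n_pages` times, and closes the device before returning.
makeplot_file <- function(path, n_pages = 10) {
  pdf(path)
  # Close the device even if plotting fails, so the file is finalized
  on.exit(dev.off(), add = TRUE)
  plt <- ggplot(cars) + geom_point(aes(x = speed, y = dist))
  for (ii in seq_len(n_pages)) print(plt)
  invisible(path)
}

# Assumption: absolute paths under tempdir(), so the workers' working
# directory does not matter
fpar <- file.path(tempdir(),
                  c("test_parallel_one.pdf", "test_parallel_two.pdf"))

cl <- makeCluster(2)  # default PSOCK cluster: each worker is a separate R session
registerDoParallel(cl)

# Each iteration owns its device from pdf() to dev.off(), so no device
# state is shared with the parent session or with the other worker
written <- foreach(f = fpar, .packages = "ggplot2", .combine = c) %dopar% {
  makeplot_file(f)
}

stopCluster(cl)
```

Because every device is created and closed in the same process, the parent session never has to finalize a file that a worker wrote to. The trade-off is that the set of open files is no longer managed outside the loop.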
