Reading PDF portfolios in R

Question

Is it possible to read/convert PDF portfolios in R?

I usually use pdftools, but I get an error:

library(pdftools)
#> Using poppler version 0.73.0

link <- c("http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf")

pdftools::pdf_convert(link, dpi = 600)
#> Converting page 1 to K190072_1.png...
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#>  done!
#> [1] "K190072_1.png"

Created on 2021-05-06 by the reprex package (v1.0.0)

The K190072_1.png I end up with is only an image of the portfolio's front page.

I am interested in the document K190072.510kSummary.Final_Sent001.pdf contained in this PDF portfolio.

I found a way to do this in Python (Reading a PDF Portfolio in Python?), but I would really like to do it in R.

Thank you for your help.

Answer 1

There seems to be an issue with how pdf_convert handles one-page raw PDF data (under these conditions it tries to call basename(pdf)), so I have edited that function so that it also works with the second attached PDF file.

If you only need the first file, you can run this with the original pdf_convert function, but it will give an error with the second file.

If you are interested in rendering raster graphics from the attached files, this worked for me:
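To illustrate why raw input trips up the file-name logic, here is a minimal base R sketch. The bytes in `raw_pdf` are made up for illustration and are not a valid PDF, and `derive_name` is a hypothetical helper that only mimics the guard added to the edited function; it is not part of pdftools.

```r
# Stand-in for the raw vector that pdf_attachments() returns in `data`
# (illustrative bytes only, not a real PDF).
raw_pdf <- charToRaw("%PDF-1.4")

# basename() expects a character vector, so calling it on raw data fails.
result <- tryCatch(
  basename(raw_pdf),
  error = function(e) e
)
is_error <- inherits(result, "error")

# Hypothetical helper mirroring the added !is.raw(pdf) check:
# only derive a file stem when the input is a path, not raw bytes.
derive_name <- function(pdf) {
  if (is.raw(pdf)) return(NULL)
  sub(".pdf", "", basename(pdf), fixed = TRUE)
}
```

With a path, `derive_name("dir/K190072.pdf")` yields the stem used for the default output names; with raw data it returns NULL, so the caller has to supply `filenames` explicitly.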

library(pdftools)
#> Using poppler version 21.02.0
link <- c("http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf")

pdf_convert <- function (pdf, format = "png", pages = NULL, filenames = NULL, 
          dpi = 72, antialias = TRUE, opw = "", upw = "", verbose = TRUE) {
    config <- poppler_config()
    if (!config$can_render || !length(config$supported_image_formats)) 
        stop("Your version of libpoppler does not support rendering")
    format <- match.arg(format, poppler_config()$supported_image_formats)
    if (is.null(pages)) 
        pages <- seq_len(pdf_info(pdf, opw = opw, upw = upw)$pages)
    if (!is.numeric(pages) || !length(pages)) 
        stop("Argument 'pages' must be a one-indexed vector of page numbers")
    if (length(filenames) < 2 & !is.raw(pdf)) {   # added !is.raw(pdf)
        input <- sub(".pdf", "", basename(pdf), fixed = TRUE)
        filenames <- if (length(filenames)) {
            sprintf(filenames, pages, format)
        }
        else {
            sprintf("%s_%d.%s", input, pages, format)
        }
    }
    if (length(filenames) != length(pages)) 
        stop("Length of 'filenames' must be one or equal to 'pages'")
    antialiasing <- isTRUE(antialias) || isTRUE(antialias == 
                                                    "draw")
    text_antialiasing <- isTRUE(antialias) || isTRUE(antialias == 
                                                         "text")
    pdftools:::poppler_convert(pdftools:::loadfile(pdf), format, pages, filenames, 
                    dpi, opw, upw, antialiasing, text_antialiasing, verbose)
}

lapply(pdf_attachments(link), function(x) pdf_convert(x$data, 
    filenames=paste0(tools::file_path_sans_ext(x$name), "-", 
                     seq_along(pdf_data(x$data)), ".png")))
#> Converting page 1 to K190072.510kSummary.Final_Sent001-1.png... done!
#> Converting page 2 to K190072.510kSummary.Final_Sent001-2.png... done!
#> Converting page 3 to K190072.510kSummary.Final_Sent001-3.png... done!
#> Converting page 4 to K190072.510kSummary.Final_Sent001-4.png... done!
#> Converting page 5 to K190072.510kSummary.Final_Sent001-5.png... done!
#> Converting page 1 to K190072.IFU.FINAL_Sent001-1.png... done!
#> Converting page 1 to K190072.Letter.SE.FINAL_Sent001-1.png... done!
#> Converting page 2 to K190072.Letter.SE.FINAL_Sent001-2.png... done!
#> [[1]]
#> [1] "K190072.510kSummary.Final_Sent001-1.png"
#> [2] "K190072.510kSummary.Final_Sent001-2.png"
#> [3] "K190072.510kSummary.Final_Sent001-3.png"
#> [4] "K190072.510kSummary.Final_Sent001-4.png"
#> [5] "K190072.510kSummary.Final_Sent001-5.png"
#> 
#> [[2]]
#> [1] "K190072.IFU.FINAL_Sent001-1.png"
#> 
#> [[3]]
#> [1] "K190072.Letter.SE.FINAL_Sent001-1.png"
#> [2] "K190072.Letter.SE.FINAL_Sent001-2.png"

Created on 2021-05-05 by the reprex package (v2.0.0)
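If the goal is to read the attached documents rather than rasterize them, one option is to write each attachment back to disk as its own PDF. The following is a minimal sketch, assuming the list returned by pdf_attachments has `name` and `data` elements, as used in the code above. `save_attachments` and its `out_dir` argument are hypothetical names introduced here, not part of pdftools.

```r
# Write each attachment from a pdf_attachments()-style list to disk.
# Assumes every element is a list with `name` (character) and
# `data` (raw vector); `out_dir` is a hypothetical output directory.
save_attachments <- function(attachments, out_dir = ".") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(attachments, function(att) {
    # basename() drops any directory components in the embedded name
    path <- file.path(out_dir, basename(att$name))
    writeBin(att$data, path)
    path
  }, character(1))
}

# Usage (needs network access and the pdftools package):
# link  <- "http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf"
# files <- save_attachments(pdftools::pdf_attachments(link), "K190072")
# txt   <- pdftools::pdf_text(files[1])  # one string per page
```

Once saved, each file is an ordinary single-document PDF, so the usual pdftools functions should work on it by path.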
