
PDFBox: working with very large PDFs

I am working with some very large PDFs, some over 7GB in size. The PDFs have up to 20,000 pages and many full-page color images. I'd like to use PDFBox to work with them, but because of their size I get OutOfMemoryErrors when I attempt to open them.

I'm working with version pdfbox-app-1.6.0, on Windows 7 using IntelliJ, Java 6.

First I tried writing a simple program that just opened the PDF in a PDDocument and copied each page over to another PDDocument: http://ideone.com/arKhB

Next I tried using the PDFBox CopyDoc example.

Both examples run out of memory.

I'm assuming this is because PDFBox is trying to read the whole document into memory. Is there a way to have it open only one page at a time? I know processing would be slower, but at the moment I can't process anything.

In the 2.0.* versions, open the PDF like this:

PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This sets up buffering to use only temporary file(s) (no main memory), with no size restriction.
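To put the one-liner above in context, here is a minimal sketch of opening a large file with temp-file-only buffering and walking its pages one at a time. It assumes PDFBox 2.0.x on the classpath and Java 7 or later (for try-with-resources); the class name `LargePdfOpen` and the helper `countPages` are made up for illustration and are not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// Hypothetical helper class, for illustration only.
public class LargePdfOpen {

    /** Opens the PDF with temp-file-only buffering and returns its page count. */
    static int countPages(File pdf) throws IOException {
        // Buffers go to temporary files instead of the Java heap.
        try (PDDocument doc = PDDocument.load(pdf, MemoryUsageSetting.setupTempFileOnly())) {
            int pages = doc.getNumberOfPages();
            for (int i = 0; i < pages; i++) {
                PDPage page = doc.getPage(i);
                // Per-page work would go here; avoid keeping references to
                // pages or images around longer than needed.
                page.getMediaBox();
            }
            return pages;
        } // the document (and its temporary files) is closed here
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countPages(new File(args[0])) + " pages");
    }
}
```

If temp files alone turn out to be too slow, `MemoryUsageSetting.setupMixed(maxMainMemoryBytes)` is the variant that uses main memory up to a limit and temporary files beyond it.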

Update 17.4.2018: More tricks to save memory are described in the FAQ. Not yet described there, but available since 2.0.9, is subsampling (skipping pixel lines/rows) with PDFRenderer.setSubsamplingAllowed(true) when rendering. This saves memory for PDF files with huge images.
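For the rendering case, a sketch along the following lines should work, assuming PDFBox 2.0.9 or later and Java 7 or later. The class name `LargePdfRender` and the helper `renderPage` are invented for this example; only the PDFBox calls themselves come from the library.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

// Hypothetical helper class, for illustration only.
public class LargePdfRender {

    /** Renders a single page (0-based index) at the given DPI with subsampling allowed. */
    static BufferedImage renderPage(File pdf, int pageIndex, float dpi) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf, MemoryUsageSetting.setupTempFileOnly())) {
            PDFRenderer renderer = new PDFRenderer(doc);
            // Lets PDFBox decode large embedded images at reduced resolution
            // when the output does not need every pixel.
            renderer.setSubsamplingAllowed(true);
            return renderer.renderImageWithDPI(pageIndex, dpi);
        }
    }
}
```

Rendering one page per call and discarding the returned image before moving on keeps only a single page bitmap on the heap at a time; a lower DPI reduces that bitmap further.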
