Page gets cropped in the new document when copying with PDFBox

Question

I am trying to split a single PDF into several, for example turning a 10-page document into 10 single-page documents.

PDDocument source = PDDocument.load(input_file);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(file);
output.close();

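The problem is that the page size in the new document differs from the original, so some text in the new document is cropped or missing. I am using PDFBox 2.0. How can I avoid this?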

Update: Thanks @mkl.

Splitter did the magic. Here is the updated, working part:

public static void extractAndCreateDocument(SplitMeta meta, PDDocument source)
      throws IOException {

    File file = new File(meta.getFilename());

    Splitter splitter = new Splitter();
    splitter.setStartPage(meta.getStart());
    splitter.setEndPage(meta.getEnd());
    splitter.setSplitAtPage(meta.getEnd());

    List<PDDocument> docs = splitter.split(source);
    if(docs.size() > 0){
      PDDocument output = docs.get(0);
      output.save(file);
      output.close();
    }
  }

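public class SplitMeta {

  private String filename;
  private int start;
  private int end;

  public SplitMeta() {
  }

  // Getters used by extractAndCreateDocument (setters omitted)
  public String getFilename() { return filename; }
  public int getStart() { return start; }
  public int getEnd() { return end; }
}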

Answer 1

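Unfortunately the OP has not provided a sample document to reproduce the issue, so I have to guess.

I assume the issue is caused by objects that are not linked directly to the page object but are inherited from its parents in the page tree.

In that case PDDocument.addPage is the wrong choice, because this method only adds the given page object to the target document's page tree and does not take inherited values into account.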
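The effect can be illustrated with a toy model of a page tree. This is only a sketch and not the PDFBox API: the classes Node and InheritanceDemo are hypothetical, and the fallback to a Letter-sized box when no MediaBox can be found is an assumption made for the illustration.

```java
import java.util.Arrays;

/** Toy model of a page tree node (hypothetical, NOT the PDFBox API). */
class Node {
    Node parent;       // null for the root of a page tree
    float[] mediaBox;  // null means "inherit from the parent"

    Node(Node parent, float[] mediaBox) {
        this.parent = parent;
        this.mediaBox = mediaBox;
    }

    /** Walks up the tree until a node with an explicit MediaBox is found. */
    float[] effectiveMediaBox(float[] fallback) {
        for (Node n = this; n != null; n = n.parent) {
            if (n.mediaBox != null) {
                return n.mediaBox;
            }
        }
        return fallback;
    }
}

class InheritanceDemo {
    // Assumed default for this sketch: a Letter-sized box in points.
    static final float[] LETTER = {0, 0, 612, 792};

    public static void main(String[] args) {
        // The MediaBox is set on the parent only; the page itself has none.
        Node sourceRoot = new Node(null, new float[] {0, 0, 1000, 1400});
        Node page = new Node(sourceRoot, null);
        // prints [0.0, 0.0, 1000.0, 1400.0]
        System.out.println(Arrays.toString(page.effectiveMediaBox(LETTER)));

        // Hanging the bare page object into another tree drops the inherited box.
        page.parent = new Node(null, null);
        // prints [0.0, 0.0, 612.0, 792.0]
        System.out.println(Arrays.toString(page.effectiveMediaBox(LETTER)));
    }
}
```

Content laid out for the larger box no longer fits the smaller effective box, which would match the cropping described in the question.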

Instead one should use PDDocument.importPage, which is documented as:

/**
 * This will import and copy the contents from another location. Currently the content stream is stored in a scratch
 * file. The scratch file is associated with the document. If you are adding a page to this document from another
 * document and want to copy the contents to this document's scratch file then use this method otherwise just use
 * the {@link #addPage} method.
 * 
 * Unlike {@link #addPage}, this method does a deep copy. If your page has annotations, and if
 * these link to pages not in the target document, then the target document might become huge.
 * What you need to do is to delete page references of such annotations. See
 * <a href="http://stackoverflow.com/a/35477351/535646">here</a> for how to do this.
 *
 * @param page The page to import.
 * @return The page that was imported.
 * 
 * @throws IOException If there is an error copying the page.
 */
public PDPage importPage(PDPage page) throws IOException

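Actually, even this method is not enough as is, because it does not take all inherited attributes into account. Looking at the Splitter utility class, however, gives an impression of what has to be done: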

PDPage imported = getDestinationDocument().importPage(page);
imported.setCropBox(page.getCropBox());
imported.setMediaBox(page.getMediaBox());
// only the resources of the page will be copied
imported.setResources(page.getResources());
imported.setRotation(page.getRotation());
// remove page links to avoid copying not needed resources 
processAnnotations(imported);

making use of the helper method

private void processAnnotations(PDPage imported) throws IOException
{
    List<PDAnnotation> annotations = imported.getAnnotations();
    for (PDAnnotation annotation : annotations)
    {
        if (annotation instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink)annotation;   
            PDDestination destination = link.getDestination();
            if (destination == null && link.getAction() != null)
            {
                PDAction action = link.getAction();
                if (action instanceof PDActionGoTo)
                {
                    destination = ((PDActionGoTo)action).getDestination();
                }
            }
            if (destination instanceof PDPageDestination)
            {
                // TODO preserve links to pages within the splitted result  
                ((PDPageDestination) destination).setPage(null);
            }
        }
        // TODO preserve links to pages within the splitted result  
        annotation.setPage(null);
    }
}
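The reason the explicit setters in the Splitter excerpt help can be sketched with a toy model: the getter on the source page resolves a value through the page tree, and the setter writes that resolved value directly onto the copy. The classes TreeNode and MaterializeDemo and the string values below are hypothetical and are not part of PDFBox. The four keys are the page attributes the PDF specification declares inheritable, and they correspond to the four setters called above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy page tree node holding attributes by name (hypothetical, NOT the PDFBox API). */
class TreeNode {
    final TreeNode parent;
    final Map<String, String> attrs = new HashMap<>();

    TreeNode(TreeNode parent) {
        this.parent = parent;
    }

    /** Resolves an attribute locally first, then through the ancestors. */
    String resolve(String key) {
        for (TreeNode n = this; n != null; n = n.parent) {
            String value = n.attrs.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}

class MaterializeDemo {
    // Page attributes that may be inherited from ancestor nodes.
    static final String[] INHERITABLE = {"MediaBox", "CropBox", "Resources", "Rotate"};

    /** Copies a page into another tree and stores resolved values on the copy itself. */
    static TreeNode copyWithInherited(TreeNode page, TreeNode targetParent) {
        TreeNode copy = new TreeNode(targetParent);
        copy.attrs.putAll(page.attrs);
        for (String key : INHERITABLE) {
            String value = page.resolve(key);
            if (value != null) {
                copy.attrs.put(key, value);
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        TreeNode sourceRoot = new TreeNode(null);
        sourceRoot.attrs.put("MediaBox", "[0 0 1000 1400]"); // set on the parent only
        TreeNode page = new TreeNode(sourceRoot);
        page.attrs.put("Contents", "page content");

        TreeNode copy = copyWithInherited(page, new TreeNode(null));
        // The copy now carries the box itself and no longer depends on the source tree.
        System.out.println(copy.attrs.get("MediaBox")); // prints [0 0 1000 1400]
    }
}
```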

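As you are trying to split a single PDF into several documents (for example, a 10-page document into 10 single-page documents), you may want to use this Splitter utility class as is.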

Tests

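To test these methods I used AnnotationSample.Standard.pdf, the output of a PDF Clown sample, because that library relies heavily on inheritance of page tree values. I copied the content of its only page to a new document using PDDocument.addPage, PDDocument.importPage, or Splitter, like this: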

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(new File(RESULT_FOLDER, "PageAddedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java, test testWithAddPage)

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.importPage(page);
output.save(new File(RESULT_FOLDER, "PageImportedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java, test testWithImportPage)

PDDocument source = PDDocument.load(resource);
Splitter splitter = new Splitter();
List<PDDocument> results = splitter.split(source);
Assert.assertEquals("Expected exactly one result document from splitting a single page document.", 1, results.size());
PDDocument output = results.get(0);
output.save(new File(RESULT_FOLDER, "PageSplitFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java, test testWithSplitter)

Only the final test, the one using Splitter, copied the page faithfully.

Answer 1
4 Accepted 2016-05-30 15:17:38
