pdfbox: how to clone a page

Question

Using Apache PDFBox, I am editing an existing document, and I would like to take one page from that document and simply clone it, copying whatever elements it contains. As an additional twist, I would like to get a reference to the PDField objects for all form fields on the newly cloned page. Here's the code I have tried so far:

            PDPage newPage = new PDPage(lastPage.getCOSDictionary());
            PDFCloneUtility cloner = new PDFCloneUtility(pdfDoc);
            pdfDoc.addPage(newPage);
            cloner.cloneMerge(lastPage, newPage);

            // there doesn't seem to be an API to read the fields from the page, need to filter them out from the document.
            List<PDField> newFields = readPdfFields(pdfDoc);
            Iterator<PDField> i = newFields.iterator();
            while (i.hasNext()) {
                if (i.next().getWidget().getPage() != newPage)
                    i.remove();
            }

readPdfFields is a helper method I wrote to get all the fields in a document using the AcroForm.

But this code seems to put my JVM into some kind of crash/hang state. I haven't been able to debug exactly what's happening, but I'm guessing this is not actually the right way to clone a page. What is?

Answer 1

The least resource-intensive way to clone a page is a shallow copy of the corresponding dictionary:

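// PDFBox 1.x API, as used in the question
PDDocument doc = PDDocument.load( file );

List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();

PDPage page = allPages.get(0);
COSDictionary pageDict = page.getCOSDictionary();
// shallow copy: the new dictionary shares all entry values with the original
COSDictionary newPageDict = new COSDictionary(pageDict);

// drop the annotations, whose page references would still point at the original page
newPageDict.removeItem(COSName.ANNOTS);

PDPage newPage = new PDPage(newPageDict);
doc.addPage(newPage);

doc.save( outfile );
doc.close();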

I explicitly removed the annotations (form fields etc.) from the copy because each annotation holds a reference back to its page, which would obviously be wrong in the copied page.

Thus, if you want the annotations to come along in a clean way, you also have to create shallow copies of the annotations array and of all annotation dictionaries it contains, and replace the page reference in each copy.
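To make the "shallow copy plus fixed page reference" idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the PDFBox API: a Map stands in for a PDF dictionary, a List for the annotations array, and the class and method names (ShallowPageCopy, clonePage) are made up for illustration. The keys "Annots" and "P" mirror the PDF entries for a page's annotation array and an annotation's back-reference to its page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the idea; Map stands in for a PDF dictionary.
// This is NOT the PDFBox API, and all names here are hypothetical.
public class ShallowPageCopy {

    /**
     * Shallow-copies a page dictionary, its annotation array and each
     * annotation dictionary, and points each copied annotation's "P"
     * entry at the new page instead of the original one.
     */
    @SuppressWarnings("unchecked")
    public static Map<String, Object> clonePage(Map<String, Object> page) {
        // Shallow copy: entry values (e.g. the content stream) stay shared.
        Map<String, Object> newPage = new LinkedHashMap<>(page);

        List<Map<String, Object>> annots = (List<Map<String, Object>>) page.get("Annots");
        if (annots != null) {
            List<Map<String, Object>> newAnnots = new ArrayList<>();
            for (Map<String, Object> annot : annots) {
                Map<String, Object> annotCopy = new LinkedHashMap<>(annot);
                annotCopy.put("P", newPage); // fix the back-reference to the page
                newAnnots.add(annotCopy);
            }
            newPage.put("Annots", newAnnots);
        }
        return newPage;
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Contents", "content stream");

        Map<String, Object> annot = new LinkedHashMap<>();
        annot.put("T", "customerName");
        annot.put("P", page);

        List<Map<String, Object>> annots = new ArrayList<>();
        annots.add(annot);
        page.put("Annots", annots);

        Map<String, Object> copy = clonePage(page);
        List<Map<String, Object>> copiedAnnots = (List<Map<String, Object>>) copy.get("Annots");

        // Only identity comparisons are printed: the structure is cyclic,
        // so printing the maps themselves would recurse endlessly.
        System.out.println(copiedAnnots.get(0).get("P") == copy);          // true
        System.out.println(annot.get("P") == page);                        // true
        System.out.println(copy.get("Contents") == page.get("Contents")); // true
    }
}
```

The last check shows the flip side of a shallow copy: everything that was not copied explicitly is still one shared object, so changing it through one page changes it for the other as well.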

Most PDF readers would not mind, though, if the page references are incorrect. For a dirty solution, therefore, you could simply leave the annotations in the page dictionary. But who wants to be dirty... ;)

If you additionally want to change some parts of the new or the old page, you obviously also have to copy the respective PDF objects before manipulating them, because a shallow copy shares those objects between both pages.

Some other remarks:

Your original page cloning looks weird to me. After all, you add the identical page dictionary to the document again (duplicate entries in the page tree are ignored, I think) and then do some merge between these identical page objects.

I assume the PDFCloneUtility is meant for cloning between different documents, not within the same one. In any case, merging a dictionary into itself cannot be expected to work.

"I would like to get a reference to all the PDFields for any form fields in this newly cloned page"

As the fields have the same name, they are identical!

Fields in PDF are abstract fields which can have many appearances spread over the document. The same name implies the same field.

A field appearing on some page means that there is an annotation representing that field on the page. To make things more complicated, the field dictionary and the annotation dictionary can be merged for fields with only one appearance.

Thus, depending on your requirements, you will first have to decide whether you want to work with fields or with field annotations.
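The distinction between a field and its annotations can be sketched with a small plain-Java model. Again, this is not the PDFBox API; the classes Page, Widget and Field and the method widgetsOnPage are hypothetical names chosen for illustration. The model assumes one field named "customer" that appears on both the original page and its clone, so it is a single field with two annotations.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of fields and their annotations; NOT the PDFBox API.
// All class and method names are hypothetical.
public class FieldWidgetModel {

    static final class Page {
        final String label;
        Page(String label) { this.label = label; }
    }

    /** One visual appearance of a field on one page. */
    static final class Widget {
        final Field field;
        final Page page;
        Widget(Field field, Page page) { this.field = field; this.page = page; }
    }

    /** The abstract field; it may appear on several pages. */
    static final class Field {
        final String name;
        final List<Widget> widgets = new ArrayList<>();
        Field(String name) { this.name = name; }

        Widget addWidget(Page page) {
            Widget widget = new Widget(this, page);
            widgets.add(widget);
            return widget;
        }
    }

    /** Collects the annotations located on the given page, across all fields. */
    public static List<Widget> widgetsOnPage(List<Field> fields, Page page) {
        List<Widget> result = new ArrayList<>();
        for (Field field : fields) {
            for (Widget widget : field.widgets) {
                if (widget.page == page) {
                    result.add(widget);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Page original = new Page("original");
        Page clone = new Page("clone");

        // Same name on both pages means one field with two appearances.
        Field customer = new Field("customer");
        customer.addWidget(original);
        customer.addWidget(clone);

        List<Field> fields = new ArrayList<>();
        fields.add(customer);

        List<Widget> onClone = widgetsOnPage(fields, clone);
        System.out.println(onClone.size());                    // 1
        System.out.println(onClone.get(0).field.name);         // customer
        System.out.println(onClone.get(0).field == customer);  // true
    }
}
```

In this model, filtering by page yields annotations, not new fields: the annotation found on the clone still belongs to the same "customer" field as the one on the original page, so a value set on the field shows up in both places.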

Answer 2

I found out how to clone pages correctly. Note, however, that links in the table of contents don't work afterwards, if that's important to you.

// Assumes originalDocument is an already loaded PDDocument held in a static field (not shown here).
private static void createDocument(File outputFile) {
    // try-with-resources closes outputDoc even if cloning or saving fails
    try (PDDocument outputDoc = new PDDocument()) {
        // carry over version, document information and viewer preferences
        outputDoc.getDocument().setVersion(originalDocument.getDocument().getVersion());
        outputDoc.setDocumentInformation(originalDocument.getDocumentInformation());
        outputDoc.getDocumentCatalog().setViewerPreferences(originalDocument.getDocumentCatalog().getViewerPreferences());

        PDFCloneUtility cloner = new PDFCloneUtility(outputDoc);
        for (PDPage originalPage : originalDocument.getPages()) {
            // clone the page dictionary into the new document
            COSDictionary pageDictionary = (COSDictionary) cloner.cloneForNewDocument(originalPage);
            PDPage page = new PDPage(pageDictionary);
            outputDoc.addPage(page);
        }

        outputDoc.save(outputFile);
    } catch (IOException e) {
        e.printStackTrace();
    }
}


Answer 1: 14 (accepted), 2013-11-20 11:25:42

Answer 2: 1, 2020-11-05 17:31:06
