How to reduce the size of merged PDF/A-1b files using PDFBox or other Java libraries

Question

Input: A list of (e.g. 14) PDF/A-1b files with embedded fonts.
Processing: A simple merge with Apache PDFBox.
Result: One PDF/A-1b file that is too large. (Its size is almost the sum of the sizes of all the source files.)

Question: Is there a way to reduce the file size of the resulting PDF?
Idea: Remove redundant embedded fonts. But how? And is that the right approach?

Unfortunately, the following code does not do the job, but it does highlight the obvious problem.

try (PDDocument document = PDDocument.load(new File("E:/tmp/16189_ZU_20181121195111_5544_2008-12-31_Standardauswertung.pdf"))) {
    List<COSName> collectedFonts = new ArrayList<>();
    PDPageTree pages = document.getDocumentCatalog().getPages();
    int pageNr = 0;
    for (PDPage page : pages) {
        pageNr++;
        Iterable<COSName> names = page.getResources().getFontNames();
        System.out.println("Page " + pageNr);
        for (COSName name : names) {
            collectedFonts.add(name);
            System.out.print("\t" + name + " - ");
            PDFont font = page.getResources().getFont(name);
            System.out.println(font + ", embedded: " + font.isEmbedded());
            page.getCOSObject().removeItem(COSName.F);
            page.getResources().getCOSObject().removeItem(name);
        }
    }
    document.save("E:/tmp/output.pdf");
}

The code produces output like this:

Page 1
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 2
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 3
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 4
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 5
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 6
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 7
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 8
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 9
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 10
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 11
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 12
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 13
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 14
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true

Any help appreciated...

Answer 1

The code in this answer is an attempt to optimize documents like the OP's example document, i.e. documents containing copies of exactly identical objects; in the case at hand, completely identical, fully embedded fonts. It does not merge objects that are merely nearly identical, e.g. it does not combine multiple subsets of the same font into one single union subset.

In the course of the comments on the question it became clear that the duplicate fonts in the OP's PDF were indeed identical full copies of a source font file. To merge such duplicate objects, one has to collect the complex objects (arrays, dictionaries, streams) of a document, compare them with each other, and then merge duplicates.

Because an actual pairwise comparison of all complex objects can take too much time for large documents, the following code calculates a hash of each of these objects and only compares objects with identical hashes.
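The idea of hashing first and comparing only within groups of equal hash can be shown without PDFBox. The following is a minimal sketch, not the answer's code: the class name `HashBucketDedup` and the method `countDuplicates` are made up for illustration, and plain byte arrays stand in for font streams. It also shows why the full comparison is still needed: two different payloads can share a hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashBucketDedup {
    /** Counts payloads that are byte-for-byte copies of an earlier payload. */
    static int countDuplicates(List<byte[]> payloads) {
        // Step 1: a cheap hash, used only to group candidates.
        Map<Integer, List<byte[]>> buckets = new HashMap<>();
        for (byte[] payload : payloads) {
            buckets.computeIfAbsent(Arrays.hashCode(payload), h -> new ArrayList<>()).add(payload);
        }
        int duplicates = 0;
        // Step 2: the expensive full comparison, but only inside each bucket.
        for (List<byte[]> bucket : buckets.values()) {
            List<byte[]> distinct = new ArrayList<>();
            for (byte[] candidate : bucket) {
                boolean seen = false;
                for (byte[] kept : distinct) {
                    if (Arrays.equals(kept, candidate)) {
                        seen = true;
                        break;
                    }
                }
                if (seen) {
                    duplicates++;
                } else {
                    distinct.add(candidate);
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<byte[]> payloads = Arrays.asList(
                new byte[] {1, 2, 3}, new byte[] {1, 2, 3}, // true duplicates
                new byte[] {0, 31}, new byte[] {1, 0});     // equal hash (992), different content
        System.out.println(countDuplicates(payloads));      // prints 1
    }
}
```

Only the first pair counts as a duplicate; the second pair lands in the same bucket but fails the full comparison.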

To merge duplicates, the code selects one of the duplicates and replaces all references to any of the other duplicates with a reference to the chosen one, removing the other duplicates from the document object pool. To do this more efficiently, the code initially collects not only all complex objects but also all references to each of them.
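Why collecting the incoming references pays off can be sketched with a simplified model that does not use PDFBox. Everything here is hypothetical (`RedirectReferences`, `Node`, `Ref`, `link`, `merge`): each object keeps a list of the slots that point to it, so merging a duplicate only touches those slots instead of scanning the whole document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class RedirectReferences {
    /** Stand-in for a dictionary-like object that refers to other objects by key. */
    static class Node {
        final Map<String, Object> entries = new HashMap<>();
    }

    /** One incoming reference: the slot {@code key} inside {@code holder}. */
    static class Ref {
        final Node holder;
        final String key;

        Ref(Node holder, String key) {
            this.holder = holder;
            this.key = key;
        }
    }

    /** Stores target under holder[key] and records the slot as an incoming reference. */
    static void link(Map<Object, List<Ref>> incoming, Node holder, String key, Object target) {
        holder.entries.put(key, target);
        incoming.computeIfAbsent(target, t -> new ArrayList<>()).add(new Ref(holder, key));
    }

    /** Points every slot that referenced the duplicate at the survivor instead. */
    static void merge(Map<Object, List<Ref>> incoming, Object survivor, Object duplicate) {
        List<Ref> moved = incoming.remove(duplicate);
        if (moved == null) {
            return; // nothing referenced the duplicate
        }
        for (Ref ref : moved) {
            ref.holder.entries.put(ref.key, survivor);
        }
        incoming.computeIfAbsent(survivor, s -> new ArrayList<>()).addAll(moved);
    }

    static boolean demo() {
        // Keys are compared by identity, since equal-looking copies are distinct objects.
        Map<Object, List<Ref>> incoming = new IdentityHashMap<>();
        Node page1 = new Node();
        Node page2 = new Node();
        byte[] fontCopy1 = {1, 2, 3};
        byte[] fontCopy2 = {1, 2, 3};
        link(incoming, page1, "F1", fontCopy1);
        link(incoming, page2, "F1", fontCopy2);
        merge(incoming, fontCopy1, fontCopy2);
        // Both pages now share one font object, and it knows both of its referrers.
        return page1.entries.get("F1") == page2.entries.get("F1")
                && incoming.get(fontCopy1).size() == 2;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true
    }
}
```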

The optimization code

This is the method to call to optimize a PDDocument:

public void optimize(PDDocument pdDocument) throws IOException {
    Map<COSBase, Collection<Reference>> complexObjects = findComplexObjects(pdDocument);
    for (int pass = 0; ; pass++) {
        int merges = mergeDuplicates(complexObjects);
        if (merges <= 0) {
            System.out.printf("Pass %d - No merged objects\n\n", pass);
            break;
        }
        System.out.printf("Pass %d - Merged objects: %d\n\n", pass, merges);
    }
}

(OptimizeAfterMerge method under test)

The optimization takes multiple passes because the equality of some objects can only be recognized after the duplicates they reference have been merged.
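The need for several passes can be reproduced in a small model, again without PDFBox and with hypothetical names (`MultiPassMerge`, `Node`, `shallowEquals`, `mergePass`). Two objects count as equal here only if they have the same label and refer to the very same child objects. For brevity this sketch scans the whole pool when redirecting references instead of keeping a list of incoming references.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MultiPassMerge {
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Same label and the very same child objects (children compared by identity). */
    static boolean shallowEquals(Node a, Node b) {
        if (!a.label.equals(b.label) || a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (a.children.get(i) != b.children.get(i)) {
                return false;
            }
        }
        return true;
    }

    /** One pass: merges each node that shallow-equals an earlier one; returns the merge count. */
    static int mergePass(List<Node> pool) {
        int merges = 0;
        for (int i = 0; i < pool.size(); i++) {
            for (int j = pool.size() - 1; j > i; j--) {
                if (shallowEquals(pool.get(i), pool.get(j))) {
                    Node survivor = pool.get(i);
                    Node duplicate = pool.remove(j);
                    for (Node holder : pool) {
                        holder.children.replaceAll(c -> c == duplicate ? survivor : c);
                    }
                    merges++;
                }
            }
        }
        return merges;
    }

    /** Repeats passes until one merges nothing; returns the merge count of every pass. */
    static List<Integer> optimize(List<Node> pool) {
        List<Integer> mergesPerPass = new ArrayList<>();
        int merges;
        do {
            merges = mergePass(pool);
            mergesPerPass.add(merges);
        } while (merges > 0);
        return mergesPerPass;
    }

    static List<Node> samplePool() {
        Node descA = new Node("FontDescriptor");
        Node descB = new Node("FontDescriptor");
        Node fontA = new Node("Font", descA);
        Node fontB = new Node("Font", descB);
        Node root = new Node("Root", fontA, fontB);
        return new ArrayList<>(Arrays.asList(root, fontA, fontB, descA, descB));
    }

    public static void main(String[] args) {
        System.out.println(optimize(samplePool())); // prints [1, 1, 0]
    }
}
```

In the first pass only the two descriptors are recognized as duplicates. The two font objects become equal only after both refer to the surviving descriptor, so they are merged in the second pass, and the third pass finds nothing left to merge.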

The following helper methods and classes collect the complex objects of a PDF and the references to each of them:

Map<COSBase, Collection<Reference>> findComplexObjects(PDDocument pdDocument) {
    COSDictionary catalogDictionary = pdDocument.getDocumentCatalog().getCOSObject();
    Map<COSBase, Collection<Reference>> incomingReferences = new HashMap<>();
    incomingReferences.put(catalogDictionary, new ArrayList<>());

    Set<COSBase> lastPass = Collections.<COSBase>singleton(catalogDictionary);
    Set<COSBase> thisPass = new HashSet<>();
    while(!lastPass.isEmpty()) {
        for (COSBase object : lastPass) {
            if (object instanceof COSArray) {
                COSArray array = (COSArray) object;
                for (int i = 0; i < array.size(); i++) {
                    addTarget(new ArrayReference(array, i), incomingReferences, thisPass);
                }
            } else if (object instanceof COSDictionary) {
                COSDictionary dictionary = (COSDictionary) object;
                for (COSName key : dictionary.keySet()) {
                    addTarget(new DictionaryReference(dictionary, key), incomingReferences, thisPass);
                }
            }
        }
        lastPass = thisPass;
        thisPass = new HashSet<>();
    }
    return incomingReferences;
}

void addTarget(Reference reference, Map<COSBase, Collection<Reference>> incomingReferences, Set<COSBase> thisPass) {
    COSBase object = reference.getTo();
    if (object instanceof COSArray || object instanceof COSDictionary) {
        Collection<Reference> incoming = incomingReferences.get(object);
        if (incoming == null) {
            incoming = new ArrayList<>();
            incomingReferences.put(object, incoming);
            thisPass.add(object);
        }
        incoming.add(reference);
    }
}

(OptimizeAfterMerge helper methods findComplexObjects and addTarget)

interface Reference {
    public COSBase getFrom();

    public COSBase getTo();
    public void setTo(COSBase to);
}

static class ArrayReference implements Reference {
    public ArrayReference(COSArray array, int index) {
        this.from = array;
        this.index = index;
    }

    @Override
    public COSBase getFrom() {
        return from;
    }

    @Override
    public COSBase getTo() {
        return resolve(from.get(index));
    }

    @Override
    public void setTo(COSBase to) {
        from.set(index, to);
    }

    final COSArray from;
    final int index;
}

static class DictionaryReference implements Reference {
    public DictionaryReference(COSDictionary dictionary, COSName key) {
        this.from = dictionary;
        this.key = key;
    }

    @Override
    public COSBase getFrom() {
        return from;
    }

    @Override
    public COSBase getTo() {
        return resolve(from.getDictionaryObject(key));
    }

    @Override
    public void setTo(COSBase to) {
        from.setItem(key, to);
    }

    final COSDictionary from;
    final COSName key;
}

(OptimizeAfterMerge helper interface Reference with implementations ArrayReference and DictionaryReference)

And the following helper methods and classes finally identify and merge duplicates:

int mergeDuplicates(Map<COSBase, Collection<Reference>> complexObjects) throws IOException {
    List<HashOfCOSBase> hashes = new ArrayList<>(complexObjects.size());
    for (COSBase object : complexObjects.keySet()) {
        hashes.add(new HashOfCOSBase(object));
    }
    Collections.sort(hashes);

    int removedDuplicates = 0;
    if (!hashes.isEmpty()) {
        int runStart = 0;
        int runHash = hashes.get(0).hash;
        for (int i = 1; i < hashes.size(); i++) {
            int hash = hashes.get(i).hash;
            if (hash != runHash) {
                int runSize = i - runStart;
                if (runSize != 1) {
                    System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
                    removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, i));
                }
                runHash = hash;
                runStart = i;
            }
        }
        int runSize = hashes.size() - runStart;
        if (runSize != 1) {
            System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
            removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, hashes.size()));
        }
    }
    return removedDuplicates;
}

int mergeRun(Map<COSBase, Collection<Reference>> complexObjects, List<HashOfCOSBase> run) {
    int removedDuplicates = 0;

    List<List<COSBase>> duplicateSets = new ArrayList<>();
    for (HashOfCOSBase entry : run) {
        COSBase element = entry.object;
        for (List<COSBase> duplicateSet : duplicateSets) {
            if (equals(element, duplicateSet.get(0))) {
                duplicateSet.add(element);
                element = null;
                break;
            }
        }
        if (element != null) {
            List<COSBase> duplicateSet = new ArrayList<>();
            duplicateSet.add(element);
            duplicateSets.add(duplicateSet);
        }
    }

    System.out.printf("Identified %d set(s) of identical objects in run.\n", duplicateSets.size());

    for (List<COSBase> duplicateSet : duplicateSets) {
        if (duplicateSet.size() > 1) {
            COSBase surviver = duplicateSet.remove(0);
            Collection<Reference> surviverReferences = complexObjects.get(surviver);
            for (COSBase object : duplicateSet) {
                Collection<Reference> references = complexObjects.get(object);
                for (Reference reference : references) {
                    reference.setTo(surviver);
                    surviverReferences.add(reference);
                }
                complexObjects.remove(object);
                removedDuplicates++;
            }
            surviver.setDirect(false);
        }
    }

    return removedDuplicates;
}

boolean equals(COSBase a, COSBase b) {
    if (a instanceof COSArray) {
        if (b instanceof COSArray) {
            COSArray aArray = (COSArray) a;
            COSArray bArray = (COSArray) b;
            if (aArray.size() == bArray.size()) {
                for (int i=0; i < aArray.size(); i++) {
                    if (!resolve(aArray.get(i)).equals(resolve(bArray.get(i))))
                        return false;
                }
                return true;
            }
        }
    } else if (a instanceof COSDictionary) {
        if (b instanceof COSDictionary) {
            COSDictionary aDict = (COSDictionary) a;
            COSDictionary bDict = (COSDictionary) b;
            Set<COSName> keys = aDict.keySet();
            if (keys.equals(bDict.keySet())) {
                for (COSName key : keys) {
                    // Resolve both sides, as in the array branch, so indirect references compare by target
                    if (!resolve(aDict.getItem(key)).equals(resolve(bDict.getItem(key))))
                        return false;
                }
                // In case of COSStreams we strictly speaking should
                // also compare the stream contents here. But apparently
                // their hashes coincide well enough for the original
                // hashing equality, so let's just assume...
                return true;
            }
        }
    }
    return false;
}

static COSBase resolve(COSBase object) {
    while (object instanceof COSObject)
        object = ((COSObject)object).getObject();
    return object;
}

(OptimizeAfterMerge helper methods mergeDuplicates, mergeRun, equals, and resolve)

static class HashOfCOSBase implements Comparable<HashOfCOSBase> {
    public HashOfCOSBase(COSBase object) throws IOException {
        this.object = object;
        this.hash = calculateHash(object);
    }

    int calculateHash(COSBase object) throws IOException {
        if (object instanceof COSArray) {
            int result = 1;
            for (COSBase member : (COSArray)object)
                result = 31 * result + member.hashCode();
            return result;
        } else if (object instanceof COSDictionary) {
            int result = 3;
            for (Map.Entry<COSName, COSBase> entry : ((COSDictionary)object).entrySet())
                result += entry.hashCode();
            if (object instanceof COSStream) {
                try (   InputStream data = ((COSStream)object).createRawInputStream()   ) {
                    MessageDigest md = MessageDigest.getInstance("MD5");
                    byte[] buffer = new byte[8192];
                    int bytesRead = 0;
                    while((bytesRead = data.read(buffer)) >= 0)
                        md.update(buffer, 0, bytesRead);
                    result = 31 * result + Arrays.hashCode(md.digest());
                } catch (NoSuchAlgorithmException e) {
                    throw new IOException(e);
                }
            }
            return result;
        } else {
            throw new IllegalArgumentException(String.format("Unknown complex COSBase type %s", object.getClass().getName()));
        }
    }

    final COSBase object;
    final int hash;

    @Override
    public int compareTo(HashOfCOSBase o) {
        int result = Integer.compare(hash,  o.hash);
        if (result == 0)
            result = Integer.compare(hashCode(), o.hashCode());
        return result;
    }
}

(OptimizeAfterMerge helper class HashOfCOSBase)

Applying the code to the OP's example document

The OP's example document is about 6.5 MB in size. Applying the above code like this

PDDocument pdDocument = PDDocument.load(SOURCE);

optimize(pdDocument);

pdDocument.save(RESULT);

results in a PDF less than 700 KB in size, and it appears to be complete.

(If something is missing, please tell me and I'll try to fix it.)

Words of warning

On the one hand, this optimizer will not recognize all identical duplicates. In particular, duplicate circles of objects (circular references) won't be recognized, because the code only recognizes duplicates if their contents are identical, which usually is not the case in duplicate object circles.

On the other hand, this optimizer might already be overly eager in some cases, because some duplicates might be needed as separate objects for PDF viewers to accept each instance as an individual entity.

Furthermore, this program touches all kinds of objects in the file, even those defining the inner structures of the PDF, but it does not attempt to update any PDFBox classes managing this structure (PDDocument, PDDocumentCatalog, PDAcroForm, ...). To keep pending changes from corrupting the whole document, apply this program only to freshly loaded, unmodified PDDocument instances and save them right afterwards.

Answer 2

While debugging the file, I noticed that the font files for the same fonts were referenced several times. By replacing the actual font file item in the dictionary with an already seen font file item, the reference to the duplicate was removed and the file could be compressed. That way, I was able to shrink a 30 MB file to around 6 MB.

    File file = new File("test.pdf");

    // try-with-resources closes the document even if saving fails
    try (PDDocument doc = PDDocument.load(file)) {
        // First font file stream seen for each font name
        Map<String, COSBase> fontFileCache = new HashMap<>();
        for (int pageNumber = 0; pageNumber < doc.getNumberOfPages(); pageNumber++) {
            final PDPage page = doc.getPage(pageNumber);
            COSDictionary pageDictionary = (COSDictionary) page.getResources().getCOSObject().getDictionaryObject(COSName.FONT);
            if (pageDictionary == null) {
                continue; // no font resources on this page
            }
            for (COSName currentFont : pageDictionary.keySet()) {
                COSDictionary fontDictionary = (COSDictionary) pageDictionary.getDictionaryObject(currentFont);
                for (COSName actualFont : fontDictionary.keySet()) {
                    COSBase actualFontDictionaryObject = fontDictionary.getDictionaryObject(actualFont);
                    if (actualFontDictionaryObject instanceof COSDictionary) {
                        // Only dictionaries with a /FontName entry (the font descriptor) are processed
                        COSDictionary fontFile = (COSDictionary) actualFontDictionaryObject;
                        if (fontFile.getItem(COSName.FONT_NAME) instanceof COSName) {
                            COSName fontName = (COSName) fontFile.getItem(COSName.FONT_NAME);
                            // Reuse the first stream seen under this name (assumes equal names mean identical font files)
                            fontFileCache.computeIfAbsent(fontName.getName(), key -> fontFile.getItem(COSName.FONT_FILE2));
                            fontFile.setItem(COSName.FONT_FILE2, fontFileCache.get(fontName.getName()));
                        }
                    }
                }
            }
        }

        doc.save(new File("test_compressed.pdf"));
    }

Maybe this is not the most elegant way to do it, but it works and keeps the PDF/A-1b compatibility.

Answer 3

Another way I found is to use iText 7 with pdfWriter.setSmartMode:

    try (PdfWriter pdfWriter = new PdfWriter(out)) {
        pdfWriter.setSmartMode(true); // The optimization happens here, e.g. removing redundantly embedded fonts
        pdfWriter.setCompressionLevel(Deflater.BEST_COMPRESSION);
        try (PdfDocument pdfDoc = new PdfADocument(pdfWriter, PdfAConformanceLevel.PDF_A_1B,
                new PdfOutputIntent("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", colorProfile))) {
            PdfMerger merger = new PdfMerger(pdfDoc);
            merger.setCloseSourceDocuments(true);
            try {
                for (InputStream pdf : pdfs) {
                    try (PdfDocument doc = new PdfDocument(new PdfReader(pdf))) {
                        merger.merge(doc, createPageList(doc.getNumberOfPages()));
                    }
                }
                merger.close();
            }
            catch (com.itextpdf.kernel.crypto.BadPasswordException e) {
                throw new BieneException("Cannot concatenate a password-protected PDF document: " + e.getMessage(),
                        e);
            }
            catch (com.itextpdf.io.IOException | PdfException e) {
                throw new BieneException(e.getMessage(), e);
            }
        }
    }
