Removing images from a PDF file using PDFBox

Question

I am attempting to delete images from a PDF using Java and PDFBox. The images are not inline, and the PDF does not have patterns or forms. The PDF file contains 2 images: the PDFDebugger tool shows Resources >> XObject >> IM3 and IM5. The problem is that when I open the output PDF file, the images have not been deleted.

public class DeleteImage {
    public static void removeImages(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));

        for (PDPage page : document.getPages()) {
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().forEach(propertyName -> {
                if(!pdResources.isImageXObject(propertyName)) {
                    return;
                }
                PDXObject o;
                try {
                    o = pdResources.getXObject(propertyName);
                    if (o instanceof PDImageXObject) {
                        System.out.println("propertyName" + propertyName);
                        page.getCOSObject().removeItem(propertyName);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });

            for (COSName name :  page.getResources().getPatternNames()) {
                PDAbstractPattern pattern = page.getResources().getPattern(name);
                System.out.println("have pattern");
            }
              
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            System.out.println("original tokens size" + tokens.size());
            List<Object> newTokens = new ArrayList<Object>();

            for(int j=0; j<tokens.size(); j++) {
                Object token = tokens.get( j );
                if( token instanceof Operator ) {
                    Operator op = (Operator)token;

                    System.out.println("operation" + op.getName());
                    //find image - remove it
                    if( op.getName().equals("Do") ) {
                        System.out.println("op equals Do");
                        newTokens.remove(newTokens.size()-1);
                        continue;
                    } else if ("BI".equals(op.getName())) {
                        System.out.println("inline -- op equals BI");
                    } else {
                        System.out.println("op not quals Do");
                    }
                }
                newTokens.add(token);
            }

            PDDocument newDoc = new PDDocument();
            PDPage newPage = newDoc.importPage(page);
            newPage.setResources(page.getResources());

            System.out.println("tokens size" + newTokens.size());
            PDStream newContents = new PDStream(newDoc);
            OutputStream out = newContents.createOutputStream();
            ContentStreamWriter writer = new ContentStreamWriter( out );
            writer.writeTokens( newTokens);
            out.close();
            newPage.setContents( newContents );
        }

        document.save("RemoveImage.pdf");
        document.close();
    }

    public static void remove(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));
        PDResources resources = null;
        
        for (PDPage page : document.getPages()) {
            resources = page.getResources();

            for (COSName name : resources.getXObjectNames()) {
                PDXObject xobject = resources.getXObject(name);
                
                if (xobject instanceof PDImageXObject) {
                    System.out.println("have image");
                    removeImages(pdfFile);
                }
            }
        }
        document.save("RemoveImage.pdf");
        document.close();
    }
}

Answer 1

If You Call `remove` ...

In `remove` you

- load the PDF into `document`,
- iterate over the pages of `document`, and for each page
  - iterate over the XObject resources, and for each Xobject
    - check whether it is an image Xobject, and if it is
      - call `removeImages`, which loads the same original file, processes it, and saves the result as "RemoveImage.pdf",
- and, after all that processing, save the unchanged `document` to "RemoveImage.pdf".

So in that last step you overwrite any changes you may have made in `removeImages` and end up with your original file in "RemoveImage.pdf"!
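The overwrite can be reproduced without PDFBox at all. The following is only a minimal sketch using plain files: `saveProcessed` and `saveUnchanged` are hypothetical stand-ins for the save inside `removeImages` and the final `document.save(...)` in `remove`, and the strings stand in for PDF content.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class OverwriteDemo {
    // Hypothetical stand-in for removeImages: saves a processed result to the target path.
    static void saveProcessed(Path target) throws IOException {
        Files.write(target, "processed".getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical stand-in for the final document.save(...) in remove: saves unchanged data.
    static void saveUnchanged(Path target) throws IOException {
        Files.write(target, "original".getBytes(StandardCharsets.UTF_8));
    }

    // Runs both saves against the same path and returns what is left on disk.
    static String run() {
        try {
            Path target = Files.createTempFile("RemoveImage", ".pdf");
            try {
                saveProcessed(target);   // first save: the processed result
                saveUnchanged(target);   // second save: silently replaces the first
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "original"
    }
}
```

Whichever save runs last determines the file content, so one way out is presumably to load the document once, modify that same instance, and save it exactly once.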

If You Call `removeImages` Directly...

In `removeImages` you do make some changes, but there are certain issues:

Whenever you find an image Xobject resource, you attempt to remove it from the page directly
```
page.getCOSObject().removeItem(propertyName);
```
but the image Xobject resource is not a direct child of the `page`; it is managed by `pdResources`, so you should remove it from there.
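To see why removing the name from the page has no effect, it may help to model the nesting the PDFDebugger output suggests (page, then Resources, then XObject, then IM3 and IM5). The sketch below uses plain `Map`s rather than PDFBox classes; the class and method names are hypothetical and the string values merely stand in for image streams.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ResourceDictDemo {
    // Builds a simplified page dictionary: Page -> Resources -> XObject -> {IM3, IM5}.
    static Map<String, Object> samplePage() {
        Map<String, Object> xobjects = new LinkedHashMap<>();
        xobjects.put("IM3", "image stream 3");
        xobjects.put("IM5", "image stream 5");
        Map<String, Object> resources = new LinkedHashMap<>();
        resources.put("XObject", xobjects);
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Type", "Page");
        page.put("Resources", resources);
        return page;
    }

    // Navigates from the page down to its XObject dictionary.
    @SuppressWarnings("unchecked")
    static Map<String, Object> xobjectsOf(Map<String, Object> page) {
        Map<String, Object> resources = (Map<String, Object>) page.get("Resources");
        return (Map<String, Object>) resources.get("XObject");
    }

    // Mirrors the attempt in the question: the page has no entry with the image's name.
    static boolean removeFromPage(Map<String, Object> page, String name) {
        return page.remove(name) != null;
    }

    // Removes the entry where it actually lives, two levels further down.
    static boolean removeFromResources(Map<String, Object> page, String name) {
        return xobjectsOf(page).remove(name) != null;
    }

    public static void main(String[] args) {
        Map<String, Object> page = samplePage();
        System.out.println(removeFromPage(page, "IM3"));      // prints "false"
        System.out.println(removeFromResources(page, "IM3")); // prints "true"
        System.out.println(xobjectsOf(page).keySet());        // prints "[IM5]"
    }
}
```

In PDFBox terms, the equivalent would presumably be to remove the name from the XObject dictionary held by the resources, for example one reached through `pdResources.getCOSObject()`; that mapping is an assumption and is not tested here. If names are removed while iterating over `getXObjectNames()`, collecting the names into a separate list first is the safer pattern.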
You remove all `Do` instructions from the page content, not only those for image Xobjects, so you probably remove more than you wanted.
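A narrower filter would drop a `Do` instruction only when its operand names an image Xobject. The sketch below models this on a simplified token list instead of the real parser output: `Op` is a hypothetical stand-in for PDFBox's `Operator`, plain strings stand in for `COSName` operands, and `imageNames` plays the role of the names for which `pdResources.isImageXObject(...)` returned true.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ImageDoFilter {
    // Minimal stand-in for a content stream operator such as "Do" or "q".
    static final class Op {
        final String name;
        Op(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    // Copies the tokens, skipping each "Do" (and its operand) whose operand is an image name.
    static List<Object> dropImageDos(List<Object> tokens, Set<String> imageNames) {
        List<Object> out = new ArrayList<>();
        for (Object token : tokens) {
            if (token instanceof Op && "Do".equals(((Op) token).name) && !out.isEmpty()) {
                Object operand = out.get(out.size() - 1);
                if (operand instanceof String && imageNames.contains(operand)) {
                    out.remove(out.size() - 1); // drop the operand that was already copied
                    continue;                   // and skip the Do operator itself
                }
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.asList(
                new Op("q"), "IM3", new Op("Do"), new Op("Q"), // draws an image
                "Fm1", new Op("Do"));                          // draws a non-image Xobject
        Set<String> imageNames = new HashSet<>(Arrays.asList("IM3", "IM5"));
        System.out.println(dropImageDos(tokens, imageNames)); // prints "[q, Q, Fm1, Do]"
    }
}
```

The `Do` for the hypothetical non-image Xobject `Fm1` survives, while the one for `IM3` is removed together with its operand. Separately, the code in the question appears to write the filtered tokens into a page of `newDoc`, which is never saved, while `document.save(...)` writes the original document; the filtered stream would also have to be set on the page of the document that actually gets saved.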

Answer 1 timestamp: 2020-08-26 16:04:05
