
Delete an image from a PDF file using PDFBox

I am attempting to delete images from a PDF using Java and PDFBox. The images are not inline, and the PDF does not have patterns or forms. The PDF file contains 2 images; the PDFDebugger tool shows Resources >> XObject >> IM3 and IM5. The problem is that when I open the output PDF file, the images have not been deleted.

public class DeleteImage {
    public static void removeImages(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));

        for (PDPage page : document.getPages()) {
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().forEach(propertyName -> {
                if(!pdResources.isImageXObject(propertyName)) {
                    return;
                }
                PDXObject o;
                try {
                    o = pdResources.getXObject(propertyName);
                    if (o instanceof PDImageXObject) {
                        System.out.println("propertyName" + propertyName);
                        page.getCOSObject().removeItem(propertyName);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });

            for (COSName name :  page.getResources().getPatternNames()) {
                PDAbstractPattern pattern = page.getResources().getPattern(name);
                System.out.println("have pattern");
            }
              
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            System.out.println("original tokens size" + tokens.size());
            List<Object> newTokens = new ArrayList<Object>();

            for(int j=0; j<tokens.size(); j++) {
                Object token = tokens.get( j );
                if( token instanceof Operator ) {
                    Operator op = (Operator)token;

                    System.out.println("operation" + op.getName());
                    //find image - remove it
                    if( op.getName().equals("Do") ) {
                        System.out.println("op equals Do");
                        newTokens.remove(newTokens.size()-1);
                        continue;
                    } else if ("BI".equals(op.getName())) {
                        System.out.println("inline -- op equals BI");
                    } else {
                        System.out.println("op not quals Do");
                    }
                }
                newTokens.add(token);
            }

            PDDocument newDoc = new PDDocument();
            PDPage newPage = newDoc.importPage(page);
            newPage.setResources(page.getResources());

            System.out.println("tokens size" + newTokens.size());
            PDStream newContents = new PDStream(newDoc);
            OutputStream out = newContents.createOutputStream();
            ContentStreamWriter writer = new ContentStreamWriter( out );
            writer.writeTokens( newTokens);
            out.close();
            newPage.setContents( newContents );
        }

        document.save("RemoveImage.pdf");
        document.close();
    }

    public static void remove(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));
        PDResources resources = null;
        
        for (PDPage page : document.getPages()) {
            resources = page.getResources();

            for (COSName name : resources.getXObjectNames()) {
                PDXObject xobject = resources.getXObject(name);
                
                if (xobject instanceof PDImageXObject) {
                    System.out.println("have image");
                    removeImages(pdfFile);
                }
            }
        }
        document.save("RemoveImage.pdf");
        document.close();
    }
}

If You Call remove ...

In remove you

  • load the PDF into document,
  • iterate over the pages of document, and for each page
    • iterate over the XObject resources, and for each XObject
      • check whether it is an image XObject, and if it is
        • call removeImages, which loads the same original file, processes it, and saves the result as "RemoveImage.pdf".
  • After all that processing you save the unchanged document to "RemoveImage.pdf".

So in that last step you overwrite any changes you may have made in removeImages and end up with your original file in "RemoveImage.pdf"!

If You Call removeImages Directly ...

In removeImages you do make some changes, but there are certain issues:

  • Whenever you find an image XObject resource, you attempt to remove it from the page directly

    page.getCOSObject().removeItem(propertyName);

    but the image XObject resource is not a direct child of the page; it is managed by pdResources, so you should remove it from there.

  • You remove all Do instructions from the page content, not only those for image XObjects, so you probably remove more than you wanted.
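
To illustrate the second issue, here is a minimal sketch of a filter that drops a Do instruction only when its operand names an image XObject. It is a simplified model, not PDFBox code: tokens are plain strings instead of PDFBox token objects, and the class ImageDoFilter, the method removeImageDraws, and the form XObject name /Fm1 are hypothetical names chosen for the example. In the real loop you would presumably make the same decision by checking the name operand that precedes Do with pdResources.isImageXObject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of a content stream: each token is a plain string.
// A name operand looks like "/IM3" and the operator that paints an XObject is "Do".
// This is an illustration only; PDFBox uses its own token classes.
class ImageDoFilter {

    /**
     * Returns a copy of tokens in which every "/Name Do" pair is dropped
     * when /Name is listed in imageNames. Do instructions whose operand is
     * not an image name (for example a form XObject) are kept.
     */
    static List<String> removeImageDraws(List<String> tokens, Set<String> imageNames) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if ("Do".equals(token) && !kept.isEmpty()) {
                // The operand of Do is the token written just before it.
                String operand = kept.get(kept.size() - 1);
                if (imageNames.contains(operand)) {
                    kept.remove(kept.size() - 1); // drop the name operand
                    continue;                     // and skip the Do operator itself
                }
            }
            kept.add(token);
        }
        return kept;
    }
}
```

With the tokens q, /IM3, Do, Q, /Fm1, Do, /IM5, Do and the image names /IM3 and /IM5, this sketch keeps q, Q, /Fm1, Do: both image draws are removed and the draw of the other XObject survives.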
