iText can extract text but cannot copy a page from a PDF with an owner password?

I'm using iText for Java to extract text from a PDF document. With a PdfReaderContentParser approach I can extract the desired text, but my PdfCopy approach results in an IllegalArgumentException.

This is my PdfReaderContentParser approach:

import java.io.File;
import java.io.FileOutputStream;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PdfExtract {

    public PdfExtract()  throws Exception{
        File f = new File( "D:/disertasi orang penting/OSD/disertasi doktor OSD.pdf" );
        PdfReader reader = new PdfReader(f.getPath()); 
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        TextExtractionStrategy strategy;
        PrintWriter out2 = new PrintWriter(new FileOutputStream( "D:/disertasi orang penting/OSD/hasil-1.txt" ));
        for(int i = 1; i  <=  reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i , new SimpleTextExtractionStrategy()); 
            String result = strategy.getResultantText(); 
            out2.println(result);
            out2.flush();
            
        }
        out2.close();
    }
    
    public static void main(String[] args ) throws Exception {
        new PdfExtract();
    }
}

And this is my PdfCopy approach:

import java.io.FileOutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;

public class TestExtractPdf {

    public static void main(String[] args) {
        try {
            String dirHasil = "D:/disertasi orang penting/OSD/hasil/";
            PdfReader reader = new PdfReader("D:/disertasi orang penting/OSD/disertasi doktor OSD.pdf");
            int n = reader.getNumberOfPages();
            System.out.println("Number of pages : " + n);
            
            for (int i = 0; i < n; i++) {
                String outfile = dirHasil + Integer.toString(i + 1) + ".pdf";
                System.out.println("Writing " + outfile);
                Document document = new Document(reader.getPageSizeWithRotation(1));
                PdfCopy copy = new PdfCopy(document, new FileOutputStream(outfile));
                document.open();
                PdfImportedPage page = copy.getImportedPage(reader, i );
                copy.addPage(page);
                document.close();
                copy.close();
            }
        } catch (Exception e) {
            System.out.println("eror");
            e.printStackTrace();
        }
    }
}

It raises this error:

java.lang.IllegalArgumentException: PdfReader not opened with owner password
    at com.itextpdf.text.pdf.PdfReaderInstance.getImportedPage(PdfReaderInstance.java:80)
    at com.itextpdf.text.pdf.PdfCopy.getImportedPageImpl(PdfCopy.java:388)
    at com.itextpdf.text.pdf.PdfCopy.getImportedPage(PdfCopy.java:255)
    at fjr.cpns.kemenkeu.TestExtractPdf.main(TestExtractPdf.java:25)

iText can extract text but cannot copy a page from a PDF with an owner password?

Indeed, iText 5 handles the permissions of encrypted files somewhat inconsistently: some functionality (in particular stamping and page copying) requires the owner password for encrypted files, while most other functionality does not.

The PDF permission structure was designed for PDF viewers and editors with a GUI, not for programming libraries, so iText cannot meaningfully enforce it. The implementation as it stands may have served to demonstrate to Adobe that iText respects this part of the PDF Reference, back when PDF was still an Adobe proprietary format.

Since version 5.0.2, though, iText 5 provides a way to override these restrictions. Simply set

PdfReader.unethicalreading = true;

before your code runs. iText then assumes that you have opened any encrypted PDF with the owner password and therefore have full permissions.
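
To see why one static flag is enough, here is a toy model of the check described above. It is not iText's source: ToyReader, importPage, and the field names are hypothetical. The only behavior assumed is the one stated in this answer, namely that the flag makes the library act as if the owner password had been supplied.

```java
public class PermissionModel {

    /** Hypothetical stand-in for a PDF reader; not an iText class. */
    static class ToyReader {
        // Global override, playing the role of PdfReader.unethicalreading.
        static boolean overridePermissions = false;

        final boolean encrypted;
        final boolean ownerPasswordUsed;

        ToyReader(boolean encrypted, boolean ownerPasswordUsed) {
            this.encrypted = encrypted;
            this.ownerPasswordUsed = ownerPasswordUsed;
        }

        // Full permissions: unencrypted file, owner password given, or override set.
        boolean hasFullPermissions() {
            return !encrypted || ownerPasswordUsed || overridePermissions;
        }

        // Page copying is modeled as an operation that insists on full permissions.
        String importPage(int pageNumber) {
            if (!hasFullPermissions()) {
                throw new IllegalArgumentException("PdfReader not opened with owner password");
            }
            return "page " + pageNumber;
        }
    }

    public static void main(String[] args) {
        // Encrypted file opened without the owner password.
        ToyReader reader = new ToyReader(true, false);
        try {
            reader.importPage(1);
        } catch (IllegalArgumentException e) {
            System.out.println("Without override: " + e.getMessage());
        }
        // Setting the flag before the call lifts the restriction for every reader.
        ToyReader.overridePermissions = true;
        System.out.println("With override: " + reader.importPage(1));
    }
}
```

Because the flag is static, it applies to every reader in the JVM, which is why it has to be set before the first restricted call and not on a particular reader instance.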


There is an actual error in your code, though:

iText page numbering is 1-based, i.e. the first page is numbered 1.

In your working text extraction you respect this and start with 1:

for(int i = 1; i  <=  reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i , new SimpleTextExtractionStrategy()); 
    ...

In your non-working copying code, though, you don't, and you start with 0:

for (int i = 0; i < n; i++) {
    ...
    PdfImportedPage page = copy.getImportedPage(reader, i );
    ...

To fix this, start with 1 and loop while i <= n so that the last page is included. The output file name should then use i instead of i + 1.
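
As a minimal sketch of the corrected loop, the class below walks pages 1 through n and uses the same 1-based number for the page and for the output file name. The class PageLoop and its helpers outputFileFor and planOutputs are hypothetical, and the iText calls from the question are left as comments so that the sketch compiles without iText on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLoop {

    /** Builds the output file name for a 1-based page number (hypothetical helper). */
    static String outputFileFor(String dir, int pageNumber) {
        return dir + pageNumber + ".pdf";
    }

    /** Returns the output files for pages 1..pageCount, in page order. */
    static List<String> planOutputs(String dir, int pageCount) {
        List<String> outputs = new ArrayList<>();
        // Start at 1 and include pageCount: there is no page 0.
        for (int i = 1; i <= pageCount; i++) {
            String outfile = outputFileFor(dir, i);
            // With iText 5 available, the per-page copy from the question goes here,
            // using i directly as the page number:
            //   PdfImportedPage page = copy.getImportedPage(reader, i);
            //   copy.addPage(page);
            outputs.add(outfile);
        }
        return outputs;
    }

    public static void main(String[] args) {
        // In the real program the count comes from reader.getNumberOfPages().
        for (String outfile : planOutputs("hasil/", 3)) {
            System.out.println("Writing " + outfile);
        }
    }
}
```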
