
Extract text from a PDF file

I am trying to extract the text between "[" and "]" in a PDF file, but I am unable to do so because the file seems to be encrypted. All I get are symbols that are not in a readable format.

import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ITextReadDemo {

    public static void main(String[] args) {
        try {
            // Opening the reader is where the exception below is thrown
            PdfReader reader = new PdfReader("D:\\temp\\1.pdf");
            System.out.println("This PDF has " + reader.getNumberOfPages() + " pages.");
            // Extract the text of page 2
            String page = PdfTextExtractor.getTextFromPage(reader, 2);
            System.out.println("Page Content:\n\n" + page + "\n\n");
            System.out.println("Is this document tampered : " + reader.isTampered());
            System.out.println("Is this document encrypted : " + reader.isEncrypted());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

But I am getting this exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1OctetString
    at com.itextpdf.text.pdf.PdfEncryption.<init>(PdfEncryption.java:147)
    at com.itextpdf.text.pdf.PdfReader.readDecryptedDocObj(PdfReader.java:775)
    at com.itextpdf.text.pdf.PdfReader.readDocObj(PdfReader.java:1152)
    at com.itextpdf.text.pdf.PdfReader.readPdf(PdfReader.java:512)
    at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:172)
    at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:161)
    at pdfexc.ITextReadDemo.main(ITextReadDemo.java:19)
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.asn1.ASN1OctetString
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 7 more

I also tried the following approach. It reads the contents of the PDF file, but when I display them they are not in a readable format:

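    // Needs java.io.IOException, java.nio.file.Path, java.nio.file.Paths, java.util.Scanner
    void readfile() throws IOException {
        Path path = Paths.get("D:\\temp\\1.pdf");
        // Scanner(Path) can throw IOException; try-with-resources closes the file
        try (Scanner scanner = new Scanner(path)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                System.out.println(line);
            }
        }
    }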

All I need is the contents of the PDF file (not a text file) in a readable format, so that I can extract the text between [ and ] using a regex. Please help me if you know the solution.

The cause of your problem is already described by the exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1OctetString

iText uses the BouncyCastle library for security-related tasks such as encryption and signing, and you seem not to have that library on your classpath, or at least not the required version of it.
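To confirm whether the library is visible to your program, a quick check like the following may help. It is only a sketch: the class name `ClasspathCheck` and the helper `isAvailable` are made up for illustration, and it merely tries to locate the class named in the stack trace. It cannot tell you whether the BouncyCastle version on the classpath is the one your iText version expects.

```java
// Sketch: checks whether a class can be loaded from the current classpath.
// ClasspathCheck and isAvailable are hypothetical names, not part of iText.
class ClasspathCheck {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class reported as missing in the stack trace
        String needed = "org.bouncycastle.asn1.ASN1OctetString";
        System.out.println(needed + " available: " + isAvailable(needed));
    }
}
```

Run it with the same classpath as your application. If it reports that the class is not available, the BouncyCastle jar is missing from that classpath.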

Unfortunately, you don't say which iText version you use, so I cannot tell which BouncyCastle version is required.
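Once the matching BouncyCastle jar is on the classpath and `PdfTextExtractor.getTextFromPage` returns readable text, the bracketed parts can be pulled out with `java.util.regex`. The following is a minimal sketch, not a definitive implementation: `BracketExtractor` and `extract` are hypothetical names, the sample string stands in for the real page text, and the pattern assumes the brackets are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: collects the text found between '[' and ']' in a string.
// BracketExtractor and extract are hypothetical names used for illustration.
class BracketExtractor {

    // \[ and \] match literal brackets; ([^\]]*) captures everything up to the next ']'
    private static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]*)\\]");

    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = BRACKETED.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(1)); // group 1 is the text inside the brackets
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PdfTextExtractor.getTextFromPage
        String page = "Intro [first item] middle [second item] end";
        System.out.println(extract(page)); // prints [first item, second item]
    }
}
```

In your program you would pass the `page` string from your first example to `extract` instead of the sample text.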
