Extracting text from multiple PDFs using Java

Question

I have over 1000 PDF files and need to extract the text from each one and write it to a .txt file. I got the code working for a single PDF file, but not for multiple PDFs. My code is below.

Main

package pdftest;

import java.io.File;
import java.io.IOException;

public class JavaPDFTest {
    public static void main(String[] args) throws IOException {
        String path = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";

        String files;
        File folder = new File(path);
        File[] listOfFiles = folder.listFiles();

        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                files = listOfFiles[i].getName();
                if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
                    System.out.println(files);
                    String nfiles = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";
                    PDFManager pdfManager = new PDFManager();
                    String pdfToText = pdfManager.pdftoText(nfiles + files);

                    if (pdfToText == null) {
                        System.out.println("PDF to Text Conversion failed.");
                    } else {
                        System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
                        pdfManager.writeTexttoFile(pdfToText, nfiles + files + ".txt");
                    }
                }
            }
        }
    }
}

Class

package pdftest;

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFManager {
    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc;
    private COSDocument cosDoc;
    private String pdftoText;

    private String Text;
    private String filePath;
    private File file;

    public PDFManager() {
    }

    public String ToText() throws IOException {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;

        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        // To read only pages 1 to 10, use this instead:
        // pdfStripper.setEndPage(10);

        // To get the text of the full PDF file, use this:
        pdfStripper.setEndPage(pdDoc.getNumberOfPages());

        Text = pdfStripper.getText(pdDoc);
        return Text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    public String pdftoText(String string) {
        // TODO Auto-generated method stub
        return Text;
    }

    public void writeTexttoFile(String pdfToText2, String string) {
        // TODO Auto-generated method stub
    }
}

I am not getting any error, but it prints "PDF to Text Conversion failed." for every file (it hits the if condition in Main):

2016__00002685__00.PDF
PDF to Text Conversion failed.
2016__00002685__01.PDF
PDF to Text Conversion failed.
2016__100018__00.PDF
PDF to Text Conversion failed.
2016__100018__01.PDF
PDF to Text Conversion failed.

Can someone help me with the code to convert multiple PDFs to text?

Thanks, Arun

Answer 1

The pdftoText method in the PDFManager class returns Text, which is still null because ToText is never called and so the field is never assigned. You need to invoke the ToText method. Try this:

public String pdftoText(String filePath) throws IOException {
    this.setFilePath(filePath);
    return ToText();
}
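
Note that writeTexttoFile in the question's PDFManager is also an empty stub, so no .txt file would be written even once the extraction returns text. Below is a minimal sketch of how that method body could look, using only the JDK. The class name TextFileWriter, the helper toTxtName, and the choice of UTF-8 are assumptions for illustration and do not come from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the name is not from the question.
final class TextFileWriter {

    // Writes the text to outputPath as UTF-8 (an assumed encoding),
    // creating the file or overwriting it if it already exists.
    static void writeTextToFile(String text, String outputPath) throws IOException {
        Path target = Paths.get(outputPath);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Turns "report.PDF" into "report.txt" rather than "report.PDF.txt".
    static String toTxtName(String pdfName) {
        int dot = pdfName.lastIndexOf('.');
        String base = (dot > 0) ? pdfName.substring(0, dot) : pdfName;
        return base + ".txt";
    }

    // Small demo that writes a sample file into a temporary directory.
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pdf-text-demo");
        String out = dir.resolve(toTxtName("example.PDF")).toString();
        writeTextToFile("sample extracted text", out);
        System.out.println("Wrote " + out);
    }
}
```

In PDFManager, writeTexttoFile could hold the same Files.write call, and it would then need to declare throws IOException. Main already declares it, so the existing call site would compile unchanged.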

Answer 2

In addition to @Unknown's answer, the code below could simplify PDFManager. It may also be nicer to have just one method in PDFManager, either pdfToText() or ToText().

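public String ToText() throws IOException {
    // PDDocument.load is the PDFBox 2.x loader; try-with-resources closes the document.
    try (PDDocument pdDoc = PDDocument.load(new File(filePath))) {
        // startPage=1 and endPage=Integer.MAX_VALUE by default.
        return new PDFTextStripper().getText(pdDoc);
    }
}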
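
The file loop in Main could be tightened in the same spirit. The sketch below uses only the JDK and collects the PDF files in a folder. It guards against listFiles() returning null (which happens when the path is not a readable directory) and matches the extension case-insensitively, so names such as "x.Pdf" are also picked up. The class name PdfFileLister and the method listPdfFiles are hypothetical and not part of the question's code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the name is not from the question.
final class PdfFileLister {

    // Returns the regular files in folder whose names end in ".pdf"
    // (ignoring case), sorted by path. Returns an empty list if the
    // folder cannot be listed.
    static List<File> listPdfFiles(File folder) {
        List<File> pdfs = new ArrayList<>();
        File[] entries = folder.listFiles();
        if (entries == null) {
            return pdfs; // not a directory, or an I/O error occurred
        }
        Arrays.sort(entries);
        for (File entry : entries) {
            String name = entry.getName().toLowerCase(Locale.ROOT);
            if (entry.isFile() && name.endsWith(".pdf")) {
                pdfs.add(entry);
            }
        }
        return pdfs;
    }

    // Small demo: lists the PDFs in the folder given as the first
    // argument, or in the current directory if none is given.
    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : ".");
        for (File pdf : listPdfFiles(folder)) {
            System.out.println(pdf.getName());
        }
    }
}
```

Each returned File already carries its full path, so Main could pass pdf.getPath() to pdftoText without concatenating the folder string a second time.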

Answer 1
1, accepted, 2017-08-27 04:23:44

Answer 2
0, 2017-09-02 05:51:19
