
Extract text from multiple PDFs using Java

I have over 1000 PDF files and need to extract the text from each of them into a .txt file. I have working code for a single PDF file, but I have not managed to make it work for multiple PDFs. My code is below.

Main

package pdftest;

import java.io.File;
import java.io.IOException;

public class JavaPDFTest {
    public static void main(String[] args) throws IOException {
        String path = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";

        String files;
        File folder = new File(path);
        File[] listOfFiles = folder.listFiles();

        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                files = listOfFiles[i].getName();
                if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
                    System.out.println(files);
                    String nfiles = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";
                    PDFManager pdfManager = new PDFManager();
                    String pdfToText = pdfManager.pdftoText(nfiles + files);

                    if (pdfToText == null) {
                        System.out.println("PDF to Text Conversion failed.");
                    } else {
                        System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
                        pdfManager.writeTexttoFile(pdfToText, nfiles + files + ".txt");
                    }
                }
            }
        }
    }
}

Class

package pdftest;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFManager {
    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc;
    private COSDocument cosDoc;
    private String pdftoText;

    private String Text;
    private String filePath;
    private File file;

    public PDFManager() {
    }

    public String ToText() throws IOException {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;

        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        // pdfStripper.setEndPage(10); // would read text from page 1 to 10 only

        // if you want to get text from the full PDF file, use this code
        pdfStripper.setEndPage(pdDoc.getNumberOfPages());

        Text = pdfStripper.getText(pdDoc);
        return Text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    public String pdftoText(String string) {
        // TODO Auto-generated method stub
        return Text;
    }

    public void writeTexttoFile(String pdfToText2, String string) {
        // TODO Auto-generated method stub
    }
}

I am not getting any error, but it prints "PDF to Text Conversion failed." for every file (it hits the if condition in Main):

2016__00002685__00.PDF
PDF to Text Conversion failed.
2016__00002685__01.PDF
PDF to Text Conversion failed.
2016__100018__00.PDF
PDF to Text Conversion failed.
2016__100018__01.PDF
PDF to Text Conversion failed.

Can someone help me with the code to convert multiple PDFs to text?

Thanks, Arun

The pdftoText method in the PDFManager class returns Text, which is still null because ToText is never called to set it. You need to invoke the ToText method. Try this:

public String pdftoText(String filePath) throws IOException {
    this.setFilePath(filePath);
    return ToText();
}
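Note that writeTexttoFile in the question is also an empty stub, so even with the fix above no .txt file is written. Below is a minimal sketch of that step using only java.nio, assuming UTF-8 output is acceptable. The class name TextFileWriter and the helper toTxtName are hypothetical and not part of the original PDFManager.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of the original PDFManager.
final class TextFileWriter {

    // Writes the extracted text to outputPath as UTF-8 (assumed encoding),
    // creating the file or replacing an existing one.
    static void writeTextToFile(String text, String outputPath) throws IOException {
        Path target = Paths.get(outputPath);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Turns "report.PDF" into "report.txt" rather than "report.PDF.txt".
    static String toTxtName(String pdfName) {
        int dot = pdfName.lastIndexOf('.');
        String base = (dot > 0) ? pdfName.substring(0, dot) : pdfName;
        return base + ".txt";
    }
}
```

The body of writeTextToFile could be pasted into the existing writeTexttoFile(String, String) stub, provided that method is also declared with throws IOException.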

In addition to @Unknown's answer, the following could simplify PDFManager. It would be cleaner to have just one method, either pdftoText() or ToText(), in PDFManager.

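public String ToText() throws IOException {
    // PDDocument.load is the PDFBox 2.x way to open a file;
    // try-with-resources closes the document when done.
    try (PDDocument pdDoc = PDDocument.load(new File(filePath))) {
        // startPage=1 endPage=Integer.MAX_VALUE by default.
        return new PDFTextStripper().getText(pdDoc);
    }
}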
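For the 1000+ file case in the question, the folder loop can be kept separate from the PDF library. The sketch below is one possible arrangement, not the original poster's code: BatchPdfToText, TextExtractor and convertAll are hypothetical names, output is assumed to be UTF-8, and the extractor is assumed never to return null. It matches the extension case-insensitively and keeps going when a single file fails.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical names throughout; the PDF library is plugged in via TextExtractor.
final class BatchPdfToText {

    // Stand-in for whatever performs the extraction, e.g. PDFManager.pdftoText.
    interface TextExtractor {
        String extract(Path pdf) throws IOException;
    }

    // Converts every *.pdf (any letter case) directly inside folder.
    // Returns the number of files converted successfully.
    static int convertAll(Path folder, TextExtractor extractor) throws IOException {
        int converted = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (!Files.isRegularFile(entry)
                        || !name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    continue;
                }
                try {
                    String text = extractor.extract(entry);
                    // Swap the 4-character ".pdf" suffix for ".txt".
                    String txtName = name.substring(0, name.length() - 4) + ".txt";
                    Files.write(folder.resolve(txtName), text.getBytes(StandardCharsets.UTF_8));
                    converted++;
                } catch (IOException e) {
                    // One unreadable PDF should not abort the remaining files.
                    System.err.println("Skipping " + name + ": " + e.getMessage());
                }
            }
        }
        return converted;
    }
}
```

With the corrected pdftoText from the first answer, a call might look like BatchPdfToText.convertAll(Paths.get(path), pdf -> new PDFManager().pdftoText(pdf.toString())), where path is the folder from Main.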
