Converting a PDF to text using Tesseract OCR

Aim: convert a PDF to base64, where the PDF can be either a regular (text-based) PDF or a scanned one.

I am using Tesseract OCR to convert scanned PDFs to text files. Since I am working in Java, I am using the tess4j library for this.

The flow of the program, as I have planned it, is as follows:

Get PDF file ---> Convert each page to an image using Ghost4j ---> Pass each image to tess4j for OCR ---> Convert the whole text to base64
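The last step of that flow can be sketched on its own with the JDK alone. This is a minimal sketch, assuming Java 8 or later (for java.util.Base64) and UTF-8 text; the class name Base64TextSketch and the sample string are placeholders for illustration, not part of the original project.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64TextSketch {
    /** Encodes text (for example OCR output) as Base64 over its UTF-8 bytes. */
    static String encodeText(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encodeText; useful for checking the round trip. */
    static String decodeText(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder standing in for the text returned by the OCR step
        String ocrText = "Sample OCR output";
        String encoded = encodeText(ocrText);
        System.out.println(encoded);
        System.out.println(decodeText(encoded));
    }
}
```

On older JDKs, the javax.xml.bind.DatatypeConverter.printBase64Binary call that appears commented out in the code below serves the same purpose.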

I have been able to convert a PDF file to images using the following code:

package helpers;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import java.awt.Image;
import java.awt.image.RenderedImage;
import java.util.List;
import javax.imageio.ImageIO;

import org.ghost4j.document.DocumentException;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.analyzer.FontAnalyzer;
import org.ghost4j.renderer.RendererException;
import org.ghost4j.renderer.SimpleRenderer;
import net.sourceforge.tess4j.*;

class encoder {
    public static byte[] createByteArray(File pCurrentFolder, String pNameOfBinaryFile) {
        String pathToBinaryData = pCurrentFolder.getAbsolutePath()+"/"+pNameOfBinaryFile;

        File file = new File(pathToBinaryData);
        if (!file.exists()) {
            System.out.println(pNameOfBinaryFile+" could not be found in folder "+pCurrentFolder.getName());
            return null;
        }

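        byte[] fileContent = new byte[(int) file.length()];
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fin = new FileInputStream(file)) {
            int offset = 0;
            int read;
            // read() may return fewer bytes than requested, so loop until the buffer is full
            while (offset < fileContent.length
                    && (read = fin.read(fileContent, offset, fileContent.length - offset)) != -1) {
                offset += read;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return fileContent;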
    }

    public void covertToImage(File pdfDoc) {
        PDFDocument document = new PDFDocument();
        try {
            document.load(pdfDoc);
        } catch (IOException e) {
            e.printStackTrace();
        }
        SimpleRenderer renderer = new SimpleRenderer();
        renderer.setResolution(300);
        List<Image> images = null;
        try {
            images = renderer.render(document);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (RendererException e) {
            e.printStackTrace();
        } catch (DocumentException e) {
            e.printStackTrace();
        }
        try {
            if (images != null) {
                // for testing only 1 page
                ImageIO.write((RenderedImage) images.get(10), "png", new File("/home/cloudera/Downloads/1.png"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public class encodeFile {
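    // doOCR declares the checked TesseractException, so main must declare or catch it
    public static void main(String[] args) throws TesseractException {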
        /* This part is for pure PDF files i.e. not scanned */
        //byte[] arr = encoder.createByteArray(new File("/home/cloudera/Downloads/"), "test.pdf");
        //String result = javax.xml.bind.DatatypeConverter.printBase64Binary(arr);
        //System.out.println(result);

        /* This part create the image for a page of scanned PDF file */
        new encoder().covertToImage(new File("/home/cloudera/Downloads/isl99201.pdf")); // results in 1.png

        /* This part is for OCR */
        Tesseract instance = new Tesseract();
        String res = instance.doOCR(new File("/home/cloudera/Downloads/1.png"));
        System.out.println(res);
    }
}

Running this produces the errors below.

The first one occurs when I try to create an image from the PDF. If I remove tess4j from build.sbt, the image is created without any errors, but I need tess4j for the OCR step.

Connected to the target VM, address: '127.0.0.1:46698', transport: 'socket'
Exception in thread "main" java.lang.AbstractMethodError: com.sun.jna.Structure.getFieldOrder()Ljava/util/List;
    at com.sun.jna.Structure.fieldOrder(Structure.java:884)
    at com.sun.jna.Structure.getFields(Structure.java:910)
    at com.sun.jna.Structure.deriveLayout(Structure.java:1058)
    at com.sun.jna.Structure.calculateSize(Structure.java:982)
    at com.sun.jna.Structure.calculateSize(Structure.java:949)
    at com.sun.jna.Structure.allocateMemory(Structure.java:375)
    at com.sun.jna.Structure.<init>(Structure.java:184)
    at com.sun.jna.Structure.<init>(Structure.java:172)
    at com.sun.jna.Structure.<init>(Structure.java:159)
    at com.sun.jna.Structure.<init>(Structure.java:151)
    at org.ghost4j.GhostscriptLibrary$display_callback_s.<init>(GhostscriptLibrary.java:63)
    at org.ghost4j.Ghostscript.buildNativeDisplayCallback(Ghostscript.java:381)
    at org.ghost4j.Ghostscript.initialize(Ghostscript.java:336)
    at org.ghost4j.renderer.SimpleRenderer.run(SimpleRenderer.java:105)
    at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:86)
    at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:70)
    at helpers.encoder.covertToImage(encodeFile.java:62)
    at helpers.encodeFile.main(encodeFile.java:86)
Disconnected from the target VM, address: '127.0.0.1:46698', transport: 'socket'

Process finished with exit code 1

This second error occurs when passing any image to tess4j:

Connected to the target VM, address: '127.0.0.1:46133', transport: 'socket'
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path (....)
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:271)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
    at com.sun.jna.Library$Handler.<init>(Library.java:147)
    at com.sun.jna.Native.loadLibrary(Native.java:412)
    at com.sun.jna.Native.loadLibrary(Native.java:391)
    at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:78)
    at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:40)
    at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:360)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:273)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:205)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:189)
    at helpers.encodeFile.main(encodeFile.java:89)
Disconnected from the target VM, address: '127.0.0.1:46133', transport: 'socket'

Process finished with exit code 1
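An UnsatisfiedLinkError like this one means JNA could not find libtesseract.so anywhere it looked. A small JDK-only check can show the JVM architecture and whether the file exists in the directories you expect. This is a sketch under stated assumptions: the class NativeLibCheck is hypothetical, and /usr/lib64 and /usr/local/lib are guesses at common install locations, not paths taken from the original setup. The jna.library.path system property is the one JNA consults for extra search directories.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class NativeLibCheck {
    /** Splits a path-list property such as java.library.path into directories. */
    static List<Path> splitPathList(String pathList) {
        List<Path> dirs = new ArrayList<>();
        if (pathList == null) {
            return dirs;
        }
        for (String entry : pathList.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                dirs.add(Paths.get(entry));
            }
        }
        return dirs;
    }

    /** Returns the first directory that contains a regular file named fileName. */
    static Optional<Path> findLibrary(String fileName, List<Path> dirs) {
        for (Path dir : dirs) {
            if (Files.isRegularFile(dir.resolve(fileName))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        List<Path> dirs = splitPathList(System.getProperty("java.library.path"));
        dirs.addAll(splitPathList(System.getProperty("jna.library.path")));
        // Assumed locations; adjust to wherever Tesseract is installed
        dirs.add(Paths.get("/usr/lib64"));
        dirs.add(Paths.get("/usr/local/lib"));
        System.out.println(findLibrary("libtesseract.so", dirs)
                .map(d -> "libtesseract.so found in " + d)
                .orElse("libtesseract.so not found in " + dirs));
    }
}
```

If the library turns up in a directory that is not on either property, starting the JVM with -Djna.library.path pointing at that directory is one option to try.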

I am working in IntelliJ with SBT on 64-bit CentOS 6.6. After some searching on the internet I have been able to understand the issues above, but I am facing two constraints:

  • The JNA version pulled in by default is the latest one, i.e. 4.1.0. I read that this kind of incompatibility between JNA and other libraries can occur, so I tried to specify the older version 3.4.0, but build.sbt keeps rejecting it.

  • I am on a 64-bit system, and Tesseract would work with a 32-bit system. How should I integrate it into the project?

The following is the part of build.sbt that declares all the required libraries:

"org.ghost4j" % "ghost4j" % "0.5.1",
"org.bouncycastle" % "bctsp-jdk14" % "1.46",
"net.sourceforge.tess4j" % "tess4j" % "2.0.0",
"com.github.jai-imageio" % "jai-imageio-core" % "1.3.0",
"net.java.dev.jna" % "jna" % "3.4.0" // does not make any difference as only 4.1.0 is installed.

Please help me out with this problem.

UPDATE: I added "net.java.dev.jna" % "jna" % "3.4.0" force() to build.sbt and it solved my first problem.
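One way to confirm which JNA jar actually ends up on the classpath after forcing the version is to ask the JVM where a class was loaded from. The following is a minimal sketch using only the JDK; the class name ClassOrigin is hypothetical, and the three class names passed to it are the ones that appear in the stack traces above.

```java
import java.security.CodeSource;

class ClassOrigin {
    /** Describes where className was loaded from, or reports that it is missing. */
    static String locate(String className) {
        try {
            // initialize=false: look the class up without running its static initializers
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader (the JDK itself) have no code source
            if (source == null || source.getLocation() == null) {
                return "bootstrap/JDK";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // With the project dependencies on the classpath, each line shows the jar in use,
        // and the jar file name normally includes the version
        System.out.println("JNA:     " + locate("com.sun.jna.Structure"));
        System.out.println("ghost4j: " + locate("org.ghost4j.Ghostscript"));
        System.out.println("tess4j:  " + locate("net.sourceforge.tess4j.Tesseract"));
    }
}
```

Run it from the same SBT project so that it sees the same resolved dependencies as the application.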

The solution to this issue lies in the Tesseract-API project that I found on GitHub. I forked it into my GitHub account, added a test for a scanned image, and did some code refactoring. After that, the library started to function properly. The scanned doc I used for testing is here.

I built it successfully on Travis, and it now works fine on both 32-bit and 64-bit systems.
