How to set up Tesseract OCR in Java?

I am working on a fairly simple Java project in Visual Studio Code that requires some basic optical character recognition, but I don't have any real experience setting up APIs or accessing third-party software from my code. I'm using Maven to pull in a Tesseract package from SourceForge, which gives me a Tesseract class (API?) that takes a file path (which I believe is used to access the C++ side of things). I used Homebrew to install Tesseract, and it gave me the file path:

/usr/local/Cellar/tesseract/4.1.1

but when I plug that into this:

Tesseract instance = new Tesseract();
instance.setDatapath("/usr/local/Cellar/tesseract/4.1.1");

and run the doOCR method, it always fails with the same NullPointerException, which makes me think that it isn't accessing Tesseract correctly, especially because the same errors appear regardless of the file path I pass in. These are the errors:

21:25:56.021 [main] ERROR net.sourceforge.tess4j.Tesseract - null
java.lang.NullPointerException: null
        at net.sourceforge.tess4j.Tesseract.dispose(Tesseract.java:819)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:239)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
        at com.liamross.tess4j.TessClass.main(TessClass.java:14)
Exception in thread "main" net.sourceforge.tess4j.TesseractException: java.lang.NullPointerException
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:245)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
        at com.liamross.tess4j.TessClass.main(TessClass.java:14)
Caused by: java.lang.NullPointerException
        at net.sourceforge.tess4j.Tesseract.dispose(Tesseract.java:819)
        at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:239)

I followed this article as closely as I could, but no matter what I do, I always get these errors. Also, it seems that there is another directory or library called libtesseract? I'm not quite sure what it's used for or whether it's something I need.

I know this is a bit of an ambitious project for someone without much experience, but any help would be greatly appreciated. I've put a lot of time into trying to figure this out, and there doesn't seem to be much comprehensible material about it. Here's a screenshot of what I have so far.

Thanks!

I reproduced your project, and on the first run the same error appeared. That's because the Tesseract version is not compatible. Here is the solution:

  1. Install Tesseract 4. My machine is 64-bit Windows 10, so I installed tesseract-ocr-w64-setup-v4.0.0.20181030.exe. Make sure it installs successfully.

  2. Clean the Java Language Server Workspace in VS Code, then run again.
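
Since the failure comes from a version mismatch, it can help to confirm which Tesseract version is actually installed before running the OCR code. The following is only a minimal sketch and not part of Tess4J: the class TesseractVersionCheck and its methods are hypothetical names, and it assumes that the tesseract executable is on the PATH and that `tesseract --version` prints a first line such as "tesseract 4.1.1" or "tesseract v4.0.0.20181030".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tess4J): reports the installed Tesseract major version.
public class TesseractVersionCheck {

    // Matches e.g. "tesseract 4.1.1" or "tesseract v4.0.0.20181030" and captures the major number.
    private static final Pattern VERSION = Pattern.compile("tesseract\\s+v?(\\d+)\\.");

    // Returns the major version found in the text, or -1 if none is recognized.
    public static int parseMajorVersion(String versionOutput) {
        if (versionOutput == null) {
            return -1;
        }
        Matcher m = VERSION.matcher(versionOutput);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Runs "tesseract --version" (assumes the executable is on the PATH)
    // and returns the first line of its output.
    public static String readVersionLine() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("tesseract", "--version");
        pb.redirectErrorStream(true); // some builds print the version to stderr
        Process p = pb.start();
        String first;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            first = r.readLine();
            while (r.readLine() != null) {
                // drain the remaining output so the process can exit
            }
        }
        p.waitFor();
        return first;
    }

    public static void main(String[] args) {
        try {
            String line = readVersionLine();
            System.out.println("Detected: " + line
                    + " (major version " + parseMajorVersion(line) + ")");
        } catch (IOException | InterruptedException e) {
            System.out.println("Could not run tesseract: " + e.getMessage());
        }
    }
}
```

If the reported major version is not 4, that would be consistent with the incompatibility described above.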

Pay attention to the paths of the tessdata directory and the .jpg file.
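
The path check can be sketched as a small pre-flight step in plain Java. This is only an illustration under stated assumptions: the class OcrPathCheck and its methods are hypothetical and not part of Tess4J, and it assumes that the directory given to setDatapath is the tessdata folder itself, containing files named like eng.traineddata.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check (not part of Tess4J) for the tessdata and image paths.
public class OcrPathCheck {

    // True if dataDir is a directory that contains "<language>.traineddata".
    public static boolean hasLanguageData(Path dataDir, String language) {
        return Files.isDirectory(dataDir)
                && Files.isRegularFile(dataDir.resolve(language + ".traineddata"));
    }

    // True if the image path points to an existing, readable regular file.
    public static boolean isReadableImage(Path image) {
        return Files.isRegularFile(image) && Files.isReadable(image);
    }

    public static void main(String[] args) {
        // Both paths come from the command line; nothing is hard-coded here.
        if (args.length < 2) {
            System.out.println("usage: java OcrPathCheck <tessdata-dir> <image-file>");
            return;
        }
        Path dataDir = Paths.get(args[0]);
        Path image = Paths.get(args[1]);
        System.out.println("eng.traineddata found: " + hasLanguageData(dataDir, "eng"));
        System.out.println("image readable: " + isReadableImage(image));
    }
}
```

If both checks pass, the same directory can be passed to instance.setDatapath(...) and the image file to doOCR(...).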

Set instance.setHocr(false) to make sure the content is read correctly (screenshot omitted).

instance.setHocr(true) will show you XML code instead (screenshot omitted).
