Tesseract character recognition problems in Android (but not on iOS?)

Asked 2015-05-29 15:13:51. Tags: android, ios, ocr, tesseract, tess-two

I've built an application that uses Tesseract (V3.03 rc1) to identify some specific text strings. These are, unfortunately, printed in a custom font, which requires me to build my own traineddata file. I've built the application on both iOS (using https://github.com/gali8/Tesseract-OCR-iOS for inspiration) and Android (using https://github.com/rmtheis/tess-two/ for inspiration as well).

The workflow for both platforms is as follows:

1. I select a bounding box on the preview screen around the relevant text, and crop the image accordingly.
2. I use OpenCV to get a binary image (using OpenCV's adaptive threshold function with the same parameters on both platforms).
3. I pass this binary image to Tesseract. Both platforms (Android and iOS) use the same traineddata file.
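For readers unfamiliar with step 2, here is a minimal plain-Java sketch of mean-based adaptive thresholding. It is not OpenCV's implementation and is not guaranteed to match its output bit for bit. The class name, the method name, and the `int[][]` grayscale representation are assumptions made for illustration only.

```java
/** Illustrative sketch only; a hypothetical stand-in for a mean-based adaptive threshold. */
final class AdaptiveThresholdSketch {

    private AdaptiveThresholdSketch() {}

    /**
     * Binarizes a grayscale image (values 0..255, indexed [row][column]).
     * A pixel becomes 255 if it is brighter than the mean of its
     * blockSize x blockSize neighbourhood minus c, otherwise 0.
     * Pixels outside the image are replaced by the nearest edge pixel.
     */
    static int[][] meanAdaptiveThreshold(int[][] gray, int blockSize, double c) {
        if (blockSize < 3 || blockSize % 2 == 0) {
            throw new IllegalArgumentException("blockSize must be odd and >= 3");
        }
        int height = gray.length;
        int width = gray[0].length;
        int radius = blockSize / 2;
        int[][] out = new int[height][width];

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                long sum = 0;
                for (int dy = -radius; dy <= radius; dy++) {
                    for (int dx = -radius; dx <= radius; dx++) {
                        // Clamp coordinates so the window never leaves the image.
                        int yy = Math.max(0, Math.min(height - 1, y + dy));
                        int xx = Math.max(0, Math.min(width - 1, x + dx));
                        sum += gray[yy][xx];
                    }
                }
                double localMean = sum / (double) (blockSize * blockSize);
                out[y][x] = gray[y][x] > localMean - c ? 255 : 0;
            }
        }
        return out;
    }
}
```

Because the threshold depends only on the pixel values, `blockSize`, and `c`, feeding the same cropped image and the same parameters through this step on both platforms should give the same binary image. Dumping that image to disk on each platform is one way to rule this step out as the source of a difference.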

And yet, iOS recognizes the text strings perfectly, while Android keeps misidentifying certain characters (6s for Ss, As for Hs).

On both platforms, I use the same whitelist string, I disable load_type_dawg and load_system_dawg, and I also choose to save the blob choices.

Has anyone encountered this kind of situation before? Am I missing a setting on Android that's automatically handled in iOS? Is there something particular about Android that hasn't crossed my mind?

Any thoughts or advice would be greatly appreciated!

1 Answer

So, after a lot of work, I found out what was wrong with my Android application (thankfully, it wasn't an issue with Tesseract at all). As I'm more familiar with iOS apps than Android, I wasn't sure how to ship the traineddata file with the application without requiring the user to have the file on their external storage device. I found inspiration in this project ( http://www.codeproject.com/Tips/840623/Android-Character-Recognition ), as it autoloads the trained data file.

However, I misunderstood how it worked. I originally thought that the TessDataManager did a file lookup in the project's local tesseract/tessdata folder to get the trained data file (as I also do on iOS). That's not what it does. Rather, it checks the internal file structure (data/data/projectname/files/tesseract/tessdata/traineddatafilegoeshere) to see whether the file exists, and if it doesn't, it copies over the trained data file it keeps in the Resources/Raw directory. In my case, it defaulted to the eng file, so it never read my custom font file.
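To make the copy-if-missing behaviour concrete, here is a minimal plain-Java sketch of the idea. It is not the CodeProject TessDataManager and not part of tess-two: `TrainedDataInstaller`, `BundledData`, and `ensureInstalled` are hypothetical names. It uses java.nio.file, which on Android requires API 26 or later. Unlike what happened to me, this version fails loudly when the requested language is not bundled instead of quietly falling back to eng.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; an illustration of the copy-if-missing pattern only. */
final class TrainedDataInstaller {

    /** Opens the copy of a file shipped with the app, or returns null if it is not bundled. */
    interface BundledData {
        InputStream open(String fileName) throws IOException;
    }

    private TrainedDataInstaller() {}

    /**
     * Ensures dataRoot/tessdata/LANGUAGE.traineddata exists, copying it
     * from the bundle the first time it is requested.
     * Returns the path of the installed file.
     */
    static Path ensureInstalled(Path dataRoot, String language, BundledData bundle)
            throws IOException {
        String fileName = language + ".traineddata";
        Path target = dataRoot.resolve("tessdata").resolve(fileName);
        if (Files.exists(target)) {
            return target; // already installed on an earlier run
        }
        Files.createDirectories(target.getParent());
        try (InputStream in = bundle.open(fileName)) {
            if (in == null) {
                // Do not fall back to another language: that hides the real problem.
                throw new FileNotFoundException("No bundled data for language: " + language);
            }
            // Copy to a temporary sibling first so a failed copy never leaves
            // a truncated traineddata file under the final name.
            Path partial = target.resolveSibling(fileName + ".part");
            Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

On Android, `BundledData` would presumably be backed by the app's assets or raw resources, and `dataRoot` would be a directory under the app's internal files directory. The language string passed here has to be the same one later handed to Tesseract's init call, which is exactly the part I got wrong.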

Hopefully this helps someone else having similar issues. Thanks to Robin and RmTheis for all of your help!