OCR support for Tesseract via SOLR

Question

Good day. I'm trying to configure SOLR to use the Tesseract OCR engine to extract text from images, but have not succeeded yet.

SOLR extracts text fine from structured text documents (.xls, .pdf, .doc, etc.), but it does not call the Tesseract module for text recognition.

I'm using:

SOLR v.7.4.0
Tesseract version 4.1.1
TIKA version 1.18 (built into SOLR, no standalone version)

Tesseract is installed in the following directory:

/usr/share/tesseract/4/tessdata/
echo $TESSDATA_PREFIX - > /usr/share/tesseract/4/tessdata/
tesseract -v
tesseract 4.1.1-rc2-20-g01fb
leptonica-1.76.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0

The command tesseract test.jpg test.txt produces an accurate text file with the OCRed content of test.jpg.

The solrconfig.xml, TesseractOCRConfig.properties, and ParseContent.xml files were modified to point to the Tesseract installation.
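
One way to see whether SOLR invokes OCR at all is to send the same image to the extracting request handler in extract-only mode and inspect the returned text. The following is a minimal sketch, assuming SOLR listens on the default port 8983 and that the handler is mapped to /update/extract; the core name mycore and the helper extract_url are hypothetical and not part of the original setup.

```shell
# Hypothetical helper: build the extract-only URL for a given SOLR base URL
# and core name. extractOnly=true asks SOLR to return the extracted content
# instead of indexing it.
extract_url() {
  printf '%s/solr/%s/update/extract?extractOnly=true&wt=json\n' "$1" "$2"
}

# Example call (commented out because it needs a running SOLR instance;
# "mycore" is an assumed core name):
# curl "$(extract_url http://localhost:8983 mycore)" -F "myfile=@test.jpg"
```

If the response contains no recognized text for test.jpg while the tesseract command on its own does, the problem is likely on the SOLR/TIKA side rather than in Tesseract itself.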

Has anybody done such a configuration?

Answer 1

Good day. We solved the problem. Here is what was used and changed.

Our installation used Tesseract version 3.05, TIKA version 1.17, and SOLR version 7.4. We actually had TIKA version 1.17, not 1.18.

1. Changed from HOCR to TXT in the file parseContext.xml.
2. Had to start SOLR as the root user.

Tesseract version 4.1.1 is not compatible with TIKA 1.17, so we will upgrade SOLR to version 7.7 and TIKA to version 1.19, and then try to install Tesseract 4.1.1.
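
Needing root to start SOLR may point to a permission problem, although the answer does not confirm this. The following is a minimal sketch for checking whether a given account can find the tesseract binary and read the tessdata directory; the helper name can_use_ocr is hypothetical, and the idea that permissions were the cause is an assumption.

```shell
# Hypothetical helper: report whether the current user can find a binary on
# PATH and can read and traverse a data directory.
can_use_ocr() {
  bin="$1"
  data_dir="$2"
  # The binary must be resolvable on the PATH of the calling user.
  command -v "$bin" >/dev/null 2>&1 || { echo "missing: $bin"; return 1; }
  # The data directory must be readable and traversable.
  { [ -r "$data_dir" ] && [ -x "$data_dir" ]; } || { echo "unreadable: $data_dir"; return 2; }
  echo "ok"
}
```

Run as the account that starts SOLR, a call such as can_use_ocr tesseract /usr/share/tesseract/4/tessdata/ prints ok when both checks pass.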

Answer 1
1 2020-01-22 08:34:54