Using Tesseract OCR with Solr 9.1

I had a setup running where I could extract in Solr (8.11.2 with Tika 1.27) and get OCR from Tesseract (5.2.0).

To do this I had updated TesseractOCRConfig.properties inside tika-parsers-1.27.jar with:

tesseractPath=C:/Tesseract-OCR
tessdataPath=C:/Tesseract-OCR/tessdata/
language=dan
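
Before digging into Solr itself, it can help to confirm that the three values above point at things that exist on disk. The following is a minimal sketch, not part of Tika or Solr: the class `TesseractConfigCheck` and its methods are hypothetical, and it assumes the executable sits directly in `tesseractPath` and that the language data file is named `<language>.traineddata` inside `tessdataPath`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper for illustration only; it is not part of Tika or Solr.
public class TesseractConfigCheck {

    // Parse the text of a TesseractOCRConfig.properties file.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Describe which files the configuration is expected to resolve to
    // and whether they are present on this machine.
    static String report(Properties props) {
        boolean windows = System.getProperty("os.name").startsWith("Windows");
        // Assumption: the executable lives directly inside tesseractPath.
        Path exe = Paths.get(props.getProperty("tesseractPath", ""))
                .resolve(windows ? "tesseract.exe" : "tesseract");
        // Assumption: language data is stored as <language>.traineddata.
        Path data = Paths.get(props.getProperty("tessdataPath", ""))
                .resolve(props.getProperty("language", "eng") + ".traineddata");
        return exe + " executable=" + Files.isExecutable(exe)
                + System.lineSeparator()
                + data + " readable=" + Files.isReadable(data);
    }

    public static void main(String[] args) throws IOException {
        String text = String.join("\n",
                "tesseractPath=C:/Tesseract-OCR",
                "tessdataPath=C:/Tesseract-OCR/tessdata/",
                "language=dan");
        System.out.println(report(parse(text)));
    }
}
```

Note that this only checks the file system from a plain JVM. It says nothing about what a JVM running under a security manager is allowed to execute.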

I am now trying to replicate the setup with Solr 9.1 (Tika 1.28.4) and the same Tesseract installation. The files are getting extracted, but I am not getting any OCR.

In 9.1.0 I am getting the following when extracting a jpg file:

  "x_parsed_by":["org.apache.tika.parser.DefaultParser",
                 "org.apache.tika.parser.jpeg.JpegParser"],

In a setup with 8.11.2 I am getting the following when extracting the same jpg:

    "x_parsed_by":["org.apache.tika.parser.DefaultParser",
                   "org.apache.tika.parser.ocr.TesseractOCRParser",
                   "org.apache.tika.parser.jpeg.JpegParser"],

Turn off the security manager that is on by default in 9.x. This can be done by setting the environment variable:

SOLR_SECURITY_MANAGER_ENABLED=false

The issue is that org.apache.tika.parser.ocr.TesseractOCRParser requires execution rights on the folder where Tesseract is installed.

When determining whether TesseractOCRParser should be loaded, Tika checks whether it can locate and call Tesseract based on the configuration. The check method used to see if it can execute an external parser catches SecurityException, among other exceptions, and just returns false without any logging. So there is no sign that something is configured wrong, even if you turn up logging.
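
The silent failure described above can be illustrated with a small sketch. This is not Tika's source code: `ExternalToolCheck` and both methods are hypothetical names, and the command `tesseract --version` is only an assumed probe. The first method mirrors the pattern of swallowing the exception and returning false. The second reports the exception instead, which is what you would want when diagnosing a missing OCR parser.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the pattern; not Tika's actual implementation.
public class ExternalToolCheck {

    // Mirrors the described behaviour: any failure becomes a bare "false".
    static boolean checkSilently(String... cmd) {
        try {
            return run(cmd) == 0;
        } catch (IOException | SecurityException e) {
            return false; // the reason for the failure is lost here
        }
    }

    // Same check, but the caller gets to see why it failed.
    static String checkVerbosely(String... cmd) {
        try {
            int exit = run(cmd);
            return exit == 0 ? "OK" : "FAILED: exit code " + exit;
        } catch (IOException | SecurityException e) {
            return "FAILED: " + e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    // Start the process, discard its output, and wait up to 10 seconds.
    private static int run(String... cmd) throws IOException {
        Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start(); // throws SecurityException if a security manager denies exec
        try {
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("timed out");
            }
            return p.exitValue();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted", e);
        }
    }

    public static void main(String[] args) {
        // Assumed probe command; pass the full path to your executable to override.
        String[] cmd = args.length > 0 ? args : new String[] {"tesseract", "--version"};
        System.out.println("silent:  " + checkSilently(cmd));
        System.out.println("verbose: " + checkVerbosely(cmd));
    }
}
```

Run from a plain command line, this only tells you whether the executable can be started at all. A SecurityException would only show up when the same call is made inside a JVM that has a security manager installed, as Solr 9.x does by default.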
