Using Tesseract OCR with Solr 9.1
I had a setup running where I could extract in Solr (8.11.2 with Tika 1.27) and get OCR from Tesseract (5.2.0).
To do this I had updated TesseractOCRConfig.properties inside tika-parsers-1.27.jar with:
tesseractPath=C:/Tesseract-OCR
tessdataPath=C:/Tesseract-OCR/tessdata/
language=dan
I am now trying to replicate the setup with Solr 9.1 (Tika 1.28.4) and the same Tesseract installation. The files are getting extracted, but I am not getting any OCR.
In 9.1.0 I get the following when extracting a jpg file:
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.jpeg.JpegParser"],
In the 8.11.2 setup I get the following when extracting the same jpg:
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.ocr.TesseractOCRParser",
"org.apache.tika.parser.jpeg.JpegParser"],
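The difference between the two responses can also be checked in a script rather than by eye. This is a minimal sketch, assuming the extract response has already been parsed into a dict and that the parser list is in the x_parsed_by field as shown above; the function name ocr_parser_used is hypothetical.

```python
# Fully qualified class name of the OCR parser, as it appears in the 8.11.2 output.
TESSERACT_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser"


def ocr_parser_used(doc: dict, field: str = "x_parsed_by") -> bool:
    """Return True if the Tesseract parser is listed among the parsers used."""
    parsers = doc.get(field, [])
    # Tolerate a single string value as well as a list of strings.
    if isinstance(parsers, str):
        parsers = [parsers]
    return TESSERACT_PARSER in parsers


# The two parser lists quoted in the question.
solr_9_doc = {
    "x_parsed_by": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.jpeg.JpegParser",
    ]
}
solr_8_doc = {
    "x_parsed_by": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.ocr.TesseractOCRParser",
        "org.apache.tika.parser.jpeg.JpegParser",
    ]
}

print(ocr_parser_used(solr_9_doc))  # False
print(ocr_parser_used(solr_8_doc))  # True
```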
Turn off the security manager that is on by default in 9.x. This can be done by setting the environment variable:
SOLR_SECURITY_MANAGER_ENABLED=false
The issue is that org.apache.tika.parser.ocr.TesseractOCRParser requires execution rights on the folder where Tesseract is installed.
When determining whether TesseractOCRParser should be loaded, Tika checks whether it can locate and call Tesseract based on the configuration. The check method used to see if it can execute an external parser catches SecurityException, among other exceptions, and just returns false without any logging, so there is no sign that something is configured wrong even if you turn up logging.
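Since the check fails silently, it can help to run an equivalent probe yourself and see the reason for the failure. The following is a rough sketch of that idea and not Tika's actual code: it tries to run `tesseract --version` from the configured folder and logs why the call failed instead of only returning false. The function name can_run_tesseract and the logger name are made up for illustration, and a probe run from your own shell is not subject to Solr's security manager, so it only rules out path and file permission problems.

```python
import logging
import subprocess
from pathlib import Path

log = logging.getLogger("tesseract_probe")


def can_run_tesseract(tesseract_dir: str, timeout: float = 10.0) -> bool:
    """Try to execute tesseract from tesseract_dir and log the reason on failure."""
    # On Windows the ".exe" suffix is appended by the OS when it is missing.
    exe = Path(tesseract_dir) / "tesseract"
    try:
        result = subprocess.run(
            [str(exe), "--version"],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError covers both a missing file and missing execute permission.
        log.warning("Cannot execute %s: %s", exe, exc)
        return False
    if result.returncode != 0:
        log.warning("%s exited with code %d", exe, result.returncode)
        return False
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Same folder as tesseractPath in the question.
    print(can_run_tesseract("C:/Tesseract-OCR"))
```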