Training Tesseract OCR for ambiguities
I am pretty new to data scraping and I am facing a minor issue. I am trying to extract text from a Hindi PDF using textract and Tesseract OCR. Following is the code in Python:
import textract
text = textract.parsers.process("test.pdf", encoding='utf_8', method='tesseract', language = 'hin')
Now, many of the words from the PDF are extracted correctly. However, some of the text comes out garbled. I read the documentation on how ambiguities can be overridden by using a lang.unicharambigs file. However, I need to run combine_tessdata in order to actually bring it into effect and override certain trained data.
When I try to run the command, I get the following:
-bash: combine_tessdata: command not found
I installed tesseract from source, and I can't understand why this is happening. Any ideas on how to troubleshoot this?
Thanks in advance!
The Tesseract training executables (combine_tessdata among them) are built separately from the main tesseract binary.
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
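To confirm whether the training tools made it onto your PATH, a small check like the following may help. It is only a sketch: the helper names `find_tool` and `override_command` are made up for illustration, the file names `hin.traineddata` and `hin.unicharambigs` are assumed examples, and the `-o` (overwrite component) form of combine_tessdata should be checked against the usage text of your installed version. With an autotools source build, the training tools are typically built and installed with `make training` followed by `sudo make training-install`.

```python
import shutil


def find_tool(name="combine_tessdata"):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


def override_command(traineddata, component):
    """Build (but do not run) a combine_tessdata call that overwrites one component.

    Assumes the `-o` overwrite form; verify against your version's usage text.
    """
    return ["combine_tessdata", "-o", traineddata, component]


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        # Same situation as "-bash: combine_tessdata: command not found"
        print("combine_tessdata not found; build and install the training tools first")
    else:
        print("found:", path)
        # Hypothetical file names, for illustration only
        cmd = override_command("hin.traineddata", "hin.unicharambigs")
        print("would run:", " ".join(cmd))
```

If the tool is reported as missing even after installing, it may have been installed under a prefix such as /usr/local/bin that is not on your PATH.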