Training Tesseract OCR for ambiguities

I am pretty new to data scraping and I am facing a minor issue.

I am trying to extract text from a Hindi PDF using textract and Tesseract OCR. Following is the code in Python:

import textract

text = textract.parsers.process("test.pdf", encoding='utf_8', method='tesseract', language='hin')

Now, many of the words from the PDF are extracted correctly. However, some of them come out garbled. I read in the documentation that ambiguities can be overridden by using a lang.unicharambigs file. However, I need to run combine_tessdata to actually bring it into effect and override the corresponding part of the trained data.

When I try to run the command, I get the following:

 -bash: combine_tessdata: command not found

I have installed tesseract from source and I can't seem to understand why this is happening. Any ideas on how to troubleshoot this?

Thanks in advance!

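Tesseract training executables are built separately. combine_tessdata is one of them, so a plain make and make install of the main program does not put it on your PATH. In an autotools build from source, the training tools are typically built with make training and installed with sudo make training-install; see the compiling instructions for the exact steps and dependencies: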

https://github.com/tesseract-ocr/tesseract/wiki/Compiling
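
Once the training tools are installed, a quick way to confirm the fix from Python is to check whether combine_tessdata is on the PATH before calling it. The following is only a minimal sketch: the helper names find_training_tool and build_overwrite_command are made up for illustration, and the file names hin.traineddata and hin.unicharambigs are assumptions about how your files are named. It relies on the -o (overwrite) option of combine_tessdata, which replaces the given component inside an existing traineddata file.

```python
import shutil


def find_training_tool(name="combine_tessdata"):
    """Return the full path of a Tesseract training tool, or None if it is not on PATH."""
    return shutil.which(name)


def build_overwrite_command(traineddata, component, tool="combine_tessdata"):
    """Build the argument list that replaces one component inside a .traineddata file."""
    # -o tells combine_tessdata to overwrite the matching component in place.
    return [tool, "-o", traineddata, component]


tool_path = find_training_tool()
if tool_path is None:
    # Same situation as the "command not found" error in the question.
    print("combine_tessdata is not on PATH; build and install the training tools first.")
else:
    # Hypothetical file names; adjust them to your tessdata directory.
    command = build_overwrite_command("hin.traineddata", "hin.unicharambigs", tool_path)
    print("Would run:", " ".join(command))
```

The sketch only prints the command instead of executing it, so it is safe to run before the files exist; passing the list to subprocess.run(command, check=True) would perform the actual overwrite.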
