Training Tesseract OCR to resolve ambiguities

Question

I am pretty new to data scraping and I am facing a minor issue.

I am trying to extract text from a Hindi PDF using textract and Tesseract OCR. Here is the Python code:

import textract

text = textract.parsers.process("test.pdf", encoding='utf_8', method='tesseract', language='hin')

Many of the words in the PDF are extracted correctly, but some come out garbled. I read in the documentation that ambiguities can be overridden with a lang.unicharambigs file. However, I need to run combine_tessdata for the override to take effect and replace the relevant part of the trained data.

When I try to run the command, I get the following:

 -bash: combine_tessdata: command not found

I installed Tesseract from source, and I can't work out why this is happening. Any ideas on how to troubleshoot this?

Thanks in advance!

Answer 1

The Tesseract training executables, combine_tessdata among them, are built separately from the main tesseract binary.

https://github.com/tesseract-ocr/tesseract/wiki/Compiling
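To confirm that this is the cause, one option is to check which training tools are actually on the PATH. The snippet below is a minimal sketch, not part of the original answer. The helper names missing_training_tools and overwrite_command are hypothetical, the file names hin.traineddata and hin.unicharambigs are assumed from language='hin' in the question, and the make targets in the hint assume an autotools build as described on the Compiling page linked above.

```python
import shutil

# A few of the executables that belong to the Tesseract training tools.
TRAINING_TOOLS = ["combine_tessdata", "unicharset_extractor", "text2image"]


def missing_training_tools(tools=TRAINING_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def overwrite_command(traineddata, component):
    """Build the combine_tessdata call that overwrites one component
    (for example a .unicharambigs file) inside a .traineddata file."""
    return ["combine_tessdata", "-o", traineddata, component]


missing = missing_training_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
    # Assumption: autotools build, run from the Tesseract source tree.
    print("Try: make training && sudo make training-install")
else:
    # Assumed file names; adjust to the real tessdata location.
    cmd = overwrite_command("hin.traineddata", "hin.unicharambigs")
    print("Repack with:", " ".join(cmd))
```

If the tools are reported missing even after building them, the install prefix (often /usr/local/bin) may simply not be on the shell's PATH.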
