How to use tessdata_best with Tesseract (pytesseract): what are the arguments and procedure?
TL;DR: How do I install tessdata_best to use with pytesseract inside conda on Ubuntu 18?
I have been using pytesseract inside a conda environment for quite some time, but I need to improve the accuracy, and I found out that tessdata_best gives the best accuracy. How can I install and use that version? I am using Ubuntu 18 and have to work with pytesseract.
Tesseract is installed at /usr/share/tesseract-ocr/, and inside it there is only one tessdata directory.
Do I need to get tessdata_best from GitHub and copy it into /usr/share/tesseract-ocr/, alongside tessdata?
Even then, if I want to use tessdata_best, what do I have to use? Do I need to change the config to --oem 0/1/2/3?
Third and last, my language.traineddata files are at /home/deshwal/anaconda3/envs/py36/share/tessdata/eng.traineddata. Do I need to paste tessdata_best at this location too? I ask because when I try to change the language directory, I get the following error:
/home/deshwal/anaconda3/envs/py36/share/tessdata/equ.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'equ'
Tesseract couldn't load any languages!
Could not initialize tesseract.
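The error above says that a requested language file (equ.traineddata) could not be loaded from the data directory. As a rough diagnostic sketch, the following Python helper lists which of the requested language codes have no matching .traineddata file in a given directory. The function name missing_languages is hypothetical and not part of pytesseract; the sketch only inspects file names and does not call Tesseract.

```python
from pathlib import Path


def missing_languages(tessdata_dir, lang):
    """Return the codes in `lang` (e.g. "eng+equ") that have no
    matching <code>.traineddata file directly inside `tessdata_dir`."""
    # Collect the language codes available as files, e.g. {"eng", "osd"}.
    available = {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}
    # Tesseract combines several languages with "+", so split on it.
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in available]
```

For example, calling missing_languages("/home/deshwal/anaconda3/envs/py36/share/tessdata", "eng+equ") would report any of the two codes whose file is absent from that directory.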
I'm not sure I understand your question fully, but let me know if the following helps. You need to set the data path to the location where you copy the tessdata_best trained models. For example:
Tesseract tesseract = new Tesseract(); // JNA Interface Mapping
tesseract.setDatapath("/home/tesseract/tessdata_best_4_0_0/tessdata");
All the .traineddata files you downloaded from https://github.com/tesseract-ocr/tessdata_best should be placed in the directory you pass to setDatapath (for example, /home/tesseract/tessdata_best_4_0_0/tessdata).
Please note: these models only work with the LSTM OCR engine of Tesseract 4, so make sure you are using library version 4.1 or above.
Regards, Maulik
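Since the question is about pytesseract, here is a minimal Python sketch of the "download the .traineddata files into a directory of your choice" step described above. It assumes the raw files are served from the repository's main branch; the helper names model_url and download_model and the destination directory are hypothetical, chosen only for illustration.

```python
import urllib.request
from pathlib import Path

# Assumption: raw model files are reachable under the "main" branch.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main"


def model_url(lang):
    """Build the download URL for one language model, e.g. "eng"."""
    return f"{BASE_URL}/{lang}.traineddata"


def download_model(lang, dest_dir):
    """Download <lang>.traineddata into dest_dir unless it already exists.

    Returns the path of the local model file.
    """
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{lang}.traineddata"
    if not target.exists():
        urllib.request.urlretrieve(model_url(lang), str(target))
    return target
```

A call such as download_model("eng", "~/tessdata_best") would then leave eng.traineddata in a separate directory, so the tessdata directory that ships with the system or conda package stays untouched.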
According to the pytesseract documentation, you can use the config argument with --tessdata-dir, as follows:
# Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
# It's important to add double quotes around the dir path.
tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)
For more details, see https://pypi.org/project/pytesseract/.
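Putting this together for tessdata_best, a minimal sketch might look like the following. It assumes the tessdata_best models were copied into a directory of your choice and that you want the LSTM engine, which the tesseract command line selects with --oem 1. The helper names best_config and ocr_with_best are hypothetical and not part of pytesseract.

```python
def best_config(tessdata_dir, oem=1):
    """Build a pytesseract config string pointing at a tessdata_best directory.

    oem=1 selects the LSTM engine; the double quotes keep paths that
    contain spaces intact when the string is split into arguments.
    """
    return f'--tessdata-dir "{tessdata_dir}" --oem {oem}'


def ocr_with_best(image_path, tessdata_dir, lang="eng"):
    """Run OCR on one image file using the models in tessdata_dir."""
    # Imported lazily so best_config can be used without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=best_config(tessdata_dir)
        )
```

With this approach the conda environment's own share/tessdata directory does not need to be modified; the directory passed in only needs to contain the .traineddata file for each requested language.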