
python-tesseract OCR: get digits only

I'm using tesseract OCR with python-tesseract. In the tesseract FAQ, regarding digits, we have:

Use

TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");

BEFORE calling an Init function or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789

and then your command line becomes:

tesseract image.tif outputbase nobatch digits

Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.
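For anyone who prefers the command-line route from Python, here is a minimal sketch that wraps the command quoted above. The helper names `digits_command` and `run_digits_ocr` are hypothetical, and it assumes the `tesseract` executable is on PATH and that the `digits` config file described in the FAQ exists.

```python
import subprocess


def digits_command(image_path, output_base):
    """Build the argument list for the FAQ's digits-only command line."""
    # "nobatch" and "digits" are the config names quoted from the FAQ above.
    return ["tesseract", image_path, output_base, "nobatch", "digits"]


def run_digits_ocr(image_path, output_base):
    """Run tesseract and return the recognized text.

    Assumes the tesseract executable is on PATH; tesseract writes its
    result to <output_base>.txt.
    """
    subprocess.run(digits_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()


# Example call (needs tesseract installed and an image.tif in the
# current directory):
# text = run_digits_ocr("image.tif", "outputbase")
```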

In python-tesseract, the SetVariable method exists. I've tried this, but the result of the OCR is the same:

import tesseract

api = tesseract.TessBaseAPI()
# SetVariable is called before Init here, as the FAQ suggests
api.SetVariable("tessedit_char_whitelist", "0123456789")
api.Init('.','eng',tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

Has anyone already got this working, or should I consider it a bug in python-tesseract?

OK, got it working. According to this (unofficial?) documentation of tesseract-ocr, SetVariable() must be called after Init(), even though the official FAQ says the opposite. Calling it after Init() works as intended.
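A minimal sketch of the corrected call order, reusing only the calls from the question. The helper name `init_digits_api` and its parameters are hypothetical; the python-tesseract module is passed in as an argument so the ordering is easy to check without the library installed.

```python
def init_digits_api(tess, datapath=".", lang="eng"):
    """Return a TessBaseAPI restricted to digits.

    `tess` is assumed to be the imported python-tesseract module
    (i.e. the result of `import tesseract`).
    """
    api = tess.TessBaseAPI()
    # Init() comes first ...
    api.Init(datapath, lang, tess.OEM_DEFAULT)
    # ... and only then the whitelist, which is the order that worked.
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    api.SetPageSegMode(tess.PSM_AUTO)
    return api


# Usage with the real library (requires python-tesseract to be installed):
# import tesseract
# api = init_digits_api(tesseract)
```

The only change from the code in the question is that the SetVariable() line moves below Init().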
