I have to analyze an image that contains both English and Japanese text. When I run tesseract with the default language (-l eng), some Japanese characters are lost. If I run tesseract with Japanese instead (-l jpn), some English characters are lost (e.g. "Email").
How can I run a single process that recognizes both English and Japanese characters?
Since tesseract 3.02 it is possible to specify multiple languages for the -l parameter.
-l lang The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes.
An example:
tesseract myscan.png out -l deu+eng
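For the English plus Japanese case in the question, the same idea can be scripted. The following is only a minimal sketch: the helper names `build_tesseract_command` and `run_tesseract` are made up for illustration, and it assumes the `tesseract` binary is on the PATH and that the `eng` and `jpn` language data are installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Return the argv list for a multi-language tesseract run.

    Hypothetical helper, not part of tesseract itself.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract expects the codes joined with '+', e.g. "eng+jpn".
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


def run_tesseract(image_path, output_base, languages):
    """Run tesseract; the recognized text is written to output_base + '.txt'."""
    cmd = build_tesseract_command(image_path, output_base, languages)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(cmd, check=True)
```

Calling `run_tesseract("myscan.png", "out", ["eng", "jpn"])` would then be equivalent to running `tesseract myscan.png out -l eng+jpn` by hand.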
Try this:
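import pytesseract
from langdetect import detect_langs

# img is an image that has already been loaded (e.g. a PIL image)
custom_config = r'-l eng+jpn --psm 6'
txt = pytesseract.image_to_string(img, config=custom_config)
# Optional: estimate which languages appear in the recognized text
print(detect_langs(txt))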
Note: you have to install langdetect by using:
pip install langdetect
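A common failure with this approach is that the `jpn` language data is simply not installed. The sketch below checks for that before running OCR. It assumes a pytesseract version that provides `get_languages()`, and it passes the languages through the `lang` parameter of `image_to_string` rather than through the config string. The names `missing_languages` and `ocr_multilang` are hypothetical helpers, not part of pytesseract.

```python
def missing_languages(requested, installed):
    """Return the requested language codes that are not installed, in order."""
    installed_set = set(installed)
    return [code for code in requested if code not in installed_set]


def ocr_multilang(img, languages=("eng", "jpn"), psm=6):
    """OCR an image with several languages at once (hypothetical helper)."""
    # Imported here so the helper above can be used without pytesseract.
    import pytesseract

    # Assumption: get_languages() is available in the installed pytesseract.
    installed = pytesseract.get_languages(config="")
    missing = missing_languages(languages, installed)
    if missing:
        raise RuntimeError("missing tesseract language data: " + ", ".join(missing))

    # lang takes the same '+'-separated form as the -l command-line option.
    return pytesseract.image_to_string(
        img, lang="+".join(languages), config=f"--psm {psm}"
    )
```

With this, `txt = ocr_multilang(img)` should behave like the snippet above, but it fails early with a readable message when a language pack is absent.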