
Running multiple tesseract instances in parallel using multiprocessing not returning any results

I'm writing a Python script that uses the multiprocessing library to launch multiple tesseract instances in parallel. When I make multiple calls to tesseract in sequence using a loop, it works. However, when I try to parallelize the code, everything looks fine but I don't get any results (I waited for 10 minutes).

In my code I try to OCR multiple PDF pages after splitting them from the original multi-page PDF.

Here's my code:

import subprocess
from multiprocessing import Pool

def processPage(i):
    nameJPG = "converted-" + str(i) + ".jpg"
    nameHocr = "converted-" + str(i)
    # Run tesseract on one page image and write the hOCR output
    subprocess.check_call(["tesseract", nameJPG, nameHocr, "-l", "eng", "hocr"])
    print("tesseract did the job for page " + str(i + 1))

pool1 = Pool(4)
pool1.map(processPage, range(len(pdf.pages)))

Your code is launching a Pool and exiting before it finishes its job. You need to call close and join.

pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
pool1.close()
pool1.join()
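For reference, here is a self-contained sketch of the same close/join pattern. It is only an illustration under a few assumptions that are not in the question: the page images already exist as converted-<i>.jpg numbered from 0 with no gaps, the page count is taken from those files rather than from pdf.pages, and the names build_command, process_page and main are made up for this example. The `if __name__ == "__main__":` guard is also an addition; the multiprocessing documentation requires it on platforms that start workers by spawning a new interpreter (Windows, for example).

```python
import glob
import subprocess
from multiprocessing import Pool


def build_command(i):
    """Return the tesseract argv for page i (same file naming as the question)."""
    name_jpg = "converted-" + str(i) + ".jpg"
    name_hocr = "converted-" + str(i)
    return ["tesseract", name_jpg, name_hocr, "-l", "eng", "hocr"]


def process_page(i):
    # check_call raises CalledProcessError if tesseract exits with a non-zero status
    subprocess.check_call(build_command(i))
    return i


def main():
    # Assumption: pages were already written as converted-0.jpg, converted-1.jpg, ...
    page_count = len(glob.glob("converted-*.jpg"))
    if page_count == 0:
        print("no converted-*.jpg files found, nothing to do")
        return
    pool = Pool(4)
    try:
        done = pool.map(process_page, range(page_count))
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    print("finished pages: " + str(done))


if __name__ == "__main__":
    main()
```

Returning the page index from the worker gives the parent process something concrete to check once the pool has been joined.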

Alternatively, you can wait for its results.

pool1 = Pool(4)
print(pool1.map(processPage, range(len(pdf.pages))))

As far as I know, pytesseract does not cope well with multiple processes. If you have a quad-core machine and run 4 processes simultaneously, Tesseract will choke and you will see high CPU usage and other problems.

If you need this for a company and don't want to use the Google Vision API, you have to set up multiple servers and use socket programming to request text from the different servers, so that the number of parallel processes on each server stays below what it can run at the same time. For a quad core that means 2 or 3. Otherwise you can use the Google Vision API; they have a lot of servers and their output is quite good too.

Disabling multithreading in Tesseract will also help. It can be done by setting OMP_THREAD_LIMIT=1 in the environment. Even so, you should avoid running multiple Tesseract processes on the same server.

See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
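One way to apply the OMP_THREAD_LIMIT=1 setting from Python is to pass a modified environment to the subprocess call, roughly as sketched below. The helper names tesseract_env and pool_size are hypothetical, and leaving one core free by default is only an assumption that matches the "2 or 3 for a quad core" rule of thumb above; tune it for your own machine.

```python
import os
import subprocess


def tesseract_env(base_env=None):
    """Copy the environment and cap Tesseract's OpenMP threads at one."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def pool_size(cpu_count, reserve=1):
    """Hypothetical helper: leave `reserve` cores free, but keep at least one worker."""
    return max(1, cpu_count - reserve)


def process_page(i):
    name_jpg = "converted-" + str(i) + ".jpg"
    name_hocr = "converted-" + str(i)
    # Each tesseract process inherits the single-thread limit through env
    subprocess.check_call(
        ["tesseract", name_jpg, name_hocr, "-l", "eng", "hocr"],
        env=tesseract_env(),
    )
```

With this in place, the pool could be created as Pool(pool_size(multiprocessing.cpu_count())), so that each worker runs one single-threaded tesseract process.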
