I'm using pytesseract to do some OCR with a multiprocessing approach.
The approach looks like this:
tess_api = PyTessBaseAPI()
Parallel(n_jobs=4)(delayed(execute)(image) for image in images)
with the function:
def execute(image):
    tess_api.SetImage(image)
    text = tess_api.GetUTF8Text()
    return text
This leads to a concurrency problem, because worker 1 could overwrite the image before worker 2 calls GetUTF8Text().
The idea now is to have one instance of PyTessBaseAPI per worker. The main idea would be to do something like:
tess_apis = [PyTessBaseAPI(), PyTessBaseAPI(), PyTessBaseAPI(), PyTessBaseAPI()]
and then hand tess_apis[0] to worker 0, but I don't know how to connect each worker to its instance. Any suggestions, or is there a better approach? As I have thousands of images, I don't want to create instances inside the execute function.
Use Pool(initializer=...) to initialize the Tesseract object once per worker process, before the workers start reading their job queue.
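import multiprocessing
from tesserocr import PyTessBaseAPI  # assumption: PyTessBaseAPI comes from the tesserocr package
tess_api = None
def initialize_worker():
    global tess_api
    tess_api = PyTessBaseAPI()  # one instance per worker process
def execute(image):
    tess_api.SetImage(image)
    return tess_api.GetUTF8Text()  # return the text so the parent process receives it
def main():
    with multiprocessing.Pool(initializer=initialize_worker) as p:
        # `images` is the iterable of input images from the question
        for result in p.imap_unordered(execute, images, chunksize=10):
            ...  # handle each result here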
This will only work if you're using actual processes. If you're using threads instead (which might work, considering Tesseract is C and would release the GIL), you could use contextvars / threading.local.
Pool(initializer=...) can work, as mentioned. However, if you want to do anything more complex, I'd suggest using Ray.
It can then be expressed as follows.
import ray
from tesserocr import PyTessBaseAPI  # assumption: PyTessBaseAPI comes from the tesserocr package
ray.init()
@ray.remote
class Worker(object):
    def __init__(self):
        self.tess_api = PyTessBaseAPI()
    def execute(self, image):
        self.tess_api.SetImage(image)
        return self.tess_api.GetUTF8Text()
# Create several Worker actors.
workers = [Worker.remote() for _ in range(4)]
# Execute tasks on them in parallel, assigning the images from the question to the actors round-robin.
result_ids = [workers[i % len(workers)].execute.remote(image) for i, image in enumerate(images)]
# Get the results
results = ray.get(result_ids)
You can read more about Ray in the documentation. Note that I help develop Ray.