
Python multiprocessing: create one instance per process and reuse it

I'm using pytesseract to do some OCR with a multiprocessing approach.

The approach looks like this:

tess_api = PyTessBaseAPI()
Parallel(n_jobs=4)(delayed(execute)(image) for image in images)

having the function:

def execute(image):
    tess_api.SetImage(image)
    text = tess_api.GetUTF8Text()

This leads to a concurrency problem, because worker 1 could overwrite the image before worker 2 calls GetUTF8Text().

The idea now is to have one instance of PyTessBaseAPI per worker. The main idea would be to do something like:

tess_apis = [PyTessBaseAPI(), PyTessBaseAPI(), PyTessBaseAPI(), PyTessBaseAPI()]

and then hand tess_apis[0] to worker 0, but I don't know how to connect each worker to its instance. Any suggestions, or is there a better approach? As I have thousands of images, I don't want to create an instance inside the execute function.

Use Pool(initializer=...) to initialize the Tesseract object once per worker process before they start reading their job queue.

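import multiprocessing

from tesserocr import PyTessBaseAPI  # PyTessBaseAPI is provided by the tesserocr package

tess_api = None  # set once per worker process by initialize_worker()

def initialize_worker():
    global tess_api
    tess_api = PyTessBaseAPI()  # one instance per worker process

def execute(image):
    tess_api.SetImage(image)
    return tess_api.GetUTF8Text()  # return the text so the parent process receives it

def main():
    with multiprocessing.Pool(initializer=initialize_worker) as p:
        # imap_unordered takes the function first, then the iterable
        for result in p.imap_unordered(execute, images, chunksize=10):
            ...  # handle each result here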

This only works if you're using actual processes; if you're using threads instead (which might work, considering Tesseract is C and would release the GIL), you could use contextvars / threading.local.
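For the thread-based variant, a minimal sketch could look like the following. It assumes a ThreadPoolExecutor with an initializer that stores one engine per thread in a threading.local. FakeOcrEngine, run_all and n_workers are hypothetical names used only so the example runs without Tesseract installed; in real code the stand-in would be replaced by PyTessBaseAPI(), and whether threads actually give a speed-up depends on the GIL being released during recognition.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeOcrEngine:
    """Hypothetical stand-in for PyTessBaseAPI so the sketch runs without Tesseract."""

    def SetImage(self, image):
        self._image = image

    def GetUTF8Text(self):
        return f"text of {self._image}"


# Attributes set on this object are visible only to the thread that set them.
_local = threading.local()


def initialize_worker():
    # Runs once in every worker thread before it picks up any job.
    _local.engine = FakeOcrEngine()


def execute(image):
    engine = _local.engine  # the instance owned by the current thread
    engine.SetImage(image)
    return engine.GetUTF8Text()


def run_all(images, n_workers=4):
    with ThreadPoolExecutor(max_workers=n_workers,
                            initializer=initialize_worker) as pool:
        # map() yields results in the same order as the inputs.
        return list(pool.map(execute, images))
```

Because every thread only ever touches its own engine, the SetImage / GetUTF8Text pair cannot be interleaved with another worker's call on the same object.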

Pool(initializer=...) can work as mentioned. However, if you want to do anything more complex, I'd suggest using Ray.

Then it can be expressed as follows.

import ray

ray.init()

@ray.remote
class Worker(object):
    def __init__(self):
        self.tess_api = PyTessBaseAPI()

    def execute(self, image):
        self.tess_api.SetImage(image)
        return self.tess_api.GetUTF8Text()

# Create several Worker actors.
workers = [Worker.remote() for _ in range(4)]

# Execute tasks on them in parallel, assigning images to the workers round-robin.
result_ids = [workers[i % len(workers)].execute.remote(image)
              for i, image in enumerate(images)]

# Get the results
results = ray.get(result_ids)

You can read more about Ray in the documentation. Note that I help develop Ray.
