
Python Multiprocessing: Pool.map() seems to not call function at all

I'm quite new to multithreading, so I'm sorry if this is basic. I have a function that OCRs image files, and I want to run it on several files in parallel. The function does not return anything; it only saves the OCR text of each image to a text file. The code is as follows:

start_time = time.time()
path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test'
listfiles = os.listdir(path)

filterfiles = [p for p in listfiles if p[-4:] == '.tif']

pool = Pool(processes=2)

result = pool.map(OCRimage,filterfiles)

pool.close()
pool.join()

print("--- %s seconds ---" % (time.time() - start_time))

When I run the code, it seems to get stuck on pool.map(). I ran it for 30 minutes, which is far longer than the trial run took, and it didn't produce a single output. I tested my function OCRimage, and it doesn't seem to have been entered even once (I put print(1) as the first line of OCRimage). I'm wondering if someone could help me out. Thanks,

Cameron

EDIT (added OCRimage function):

The OCRimage function looks like this:

def OCRimage(f):
    #This runs the magick bash script which splits a multi-image tif into multiple single image tiffs
    process = subprocess.Popen(["magick", path + "\\" + f, path + "\\temp\\%d.tif"], shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    print(process.communicate()[0])

    #finds the number of pages for each tiff file (this might not be necessary, but os.listdir could return the files in arbitrary order)
    max1 = -1
    for filename in os.listdir(path+'\\temp'):    
        if (max1 < int(filename[0:-4])):
            max1 = int(filename[0:-4])
    max1 = max1 + 1

    text = ""
    for each in range(0,max1):
        im = Image.open(path + "\\temp\\"+ str(each) + ".tif")
        text = text + pytesseract.image_to_string(im)
    with open(path + "\\result\\OCR-"+f[0:-4]+".txt", 'w') as file:
        file.write(text)    

    for f in os.listdir(path+'\\temp'):
        os.remove(path + '\\temp\\' + f)

Edit2: Here are all the imports

import time
import subprocess
import os
import pytesseract
from PIL import Image

from multiprocessing import Pool
import multiprocessing
countcpus = multiprocessing.cpu_count()

EDIT3:

Running just OCRimage(f) by itself works fine. Instead of the multithreading code I just use this:

path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test'
for p in os.listdir(path):
    OCRimage(p)

This is a Minimal, Complete, and Verifiable Example that seems to show that the problem must be in your OCRimage function (see the Windows section below for the real problem):

from multiprocessing import Pool

def OCRimage(file_name):
    print("file_name = %s" % file_name)

filterfiles = ["image%03d.tif" % n for n in range(5)]

pool = Pool(processes=2)
result = pool.map(OCRimage, filterfiles)

pool.close()
pool.join()

Output

file_name = image000.tif
file_name = image001.tif
file_name = image002.tif
file_name = image003.tif
file_name = image004.tif

I recommend these changes to the start of OCRimage :

def OCRimage(file_name):
    print("file_name = %s" % file_name)
    # os.path.join takes the path components as separate arguments, not a list
    src = os.path.join(path, file_name)
    dst = os.path.join(path, 'temp', '%d.tif')
    command_list = ['magick', src, dst]
    # This runs ImageMagick's magick command, which splits a multi-image tif
    # into multiple single image tiffs
    process = subprocess.Popen(command_list,
                               shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    output, errors = process.communicate()
    if process.returncode != 0:
        print("Image processing failed for %s: %s" % (file_name, errors))
        return
    # The rest of your code goes here

It is important to verify that the return code from the subprocess is zero. If it is not zero, you really want to look at the errors string.
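
As a rough sketch of that check in Python 3 (the print() calls in the question suggest Python 3), the helper below runs a command and hands back its return code together with the captured output. The name run_and_report is made up for illustration and is not part of the original code, and the demo command is just a child Python process that exits with status 3, so the sketch does not need ImageMagick installed.

```python
import subprocess
import sys


def run_and_report(command_list):
    """Run a command and return (returncode, stdout, stderr) as text.

    Hypothetical helper for illustration only.
    """
    completed = subprocess.run(
        command_list,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode the captured bytes to str
    )
    return completed.returncode, completed.stdout, completed.stderr


# Demo: a child Python process that writes to stderr and exits with status 3.
demo_command = [sys.executable, "-c",
                "import sys; sys.stderr.write('boom'); sys.exit(3)"]
code, output, errors = run_and_report(demo_command)
if code != 0:
    print("Command failed with return code %d: %s" % (code, errors))
```

Keeping stderr separate from stdout, as the recommended changes do, means the error text is not mixed in with whatever the command prints normally.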

Windows

When I ran the MCVE on Windows, I got this exception:

RuntimeError: 
            Attempt to start a new process before the current process
            has finished its bootstrapping phase.

            This probably means that you are on Windows and you have
            forgotten to use the proper idiom in the main module:

                if __name__ == '__main__':
                    freeze_support()
                    ...

            The "freeze_support()" line can be omitted if the program
            is not going to be frozen to produce a Windows executable.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main

When I changed the MCVE to create the pool inside a main() function that is only called under the if __name__ == '__main__': guard, it worked:

from multiprocessing import Pool

def OCRimage(file_name):
    print("file_name = %s" % file_name)

def main():
    filterfiles = ["image%03d.tif" % n for n in range(5)]
    pool = Pool(processes=2)
    result = pool.map(OCRimage, filterfiles)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
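
The guard matters because Windows starts each worker as a fresh Python interpreter that imports the main module again, so any Pool created at module level would be created again inside every worker. Here is a minimal sketch of how the script from the question might be restructured along the same lines. It makes a few assumptions: path is a placeholder (the question uses a long absolute Windows path), OCRimage is a stand-in that only returns a string instead of doing the real OCR work, and select_tif_files is a hypothetical helper wrapping the question's '.tif' filter.

```python
import os
import time
from multiprocessing import Pool

# Placeholder (assumption); the question uses an absolute Windows path here.
path = "."


def OCRimage(file_name):
    # Stand-in for the real OCR function; it only reports the file it was given.
    return "processed %s" % file_name


def select_tif_files(names):
    # Hypothetical helper: same filter as the question's p[-4:] == '.tif' test.
    return [name for name in names if name.endswith(".tif")]


def main():
    start_time = time.time()
    filterfiles = select_tif_files(os.listdir(path))
    pool = Pool(processes=2)
    result = pool.map(OCRimage, filterfiles)
    pool.close()
    pool.join()
    print(result)
    print("--- %s seconds ---" % (time.time() - start_time))


# Only the process that was started as a script creates the pool; workers
# that re-import this module skip this block.
if __name__ == "__main__":
    main()
```

The function definitions and imports stay at module level so that the workers can still find OCRimage when they import the module.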
