I'm quite new to multithreading, so I'm sorry if this is basic. I have a function that OCRs image files, and I want to multithread the task. The function does not return anything; it only saves the OCR text to a file. The code is as follows:
start_time = time.time()
path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test'
listfiles = os.listdir(path)
filterfiles = [p for p in listfiles if p[-4:] == '.tif']
pool = Pool(processes=2)
result = pool.map(OCRimage,filterfiles)
pool.close()
pool.join()
print("--- %s seconds ---" % (time.time() - start_time))
When I run the code, it seems to get stuck on pool.map(). I ran it for 30 minutes, which is far longer than the trial run took, and it didn't produce a single output. I tested my function OCRimage, and it didn't seem to enter the function a single time (I used print(1) as the first line of my OCRimage code). I'm wondering if someone could help me out. Thanks,
Cameron
EDIT (added OCRimage function):
The OCRimage function looks like this:
def OCRimage(f):
    #This runs the magick bash script which splits a multi-image tif into multiple single image tiffs
    process = subprocess.Popen(["magick", path + "\\" + f, path + "\\temp\\%d.tif"], shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    print(process.communicate()[0])
    #finds the number of pages for each tiff file (this might not be necessary, but listing the files in a directory could return them in arbitrary order)
    max1 = -1
    for filename in os.listdir(path+'\\temp'):
        if (max1 < int(filename[0:-4])):
            max1 = int(filename[0:-4])
    max1 = max1 + 1
    text = ""
    for each in range(0,max1):
        im = Image.open(path + "\\temp\\"+ str(each) + ".tif")
        text = text + pytesseract.image_to_string(im)
    with open(path + "\\result\\OCR-"+f[0:-4]+".txt", 'w') as file:
        file.write(text)
    for f in os.listdir(path+'\\temp'):
        os.remove(path + '\\temp\\' + f)
Edit2: Here are all the imports
import time
import subprocess
import os
import pytesseract
from PIL import Image
from multiprocessing import Pool
import multiprocessing
countcpus = multiprocessing.cpu_count()
EDIT3:
Running just OCRimage(f) by itself works fine. Instead of the multithreading code I just use this:
path = 'C:\\Users\\RNCZF01\\Documents\\Cameron-Fen\\Economics-Projects\\Patent-project\\similarity\\Patents\\OCR-test'
for p in os.listdir(path):
    OCRimage(p)
This is a Minimal, Complete, and Verifiable Example (MCVE) that seems to show that the problem must be in your OCRimage function (see the Windows section below for the real problem):
from multiprocessing import Pool
def OCRimage(file_name):
    print("file_name = %s" % file_name)
filterfiles = ["image%03d.tif" % n for n in range(5)]
pool = Pool(processes=2)
result = pool.map(OCRimage, filterfiles)
pool.close()
pool.join()
Output
file_name = image000.tif
file_name = image001.tif
file_name = image002.tif
file_name = image003.tif
file_name = image004.tif
I recommend these changes to the start of OCRimage:
def OCRimage(file_name):
    print("file_name = %s" % file_name)
    # os.path.join takes the path components as separate arguments, not a list
    src = os.path.join(path, file_name)
    dst = os.path.join(path, 'temp', '%d.tif')
    command_list = ['magick', src, dst]
    # This runs the magick bash script which splits a multi-image tif into
    # multiple single image tiffs
    process = subprocess.Popen(command_list,
                               shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    output, errors = process.communicate()
    if process.returncode != 0:
        print("Image processing failed for %s: %s" % (file_name, errors))
        return
    # The rest of your code goes here
It is important to verify that the return code from the subprocess is zero. If it is not zero, you really want to look at the errors string.
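As a minimal sketch of that return-code check, the snippet below runs a command and inspects the result. The helper name run_command is hypothetical, and a child Python process that exits with status 3 stands in for a failing magick call, so the example does not need ImageMagick installed. It also omits shell=True, which is an assumption made to keep the sketch portable.

```python
import subprocess
import sys


def run_command(command_list):
    """Run a command and return (returncode, output, errors) as text.

    Hypothetical helper for illustration; not part of the original code.
    """
    process = subprocess.Popen(command_list,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    output, errors = process.communicate()
    return process.returncode, output, errors


# Stand-in for a failing "magick" call: a child Python process that
# writes a message to stderr and exits with status 3.
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('cannot open file'); sys.exit(3)"]
returncode, output, errors = run_command(failing)
if returncode != 0:
    print("Command failed with code %d: %s" % (returncode, errors))
```

With stderr captured separately from stdout, the error text is available exactly when the return code says something went wrong.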
Windows
When I ran the MCVE on Windows, I got this exception:
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main
When I changed the MCVE to this, it worked:
from multiprocessing import Pool
def OCRimage(file_name):
    print("file_name = %s" % file_name)
def main():
    filterfiles = ["image%03d.tif" % n for n in range(5)]
    pool = Pool(processes=2)
    result = pool.map(OCRimage, filterfiles)
    pool.close()
    pool.join()
if __name__ == '__main__':
    main()
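The same guard could be applied to the script from the question roughly as follows. This is only a sketch: the OCRimage body is a placeholder, list_tif_files is a hypothetical helper, and reading the image directory from the command line is an assumption made so that the example does not hard-code a path. On Windows each worker process imports the main module, so anything the real worker relies on (such as a global path) has to be defined at module level or passed in as an argument, not created inside main().

```python
import os
import sys
from multiprocessing import Pool


def OCRimage(file_name):
    # Placeholder for the real OCR worker from the question.
    print("file_name = %s" % file_name)


def list_tif_files(path):
    """Return the sorted names of the .tif files in path (hypothetical helper)."""
    return sorted(p for p in os.listdir(path) if p.endswith('.tif'))


def main(path):
    filterfiles = list_tif_files(path)
    pool = Pool(processes=2)
    pool.map(OCRimage, filterfiles)
    pool.close()
    pool.join()


if __name__ == '__main__':
    # Assumption: the image directory is given as the first argument.
    if len(sys.argv) > 1 and os.path.isdir(sys.argv[1]):
        main(sys.argv[1])
```

Because the pool is created only inside main(), importing the module in a worker process no longer tries to start new processes.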