Running multiple tesseract instances in parallel using multiprocessing not returning any results

I'm writing a Python script that uses the multiprocessing library to launch multiple tesseract instances in parallel. When I call tesseract several times in sequence in a loop, it works. However, when I try to parallelize the code, everything looks fine but I don't get any results (I waited for 10 minutes).

In my code I try to OCR multiple PDF pages after splitting them out of the original multi-page PDF.

Here's my code:

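import subprocess
from multiprocessing import Pool

def processPage(i):
    # Input image and hOCR output base name for page i
    nameJPG="converted-"+str(i)+".jpg"
    nameHocr="converted-"+str(i)
    p=subprocess.check_call(["tesseract",nameJPG,nameHocr,"-l","eng","hocr"])
    print("tesseract did the job for page "+str(i+1))

# pdf is the multi-page PDF loaded earlier in the script (not shown)
pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))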

Your code launches a Pool and exits before the pool finishes its job. You need to call close and join.

pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
pool1.close()
pool1.join()
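
The close/join pattern above can be sketched as a slightly fuller script. This is only an illustration under assumptions: the helper names build_command, process_page and ocr_pages are hypothetical, the file names follow the question, and the try/finally wrapper is an addition that is not part of the original answer.

```python
import subprocess
from multiprocessing import Pool


def build_command(i):
    """Return the tesseract command line for page index i (file names as in the question)."""
    name_jpg = "converted-" + str(i) + ".jpg"
    name_hocr = "converted-" + str(i)
    return ["tesseract", name_jpg, name_hocr, "-l", "eng", "hocr"]


def process_page(i):
    # check_call raises CalledProcessError if tesseract exits with a non-zero status.
    subprocess.check_call(build_command(i))
    return i


def ocr_pages(page_count, workers=4):
    """Run process_page over all page indices, then shut the pool down cleanly."""
    pool = Pool(workers)
    try:
        done = pool.map(process_page, range(page_count))
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return done
```

On platforms that start worker processes by re-importing the main module (Windows, for example), ocr_pages(len(pdf.pages)) would need to be called from under an `if __name__ == "__main__":` guard.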

Alternatively, you can wait for its results.

pool1=Pool(4)
print(pool1.map(processPage, range(len(pdf.pages))))

As far as I know, pytesseract does not handle multiple processes well. If you have a quad-core machine and run 4 processes simultaneously, tesseract will choke and you will see high CPU usage, among other problems.

If you need this for a company and don't want to use the Google Vision API, you have to set up multiple servers and use socket programming to request text from the different servers, so that the number of parallel processes on each server stays below what that server can run at the same time. For a quad core, that should be 2 or 3. Otherwise, you can use the Google Vision API: they have a lot of servers and their output is quite good too.

Disabling multithreading in tesseract will also help. It can be done by setting OMP_THREAD_LIMIT=1 in the environment. But you should not run multiple tesseract processes on the same server.

See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
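
One way to read the suggestions above is to cap the number of worker processes below the core count and to pass OMP_THREAD_LIMIT=1 to each tesseract call. The sketch below assumes that reading. The helpers single_thread_env and worker_count are hypothetical names, and the "cores minus one" rule is only an illustration of the "2 or 3 on a quad core" advice, not a measured recommendation.

```python
import os
import subprocess
from multiprocessing import Pool, cpu_count


def single_thread_env(base_env=None):
    """Copy an environment mapping and set OMP_THREAD_LIMIT=1 in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def worker_count(cores=None):
    """Leave one core free (assumed rule of thumb), but always use at least one worker."""
    cores = cpu_count() if cores is None else cores
    return max(1, cores - 1)


def process_page(i):
    name = "converted-" + str(i)
    # env= hands the modified environment to this tesseract process only.
    subprocess.check_call(
        ["tesseract", name + ".jpg", name, "-l", "eng", "hocr"],
        env=single_thread_env(),
    )
    return i


def ocr_pages(page_count):
    pool = Pool(worker_count())
    try:
        return pool.map(process_page, range(page_count))
    finally:
        pool.close()
        pool.join()
```

Setting the variable per call through env= leaves the parent process's environment untouched. Exporting OMP_THREAD_LIMIT=1 in the shell before starting the script would have the same effect on every child process.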
