Running multiple tesseract instances in parallel using multiprocessing returns no results

Question

I'm writing a Python script that uses the multiprocessing library to launch multiple tesseract instances in parallel. When I call tesseract several times sequentially in a loop, it works. However, when I try to parallelize the code, everything looks fine but I don't get any results (I waited for 10 minutes).

In my code I try to OCR the individual PDF pages after splitting them out of the original multi-page PDF.

Here's my code:

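import subprocess
from multiprocessing import Pool

def processPage(i):
    # Input image and hOCR output base name for page i
    nameJPG="converted-"+str(i)+".jpg"
    nameHocr="converted-"+str(i)
    p=subprocess.check_call(["tesseract",nameJPG,nameHocr,"-l","eng","hocr"])
    print("tesseract did the job for page "+str(i+1))

# pdf is the multi-page PDF object, defined elsewhere in the script
pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))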

Answer 1

Your code launches a Pool and exits before the pool finishes its job. You need to call close and join.

pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
pool1.close()
pool1.join()

Alternatively, you can wait for its results.

pool1=Pool(4)
print(pool1.map(processPage, range(len(pdf.pages))))
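
For reference, the close and join calls can be put together into a self-contained script. This is only a sketch under assumptions: PAGE_COUNT is a hypothetical stand-in for len(pdf.pages), the helper names build_command and ocr_pages are made up for illustration, and the `if __name__ == "__main__":` guard is added because multiprocessing needs it on platforms that spawn worker processes (such as Windows). Whether the missing guard played a part in the asker's problem is not known.

```python
import os
import subprocess
from multiprocessing import Pool


def build_command(i):
    """Build the tesseract command for page index i, using the question's file names."""
    name_jpg = "converted-" + str(i) + ".jpg"
    name_hocr = "converted-" + str(i)
    return ["tesseract", name_jpg, name_hocr, "-l", "eng", "hocr"]


def process_page(i):
    # check_call raises CalledProcessError if tesseract exits with a non-zero status.
    subprocess.check_call(build_command(i))
    return i


def ocr_pages(indices, workers=4):
    indices = list(indices)
    if not indices:
        return []  # nothing to do, so no worker processes are started
    pool = Pool(workers)
    try:
        # map() hands the page indices to the workers and collects their return values.
        return pool.map(process_page, indices)
    finally:
        # As in the answer: stop accepting work, then wait for the workers to exit.
        pool.close()
        pool.join()


if __name__ == "__main__":
    # PAGE_COUNT is a hypothetical stand-in for len(pdf.pages) in the question.
    PAGE_COUNT = 4
    # Only pages whose JPG exists in the current directory are sent to tesseract.
    existing = [i for i in range(PAGE_COUNT) if os.path.exists(build_command(i)[1])]
    print("pages processed:", ocr_pages(existing))
```

Returning the page index from process_page gives map() something meaningful to collect, so a failed page shows up as an exception in the parent instead of a silent gap.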

Answer 2

As far as I know, tesseract does not cope well with multiple processes. If you have a quad-core machine and run 4 processes simultaneously, tesseract will choke and you will see high CPU usage and other problems.

If you need this for a company and you don't want to use the Google Vision API, you have to set up multiple servers and use socket programming to request text from the different servers, so that the number of parallel processes on each server stays below what that server can run at the same time. For a quad core, that should be 2 or 3. Otherwise you can use the Google Vision API; they have a lot of servers and their output is quite good too.

Disabling multithreading in tesseract will also help. It can be done by setting OMP_THREAD_LIMIT=1 in the environment. But you should not run multiple tesseract processes on the same server.

See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
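
A rough sketch of the two single-machine suggestions in this answer, setting OMP_THREAD_LIMIT=1 for each tesseract call and keeping the number of workers below the core count, might look like the following. The helper names tesseract_env, worker_count and ocr_pages are hypothetical, and "cores minus one" is just one reading of the "2 or 3 for a quad core" advice; it does not cover the multi-server setup the answer also describes.

```python
import os
import subprocess
from multiprocessing import Pool, cpu_count


def tesseract_env(base_env=None):
    """Copy the environment and limit tesseract to one OpenMP thread per process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def worker_count(cores=None):
    """Assumed policy: leave one core free, but always use at least one worker."""
    cores = cpu_count() if cores is None else cores
    return max(1, cores - 1)


def process_page(i):
    name_jpg = "converted-" + str(i) + ".jpg"
    name_hocr = "converted-" + str(i)
    # The env argument applies the thread limit to this tesseract process only.
    subprocess.check_call(
        ["tesseract", name_jpg, name_hocr, "-l", "eng", "hocr"],
        env=tesseract_env(),
    )
    return i


def ocr_pages(indices):
    indices = list(indices)
    if not indices:
        return []  # nothing to do, so no worker processes are started
    with Pool(worker_count()) as pool:
        return pool.map(process_page, indices)
```

When calling ocr_pages from a script, do so under an `if __name__ == "__main__":` guard, for example `ocr_pages(range(10))` for ten pages already split into converted-0.jpg through converted-9.jpg.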


Answer 1: 2017-06-20 18:19:22

Answer 2: 2020-11-12 07:06:02
