
Solved: Python multiprocessing imap BrokenPipeError: [Errno 32] Broken pipe pdftoppm

Let me first say that this is not a duplicate of the other similar questions, where people tend to manage the pool of workers more closely.

I have been struggling with the following exception, thrown by my code when using multiprocessing.Pool.imap:

  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/pool.py", line 122, in worker
    put((job, i, (False, wrapped)))
  File "/usr/local/bin/homebrew/Cellar/python@2/2.7.17/lib/python2.7/multiprocessing/queues.py", line 390, in put
    return send(obj)
IOError: [Errno 32] Broken pipe

It arises at various points while executing the following main program:

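    import multiprocessing as mp
    from functools import partial

    import pandas as pd

    # num_workers, lengthData, dataFrame and the folder variables are defined elsewhere
    pool = mp.Pool(num_workers)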
    # Calculate a good chunksize (based on implementation of pool.map)
    chunksize, extra = divmod(lengthData, 4 * num_workers)
    if extra:
        chunksize += 1

    func = partial(pdf_to_txt, input_folder=inputFolder, junk_folder=imageJunkFolder, out_folder=outTextFolder,
                   log_name=log_name, log_folder=None,
                   empty_log=False, input_folder_iterator=None,
                   print_console=True)

    flag_vec = pool.imap(func, (dataFrame['testo accordo'][i] for i in range(lengthData)), chunksize)
    dataFrame['flags_conversion'] = pd.Series(flag_vec)
    dataFrame.to_excel("{0}logs/{1}.xlsx".format(outTextFolder, nameOut))
    pool.close()
    pool.join()
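
For diagnosing where a run like the one above stalls, one option is to pull results from the imap iterator one at a time with a timeout instead of handing the iterator straight to pd.Series. This is only a minimal sketch, assuming Python 3; collect_with_timeout is a hypothetical helper name, not part of the original code.

```python
import multiprocessing as mp


def collect_with_timeout(result_iter, n_items, timeout):
    """Pull n_items results from an imap iterator, waiting at most
    `timeout` seconds for each one.

    Returns (results, stalled_index). stalled_index is None when every
    result arrived, otherwise the index of the first item that did not
    arrive in time (collection stops there).
    """
    results = []
    for i in range(n_items):
        try:
            # The iterator returned by Pool.imap accepts an optional timeout
            # and raises multiprocessing.TimeoutError when it expires.
            results.append(result_iter.next(timeout=timeout))
        except mp.TimeoutError:
            return results, i
    return results, None
```

With the program above this would be called as collect_with_timeout(flag_vec, lengthData, timeout=...), followed by pool.terminate() if an index is returned. Note that with a chunksize greater than 1 results come back a whole chunk at a time, so the timeout has to be long enough for a full chunk.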

Just for reference, the function wrapped with partial (pdf_to_txt) takes non-OCR PDF files, splits them into one image per page, and runs OCR on each image using pytesseract.

I am running the code on the following machine:

This is a physical machine (PowerEdge R930) running RedHat 7.7 (Linux 3.10.0).

Processor:  Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz (x144)
Memory:     1.48 TiB
Swap:       7.81 GiB
Uptime:     21 days

Perhaps I should lower the chunk size? It is really unclear to me. I have noticed that the code seemed to work better when fewer workers were available on the server...

After a lot of pain, I discovered that the problem was with pdftoppm (that is, with using pdf2image, which calls it). It appears that pdftoppm sometimes gets stuck without raising any exception.
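
If you have to stay with pdftoppm, one way to keep a hung conversion from blocking a worker forever is to call the binary yourself with a timeout. The following is a rough sketch rather than what I actually ran: it assumes Python 3 (subprocess.run with timeout is not available in 2.7) and that pdftoppm is on the PATH; run_with_timeout and pdftoppm_cmd are hypothetical helper names.

```python
import subprocess


def pdftoppm_cmd(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command line that renders every page to PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of strings) and report how it ended.

    Returns "ok" for exit code 0, "failed" for a non-zero exit code, and
    "timeout" if the process was still running after timeout_s seconds
    (subprocess.run kills the child in that case).
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"
```

A worker could then call run_with_timeout(pdftoppm_cmd(path, prefix), 120) and record the returned status as its conversion flag, so a stuck PDF shows up in the log instead of stalling the pool. The 120 seconds is an arbitrary placeholder.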

If anyone ever runs into this problem, I warmly recommend switching to PyMuPDF to extract images from PDFs. In my experience it is faster and more stable!
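
For anyone who wants a starting point, here is a minimal sketch of rendering each page to a PNG with PyMuPDF, which is roughly the job pdftoppm was doing. It assumes a reasonably recent PyMuPDF where the methods are spelled get_pixmap and save (older releases used camelCase names); pdf_to_page_images, zoom_for_dpi and the output file naming are hypothetical, not taken from my actual code.

```python
import os

PDF_UNITS_PER_INCH = 72.0  # PDF user space is 72 units per inch by default


def zoom_for_dpi(dpi):
    """Scale factor that makes PyMuPDF render at the requested DPI."""
    return dpi / PDF_UNITS_PER_INCH


def pdf_to_page_images(pdf_path, out_folder, dpi=300):
    """Render every page of pdf_path to a PNG in out_folder.

    Returns the list of image paths, in page order.
    """
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    paths = []
    doc = fitz.open(pdf_path)
    try:
        for index, page in enumerate(doc):
            # A Matrix(zoom, zoom) scales the default 72 DPI rendering.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            name = "{0}_page{1:04d}.png".format(stem, index + 1)
            out_path = os.path.join(out_folder, name)
            pix.save(out_path)
            paths.append(out_path)
    finally:
        doc.close()
    return paths
```

The resulting PNG files can be passed to pytesseract in the same way as the images produced by pdf2image. Because the rendering happens in-process, there is no external pdftoppm process that can hang.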

