
How do I speed up an optimized CPU-bound process that runs within a parallelized program in Python?

Original post: 2022-01-20 14:13:13 | Tags: python, performance, optimization, parallel-processing, python-multiprocessing

A Python program of mine uses the multiprocessing module to parallelize iterations of a search problem. Besides doing other things, each iteration loops over a CPU-expensive process that is already optimized in Cython. Because this process is called multiple times within the loop, it significantly slows down the total runtime.

What is the recommended way to achieve a speed-up in this case? As the expensive process can't be CPU-optimized any further, I've considered parallelizing the loop. However, since the loop lives in a program that is already parallelized (by multiprocessing), I don't think this would be possible on the same machine.

My research on this has failed to find any best practices or any sort of direction.

1 Answer

As a quick way to see whether it might be possible to optimize your existing code, you might check your machine's CPU usage while the code is running.
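A rough way to do that check from Python itself, as a minimal sketch using only the standard library. It assumes a Unix-like system, because `os.getloadavg()` is not available on Windows. Load average is only a proxy for CPU usage (on Linux it also counts tasks waiting on I/O), and the function name and the 0.9 threshold are arbitrary choices for illustration, so treat the result as a hint rather than a measurement. Run it in a second terminal while your program is working.

```python
import os


def cores_look_saturated(threshold=0.9):
    """Guess whether the CPU cores are fully busy.

    Returns True or False, or None when the load average is not
    available on this platform (for example on Windows).
    The threshold of 0.9 is an arbitrary value chosen for illustration.
    """
    n_cores = os.cpu_count() or 1
    try:
        load_1min, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None
    # A 1-minute load close to the core count suggests every core is busy.
    return load_1min / n_cores >= threshold


if __name__ == "__main__":
    print("cores:", os.cpu_count())
    print("saturated:", cores_look_saturated())
```

A system monitor such as top or Task Manager gives the same information per core with less guesswork.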

If all your cores are at ~100%, then adding more processes etc. isn't likely to improve things.

In that case you could:

1 - Try further algorithm optimisation (though the best bang for the buck is to profile your code first to see where it's slow). If you've already been using Cython, though, this will likely have limited returns.

2 - Try a faster machine and/or one with more cores.
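For the profiling suggested in option 1, here is a minimal sketch with the standard library's `cProfile` and `pstats`. The functions `expensive_step` and `one_iteration` are hypothetical stand-ins for your Cython routine and one iteration of your search; substitute your own. Note that cProfile only sees Python-level calls, so the inside of a compiled Cython function will show up as a single entry unless you build it with profiling enabled.

```python
import cProfile
import io
import pstats


def expensive_step(n):
    # Hypothetical stand-in for the CPU-expensive routine.
    return sum(i * i for i in range(n))


def one_iteration(calls=20, n=10_000):
    # Hypothetical stand-in for one iteration of the search,
    # which calls the expensive routine several times.
    return [expensive_step(n) for _ in range(calls)]


def profile_iteration():
    """Profile a single iteration and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    one_iteration()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the costliest call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_iteration())
```

Profiling one iteration in a single process, outside the multiprocessing pool, is usually easier to read than profiling the whole parallel run.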

Another approach, however (one that I've used), is to develop a serverless design instead, and run the CPU-intensive, parallel parts of your algorithm on any of the cloud vendors' serverless offerings.

I've personally used AWS Lambda, where we parallelized our code to run with 200+ simultaneous Lambda processes, which is roughly equivalent to a single machine with 200+ cores.

For us, this essentially resulted in a 50-100 times increase in performance (measured as the reduction in total processing time) compared to running on an 8-core server.

You do have to do more work to implement a serverless deployment model, and then write wrapper code to manage everything, which isn't trivial. However, the ability to scale horizontally essentially without limit may make sense for you.
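To give an idea of what that wrapper code looks like, here is a minimal sketch of the fan-out pattern. It is not our actual implementation: `invoke_remote` is a hypothetical placeholder that computes locally, where a real version would send the chunk to a cloud function and wait for the response. Threads are used on the client side because each worker would spend its time waiting on the network rather than computing. The names `split` and `fan_out` and the default of 200 workers are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def invoke_remote(chunk):
    """Hypothetical stand-in for one serverless invocation.

    A real version would send `chunk` to a cloud function and return
    its response; here the work is simply done locally.
    """
    return sum(x * x for x in chunk)


def split(items, n_chunks):
    """Split `items` into at most `n_chunks` contiguous chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def fan_out(items, n_workers=200):
    """Send each chunk to its own invocation and collect results in order."""
    chunks = split(items, n_workers)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=min(n_workers, len(chunks))) as pool:
        # pool.map preserves the order of the input chunks.
        return list(pool.map(invoke_remote, chunks))


if __name__ == "__main__":
    partial_sums = fan_out(list(range(1000)), n_workers=8)
    print(len(partial_sums), sum(partial_sums))
```

A production version also needs retries, timeouts, and handling of the provider's concurrency and payload limits, which is where most of the non-trivial work goes.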