
Maximum number of parallel processes on a simple CPU/GPU

I am trying to run a particle filter with 3000 independent particles. More specifically, I would like to run 3000 (simple) computations in parallel at the same time, so that the computation time remains short.

This task is designed for experimental applications on laboratory equipment, so it has to be run on a local laptop. I cannot rely on a remote cluster of computers, and the computers that will be used are unlikely to have fancy Nvidia graphics cards. For instance, the current computer I'm working with has an Intel Core i7-8650U CPU and an Intel UHD Graphics 620 GPU.

Using mp.cpu_count() from the multiprocessing Python library tells me that I have 8 processors, which is far too few for my problem (I need to run several thousand processes in parallel). I therefore looked towards GPU-based solutions, and especially at PyOpenCL. The Intel UHD Graphics 620 GPU is supposed to have only 24 processors; does that mean I can only use it to run 24 processes in parallel at the same time?

More generally, is my problem (running 3000 processes in parallel on a simple laptop using Python) realistic, and if so, which software solution would you recommend?

EDIT

Here is my pseudo code. At each time step i, I call the function posterior_update. This function calls the function approx_likelihood 3000 times, independently (once for each particle), and that function seems hardly vectorizable. Ideally, I would like these 3000 calls to take place independently and in parallel.

import numpy as np
import scipy.stats
from collections import Counter
import random
import matplotlib.pyplot as plt
import os
import time

# User's inputs ##############################################################

# Numbers of particles
M_out           = 3000

# Defines a bunch of functions ###############################################

def approx_likelihood(i,j,theta_bar,N_range,q_range,sigma_range,e,xi,M_in):

    # Double sum over the inner indices (kk, nn) for outer particle j
    return sum(scipy.stats.norm.pdf(e[i], loc=q_range[theta_bar[j,2]]*kk, scale=sigma_range[theta_bar[j,3]])
               * xi[nn,kk]/M_in
               for kk in range(int(N_range[theta_bar[j,0]]+1))
               for nn in range(int(N_range[theta_bar[j,0]]+1)))
    
def posterior_update(i,T,e,M_out,M_in,theta,N_range,p_range,q_range,sigma_range,tau_range,X,delta_t,ML):
         
    theta_bar = np.zeros([M_out,5], dtype=int)
    x_bar = np.zeros([M_out,M_in,2], dtype=int)
    u = np.zeros(M_out)
    x_tilde = np.zeros([M_out,M_in,2], dtype=int)    
    w = np.zeros(M_out)
    
    # Loop over the outer particles 
    for j in range(M_out):
                    
        # Computes the approximate likelihood u
        # (xi and the resample() used below are not shown in this excerpt)
        u[j] = approx_likelihood(i,j,theta_bar,N_range,q_range,sigma_range,e,xi,M_in)
    
    ML[i,:] = theta_bar[np.argmax(u),:]        
    # Compute the normalized weights w
    w = u/sum(u)
    # Resample
    X[i,:,:,:],theta[i,:,:] = resample(M_out,w,x_tilde,theta_bar)  
       
    return X, theta, ML

# Loop over time #############################################################
    
for i in range(T):
    
    print('Progress {0}%'.format(round((i/T)*100,1)))
        
    X, theta, ML = posterior_update(i,T,e,M_out,M_in,theta,N_range,p_range,q_range,sigma_range,tau_range,X,delta_t,ML)

These are some ideas, not an answer to your question:

  • Your main concern, how to determine the number of parallel processes you can run, is not so simple. Basically, you can think of your computer as running as many processes in parallel as it has CPU cores. But this ultimately depends on the operating system, the current workload of your computer, etc. Besides, you can send your data to your processes in chunks, not necessarily one item at a time. Or you can partition your data across the processes you have, e.g. 6 processes with 500 items each = 3000 items (this is what the chunksize argument in the example at the end of this answer does). The optimum combination will require some trial and error.

  • The GPU, on the other hand, has an enormous number of workers available. If you have the appropriate drivers and OpenCL installed, issue the command clinfo in your terminal to get an idea of the capabilities of your hardware (a small PyOpenCL query that prints the same information is sketched after the multiprocessing example at the end of this answer).

  • One problem I see with using the GPU with your code is that you need to pass the instructions to your device in the C language. Your approx_likelihood function contains code dependent on libraries, which would be very difficult to replicate in C.

  • However, if you estimate that you are using these libraries to do something that you could code in C, give it a try. You could also consider using Numba (a rough sketch follows the multiprocessing example below).

  • I would start by using Python's multiprocessing. Something along these lines:

import multiprocessing as mp

# f must be a module-level function so that the pool workers can pickle it;
# it picks up i, theta_bar, etc. from the surrounding (global) scope.
def f(j):
    return approx_likelihood(i, j, theta_bar, N_range, q_range, sigma_range, e, xi, M_in)

with mp.Pool() as pool:                           # one worker per CPU core by default
    u = pool.map(f, range(M_out), chunksize=50)   # work is handed to the workers in chunks of 50
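
To see what the Intel GPU actually offers, the same information that clinfo reports can be queried from Python. Below is a minimal sketch that only assumes the pyopencl package is installed; the attributes used are standard PyOpenCL device properties. Keep in mind that the 24 "processors" of the UHD Graphics 620 are compute units, and each compute unit executes many work-items, so a kernel can be launched over far more than 24 items at once.

import pyopencl as cl

# Print, for every OpenCL platform/device on this machine, the figures that
# matter when sizing a kernel launch.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print("Platform:            ", platform.name)
        print("Device:              ", device.name)
        print("  Compute units:     ", device.max_compute_units)
        print("  Max work-group size:", device.max_work_group_size)
        print("  Global memory (MB): ", device.global_mem_size // (1024 ** 2))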
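
On the Numba suggestion: scipy.stats is not supported inside Numba's nopython mode, but the normal density is simple enough to write by hand, after which the loop over the 3000 outer particles can be parallelized with prange across all CPU cores. This is only a rough sketch, under the assumption that the per-particle loc and scale values have been precomputed into arrays; norm_pdf, all_likelihoods, locs and scales are illustrative names, not part of the original code.

import math
import numpy as np
from numba import njit, prange

@njit
def norm_pdf(x, loc, scale):
    # Hand-written normal density, since scipy.stats is unavailable in nopython mode
    z = (x - loc) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

@njit(parallel=True)
def all_likelihoods(e_i, locs, scales, xi, M_in):
    # One likelihood per outer particle; the prange loop is spread over all CPU cores
    M_out = locs.shape[0]
    u = np.zeros(M_out)
    for j in prange(M_out):
        acc = 0.0
        for kk in range(xi.shape[1]):
            for nn in range(xi.shape[0]):
                acc += norm_pdf(e_i, locs[j] * kk, scales[j]) * xi[nn, kk] / M_in
        u[j] = acc
    return u

The first call pays the compilation cost; the calls in the remaining time steps reuse the compiled machine code.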
