Multiprocessing Pool not using all cores

Question

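I have some code that needs to be parallelized, and for this I am using Python's multiprocessing module, specifically the Pool class. The relevant part of the code where the parallelization happens looks like this: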

import multiprocessing as mp
import numpy as np
from numba import jit

@jit( nopython=True )
def numba_product( a, b ):
        
    a_len = len(a)
    b_len = len(b)
    n     = len( a[0,:] )    
    c_res   = np.empty( (  a_len*b_len, n ), dtype=np.complex128 ) 
    c_count = 0  
    for i in range(a_len):
        for j in range( b_len ):            
            c_res[ c_count , : ] = np.multiply( a[i,:], b[ j, : ]  )          
            c_count += 1
            
    return c_res

def do_some_computations( shared_object, index ):

    d  = shared_object.get_dictionary_1()
            
    some_numpy_array_1 = shared_object.get_numpy_array_1( index ) #this gets a numpy array from 
                                                                  # shared object attribute, i.e.,
                                                                  # from shared_object class 
                                                                  # definition, the method returns
                                                                  # "self.the_array" attribute that
                                                                  # belongs to shared object, see 
                                                                  # dummy version of class definition 
                                                                  # below            
    mask_array_1       = shared_object.get_mask_array_1() # this gets a mask for the specified array        
    filtered_array_1   = some_numpy_array_1[ mask_array_1] #note that this defines a local new array, 
                                                           # but shouldn't modify some_numpy_array_1 
                                                           # ( I believe ) 
    
    s_keys             = shared_object.get_keys_for_index( index ) #gets the keys corresponding to 
                                                                   #that index to create a new array        
    
    v   = np.array( [ d[ x ] for x in  s_keys  ] )

    final_result = numba_product( filtered_array_1, v )
    return final_result

def pool_worker_function( index, args ):    
    shared_object = args[0] 
    result = do_some_computations( shared_object, index ) 
    return result    
        
    
def parallel_exec( shared_object, N ):
    number_processors      = mp.cpu_count()
    number_used_processors = number_processors - 1

    #try pool map method with a READ-ONLY object that is "shared_object".
    # This object contains two big dictionaries from which values are retrieved, 
    # and various NumPy arrays of considerable dimension           
    from itertools import repeat      

    pool    = mp.Pool( processes = number_used_processors )       
     
    a = list( np.linspace( 0, N, N ) )          
    
    args = ( shared_object, )     
    number_tasks = number_used_processors  
      
    n_chunck = int( ( len(a) - 1 )/number_tasks )
             
    result = pool.starmap( pool_worker_function, zip( a, repeat( args ) ), chunksize = n_chunck)              
    pool.close()        
    pool.join()           
    return result

The problem:

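The problem I am having is that when I run this on a Unix system with 32 cores, I only see a few cores doing work in parallel. As far as I understand, Unix uses os.fork() automatically with copy-on-write semantics, which means that if my shared_object is not modified during the calls, the parallelization should happen without extra memory consumption, and every core should carry out its own tasks. Here is a snapshot of what I see when the program reaches the parallelized part (screenshot not included).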

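This confuses me. I made sure that the total number of cores reported by mp.cpu_count() is 32. Another thing I observed is that the amount of available memory keeps decreasing throughout the parallelized section, from ~84 GiB to ~59 GiB. This probably hints that copies of the "shared_object" class are being created with each process, therefore making a copy of all the dictionaries and NumPy arrays that the class contains. I want to get around this issue; I want to use all the cores for the parallelization, but honestly I have no idea what is going on here.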
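
One plausible contributor to the memory growth, offered here as an assumption rather than a diagnosis: every argument handed to pool.starmap is pickled in the parent and unpickled in a worker, so an object that rides along with the task arguments is serialized with them. The sketch below uses a hypothetical Container class, with plain bytes standing in for the real dictionaries and arrays, and compares the pickled size of a task that carries the object with one that carries only the index.

```python
import pickle


class Container:
    """Hypothetical stand-in for the large read-only data holder."""

    def __init__(self, n_bytes):
        # Plain bytes play the role of the big dictionaries and arrays.
        self.payload = bytes(n_bytes)


def task_size(task):
    """Number of bytes pickle produces for one task argument."""
    return len(pickle.dumps(task))


big = Container(1_000_000)

# Same shape as one item of zip(a, repeat(args)) in the question.
size_with_object = task_size((0, (big,)))
# What travels per task when only the index is sent.
size_index_only = task_size(0)

print("task carrying the object:", size_with_object, "bytes")
print("task carrying the index :", size_index_only, "bytes")
```

As far as I understand, the pool serializes tasks in the parent process, which would also be consistent with only a few cores looking busy while large payloads are being prepared.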

The code is expected to run on the 32-core Unix machine, but my own laptop runs Windows, and here is a snapshot of what I see on Windows when I launch it (although from what I have read, Windows does not support os.fork(), so I guess the high memory consumption there is no surprise?).
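
Whether fork is available can be checked directly from the multiprocessing module. This is a minimal sketch using only standard-library calls; the variable names are illustrative.

```python
import multiprocessing as mp

# Start methods this platform supports, for example "fork", "spawn", "forkserver".
available = mp.get_all_start_methods()

# Start method that a plain mp.Pool() would use on this platform.
default = mp.get_start_method()

print("available:", available)
print("default  :", default)
print("fork supported:", "fork" in available)
```

On Windows only "spawn" is listed, so each worker starts from a fresh interpreter and nothing is inherited from the parent's memory.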

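As you can see, calls to the operating system (in red) take up a very high share of the CPU usage. The same seems to be true in the snapshot of the Linux case shown above.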

Finally, I want to stress that the class "shared_object" has the following form:

class shared_object():

    def __init__( self ): pass
    
    def store_dictionaries_and_arrays( self, dict_1, dict_2, array_1, array_2, ...  ):
        
        self.dict_1 = dict_1
        self.dict_2 = dict_2
        self.array_1 = array_1
        # same for all other arguments passed
    def get_dictionary_1( self ):
        return self.dict_1
    def get_numpy_array_1( self ):
        return self.array_1

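But with many more attributes, and therefore many more "get" methods. It is a very large data container, so I would expect no copies of it to be made during the parallelization, since the attributes are only read and never modified. What am I missing here? Any help is greatly appreciated, this has been beating me for a long time... Thank you very much!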

Answer 1

Based on your comments, I think you just want to do something like this:

def pool_worker_function(index):
    # pool.map passes a single argument, so only the index arrives here
    return do_some_computations(_shared_hack, index)

def parallel_exec(shared_object, N):
    global _shared_hack
    _shared_hack = shared_object

    # it'll use ncores processes by default
    with mp.Pool() as pool:
        return pool.map(pool_worker_function, range(N))

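i.e. save shared_object somewhere global and let the child processes pick it up when they need it. This relies on the fork start method, where the children inherit the parent's globals.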
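
The global-variable approach depends on worker processes inheriting the parent's memory, which is how the fork start method behaves. A related variant, sketched here as an assumption about what might suit the Windows laptop rather than as part of the answer, hands the object to each worker once through the initializer and initargs parameters of Pool. Container, _init_worker and the start_method parameter are hypothetical names introduced for illustration.

```python
import multiprocessing as mp

_shared = None  # set once in each worker process by _init_worker


class Container:
    """Hypothetical stand-in for the large read-only shared_object."""

    def __init__(self, values):
        self.values = values

    def get_value(self, index):
        return self.values[index]


def _init_worker(shared_object):
    # Runs once per worker process, so the object is handed over
    # once per worker instead of once per task.
    global _shared
    _shared = shared_object


def pool_worker_function(index):
    # Only the small integer index travels with each task.
    return _shared.get_value(index) * 2


def parallel_exec(shared_object, n, processes=None, start_method=None):
    # start_method=None selects the platform default ("spawn" on Windows).
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=processes,
                  initializer=_init_worker,
                  initargs=(shared_object,)) as pool:
        return pool.map(pool_worker_function, range(n))
```

Under the spawn start method the object is still pickled once per worker, so this bounds the copying rather than eliminating it, and the call to parallel_exec then belongs under an if __name__ == "__main__": guard.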

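You were doing a lot of odd things that I have stripped out, including setting up a chuncks list that was never used anywhere. I also switched to using range, because you were using list(np.linspace(0, N, N)) to set up the indices, which seems broken. For example, N=4 would give you [0, 1.333, 2.667, 4], which doesn't look like something I'd want to index an array with.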
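
The difference between the two index sequences can be seen in a short sketch; the names float_indices, int_indices and data are illustrative only.

```python
import numpy as np

N = 4

# N evenly spaced floats from 0 to N, endpoint included.
float_indices = list(np.linspace(0, N, N))
# The integers 0 .. N-1, which is what array indexing expects.
int_indices = list(range(N))

print(float_indices)
print(int_indices)

data = np.arange(10)
try:
    data[float_indices[1]]  # a float is not a valid array index
except IndexError as exc:
    print("float index rejected:", exc)
```

Note that the last linspace value equals N itself, which would be one past the end of an array of length N even if it were an integer.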

Answer 1: score 0, accepted, 2020-07-02 12:56:36
