Forcing or explicitly rebalancing data with dask.distributed

Question

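I have a Dask-MPI cluster with 4 workers, and a 3D grid dataset loaded into a Dask array and split into 4 chunks. My application requires that I run exactly one task per worker, preferably with one chunk per task. The trouble I am having is getting the chunks distributed across the cluster in a reliable, reproducible way. Specifically, if I run array.map_blocks(foo), foo runs on the same worker for every chunk.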

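Client.rebalance() seems like it should do what I want, but it still leaves all or most of the chunks on the same worker. As a test, I tried rechunking the data into 128 chunks and running again, which caused 7 or 8 chunks to move to a different worker. That hints that Dask is using a heuristic to decide when to move chunks automatically, but it gives me no way to force an even distribution of chunks.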

Here is a simple test script I have been trying (it connects to a cluster with 4 workers/ranks).

#connect to the Dask scheduler
from dask.distributed import Client, Sub, Pub, fire_and_forget
client = Client(scheduler_file='../scheduler.json', set_as_default=True)


#load data into a numpy array
import numpy as np
npvol = np.array(np.fromfile('/home/nleaf/data/RegGrid/Vorts_t50_128x128x128_f32.raw', dtype=np.float32))
npvol = npvol.reshape([128,128,128])

#convert numpy array to a dask array
import dask.array as da
N = 4  # number of chunks, one per worker
ar = da.from_array(npvol).rechunk([npvol.shape[0], npvol.shape[1], npvol.shape[2]//N])


def test(ar):
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
    return np.array([rank], ndmin=3, dtype=int)

client.rebalance()
print(client.persist(ar.map_blocks(test, chunks=(1,1,1))).compute())

Over dozens of test runs, this code returned a chunk on rank 3 once; otherwise all chunks were on rank 0.

Answer 1

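Because your total dataset is not large, the initial call to from_array creates only one chunk, so it is assigned to a single worker (you could have specified otherwise with chunks=). The rechunk that follows prefers not to move data if it can avoid it.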
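The single-chunk behaviour can be checked locally, without a cluster. The following is a minimal sketch, assuming NumPy and Dask are installed with the default chunk-size configuration; a zero-filled array stands in for the raw file from the question.

```python
import numpy as np
import dask.array as da

# Stand-in for the 128x128x128 float32 volume read from the raw file.
npvol = np.zeros((128, 128, 128), dtype=np.float32)

# With no chunks= argument, an array this small (8 MiB) becomes one chunk
# under the default configuration.
single = da.from_array(npvol)
print(single.numblocks)

# Passing chunks= up front splits it into 4 blocks along the last axis.
split = da.from_array(npvol, chunks=(128, 128, 32))
print(split.numblocks)
```

Note that this only changes how the graph is chunked. The NumPy data still originates on the client, which is why loading each chunk on the workers is suggested next.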

Assuming every worker has access to your file, you would be better off loading the chunks on the workers in the first place.

You would need a function like

def get_chunk(fn, offset, count, shape, dtype):
    with open(fn, 'rb') as f:
        f.seek(offset)
        return np.fromfile(f, dtype=dtype, count=count).reshape(shape)

and pass a different offset for each chunk.

import dask
import dask.array as da
parts = [da.from_delayed(dask.delayed(get_chunk)(fn, offset, count, shape, dtype), shape, dtype) for offset in [...]]
arr = da.concatenate(parts)

This is very similar to what the npy source in Intake does automatically: https://github.com/intake/intake/blob/master/intake/source/npy.py#L11
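The offsets for each chunk are left open above. One way to fill them in, sketched here under the assumption that the raw file is a headerless C-ordered array (as implied by the np.fromfile plus reshape in the question), is to split along axis 0 so that each chunk is one contiguous byte range. The helper name slab_specs is hypothetical and not part of Dask or NumPy.

```python
def slab_specs(shape, itemsize, n_chunks):
    """Split a C-ordered raw array into n_chunks contiguous slabs along axis 0.

    Returns one (byte_offset, item_count, slab_shape) tuple per slab.
    """
    if shape[0] % n_chunks:
        raise ValueError("axis 0 must divide evenly into n_chunks")
    rows = shape[0] // n_chunks
    slab_shape = (rows,) + tuple(shape[1:])
    # Number of array items in one slab.
    count = 1
    for dim in slab_shape:
        count *= dim
    # Slab i starts i * count items into the file; convert items to bytes.
    return [(i * count * itemsize, count, slab_shape) for i in range(n_chunks)]


# 128x128x128 float32 volume (4 bytes per item) split into 4 slabs.
specs = slab_specs((128, 128, 128), itemsize=4, n_chunks=4)
for offset, count, slab_shape in specs:
    print(offset, count, slab_shape)
```

Each tuple could then be passed to get_chunk, and the resulting parts joined along axis 0, which is the default for da.concatenate. Splitting along the last axis, as in the question's rechunk, would not correspond to contiguous byte ranges in such a file.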

Answer 1: accepted, 2019-07-17 21:40:47