
Parallelizing Millions of Iterations of Numpy Functions

I'm running the below function compare on 20 million different combinations of the parameter sample, where sample is a 1D array composed of 100 1s and 0s.

compare takes a couple of other arrays along with sample and uses them to perform a few dot products, exponentiate those dot products, and then compare them relative to each other. These other arrays stay the same.

On my laptop, it takes about an hour to run through all 20 million combinations.

I'm looking for ways to make it go quicker. I am open to both improving the below code and using libraries such as dask which take advantage of parallel processing.

Notes:

  • The comments on each line in compare show a very rough estimate of how long that line takes on my machine. They're the result of a %%timeit on the line on its own outside the function.
  • The inputs to compare are not actually randomly generated in my use case
import numpy as np

def compare(sample, competition_exp_dot, preferences): # 140 µs
    sample_exp_dot = np.exp(preferences @ sample) #30.3 µs
    all_competitors = np.append(sample_exp_dot.reshape(-1, 1), competition_exp_dot, 1) # 5 µs
    all_results = all_competitors/all_competitors.sum(axis=1)[:,None] #27.4 µs

    return all_results.mean(axis=0) #20.6 µs
#these inputs to the above function stay the same
preferences = np.random.random((1000,100))
competition = np.array([np.random.randint(0,2,100), np.random.randint(0,2,100)])
competition_exp_dot = np.exp(preferences @ competition.T)

# the function is run with 20,000,000 variations of sample
population = np.random.randint(0,2,(20000000,100))
result = [compare(sample, competition_exp_dot, preferences) for sample in population]

There are many ways to accelerate simple array programming code like this:

  1. You can use a tool like Numba, which will fuse some of the work and also provide some options for single-node multi-core parallelism
  2. You can use a tool like Dask to scale this onto multiple cores of a single machine (also possible with Numba) or across a cluster; a rough sketch follows the Torch snippet below
  3. You can use one of the GPU array libraries, like Torch, TensorFlow, CuPy, or Jax, to run this on a GPU

You can also do any mixture of the above.

You can consider the following:

import torch
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
b = torch.from_numpy(x)  # wrap the numpy array as a tensor without copying
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")  # fall back to CPU so device is always defined
b = b.to(device)  # move the tensor to the GPU (or keep it on the CPU)
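
If you go the Dask route, one possible pattern is to split population into chunks and evaluate each chunk as a delayed task across local processes. This is only a rough sketch under a few assumptions: it reuses the question's compare, population, competition_exp_dot, and preferences objects, and the chunk count of 200 and the "processes" scheduler are untested guesses, not tuned values:

import numpy as np
import dask

# split the 20M samples into chunks so each chunk becomes one task
chunks = np.array_split(population, 200)  # roughly 100,000 samples per chunk

@dask.delayed
def compare_chunk(chunk, competition_exp_dot, preferences):
    # apply compare to every row of this chunk and stack the per-sample results
    return np.stack([compare(s, competition_exp_dot, preferences) for s in chunk])

tasks = [compare_chunk(c, competition_exp_dot, preferences) for c in chunks]
partial_results = dask.compute(*tasks, scheduler="processes")  # one worker per core
result = np.concatenate(partial_results)

The same task graph can also be pointed at a dask.distributed cluster later without changing the per-chunk work.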

I implemented numba as suggested by MRocklin. The result is about 4 times faster on my machine.

Modified Numba Version

import numba as nb
import numpy as np

@nb.jit
def nb_compare(sample, competition_exp_dot, preferences):
    sample_exp_dot = np.exp(preferences @ sample)
    all_competitors = np.append(sample_exp_dot.reshape(-1, 1), competition_exp_dot, 1)
    all_results = (all_competitors.T/all_competitors.sum(axis=1)).T

    return np_mean(all_results, 0) # see source for np_mean in notes below

Comparable Numpy Version

import numpy as np
def np_compare(sample, competition_exp_dot, preferences):
    sample_exp_dot = np.exp(preferences @ sample)
    all_competitors = np.append(sample_exp_dot.reshape(-1, 1), competition_exp_dot, 1)
    all_results = (all_competitors.T/all_competitors.sum(axis=1)).T

    return all_results.mean(axis=0)

Timing Comparison

Setup:

preferences = np.random.random((1000,100)).astype(np.float32)
competition = np.array([np.random.randint(0,2,100), np.random.randint(0,2,100)]).astype(np.float32)
competition_exp_dot = np.exp(preferences @ competition.T)
sample = np.random.randint(0,2,100)
%timeit np_compare(sample, competition_exp_dot, preferences)
"210 µs ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)"


%timeit -n 10000 nb_compare(population[0], competition_exp_dot, preferences)
"52.4 µs ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)"

Notes

Numba doesn't support optional parameters like axis for np.mean and returns a TypingError. In my numba code, I call the version of np_mean below instead.

Credit to joelrich

import numba as nb, numpy as np

# fix to use np.mean along axis=0 (numba doesn't support optional arguments for np.mean)
# credit to: joelrich https://github.com/numba/numba/issues/1269#issuecomment-472574352
@nb.njit
def np_apply_along_axis(func1d, axis, arr):
  assert arr.ndim == 2
  assert axis in [0, 1]
  if axis == 0:
    result = np.empty(arr.shape[1])
    for i in range(len(result)):
      result[i] = func1d(arr[:, i])
  else:
    result = np.empty(arr.shape[0])
    for i in range(len(result)):
      result[i] = func1d(arr[i, :])
  return result

@nb.njit
def np_mean(array, axis):
  return np_apply_along_axis(np.mean, axis, array)
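
Numba's single-node multi-core option (point 1 in the answer above) could also be tried by moving the outer loop over samples into a jitted function and letting nb.prange spread it across cores. The sketch below is untested and is not the code I benchmarked: compare_all is a hypothetical name, it rebuilds the competitor matrix with slicing because np.append's axis argument may not compile in nopython mode, and it assumes population has been cast to the same float dtype as preferences before the call.

import numba as nb
import numpy as np

# rough sketch of a parallel driver: each prange iteration handles one sample,
# redoing compare's math with constructs that nopython mode should accept
@nb.njit(parallel=True)
def compare_all(population, competition_exp_dot, preferences):
    n_samples = population.shape[0]
    n_agents, n_competitors = competition_exp_dot.shape
    out = np.empty((n_samples, n_competitors + 1))
    for i in nb.prange(n_samples):
        sample_exp_dot = np.exp(preferences @ population[i])
        # assemble the competitor matrix by slicing instead of np.append
        all_competitors = np.empty((n_agents, n_competitors + 1))
        all_competitors[:, 0] = sample_exp_dot
        all_competitors[:, 1:] = competition_exp_dot
        totals = all_competitors.sum(axis=1)
        all_results = all_competitors / totals.reshape(-1, 1)
        for j in range(n_competitors + 1):
            out[i, j] = all_results[:, j].mean()
    return out

# result = compare_all(population.astype(preferences.dtype), competition_exp_dot, preferences)

Whether this actually beats looping over nb_compare from Python depends on the machine and on everything staying in nopython mode, so it is worth profiling before committing to it.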
