Can I speed up this aerodynamics calculation with Numba, vectorization, or multiprocessing?

Question

Problem:

I am trying to improve the speed of an aerodynamics function in Python.

Functions:

import numpy as np
from numba import njit

def calculate_velocity_induced_by_line_vortices(
    points, origins, terminations, strengths, collapse=True
):

    # Expand the dimensionality of the points input. It is now of shape (N x 1 x 3).
    # This will allow NumPy to broadcast the upcoming subtractions.
    points = np.expand_dims(points, axis=1)
    
    # Define the vectors from the vortex to the points. r_1 and r_2 now both are of
    # shape (N x M x 3). Each row/column pair holds the vector associated with each
    # point/vortex pair.
    r_1 = points - origins
    r_2 = points - terminations
    
    r_0 = r_1 - r_2
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = (
        r_1_cross_r_2[:, :, 0] ** 2
        + r_1_cross_r_2[:, :, 1] ** 2
        + r_1_cross_r_2[:, :, 2] ** 2
    )
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)
    
    # Define the radius of the line vortices. This is used to get rid of any
    # singularities.
    radius = 3.0e-16
    
    # Set the lengths and the absolute magnitudes to zero, at the places where the
    # lengths and absolute magnitudes are less than the vortex radius.
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
    
    # Calculate the vector dot products.
    r_0_dot_r_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
    r_0_dot_r_2 = np.einsum("ijk,ijk->ij", r_0, r_2)
    
    # Calculate k and then the induced velocity, ignoring any divide-by-zero or nan
    # errors. k is of shape (N x M)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = (
            strengths
            / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
            * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
        )
    
        # Set the shape of k to be (N x M x 1) to support numpy broadcasting in the
        # subsequent multiplication.
        k = np.expand_dims(k, axis=2)
    
        induced_velocities = k * r_1_cross_r_2
    
    # Set the values of the induced velocity to zero where there are singularities.
    induced_velocities[np.isinf(induced_velocities)] = 0
    induced_velocities[np.isnan(induced_velocities)] = 0

    if collapse:
        induced_velocities = np.sum(induced_velocities, axis=1)

    return induced_velocities


@njit    
def nb_2d_explicit_norm(vectors):
    return np.sqrt(
        (vectors[:, :, 0]) ** 2 + (vectors[:, :, 1]) ** 2 + (vectors[:, :, 2]) ** 2
    )


@njit
def nb_2d_explicit_cross(a, b):
    e = np.zeros_like(a)
    e[:, :, 0] = a[:, :, 1] * b[:, :, 2] - a[:, :, 2] * b[:, :, 1]
    e[:, :, 1] = a[:, :, 2] * b[:, :, 0] - a[:, :, 0] * b[:, :, 2]
    e[:, :, 2] = a[:, :, 0] * b[:, :, 1] - a[:, :, 1] * b[:, :, 0]
    return e

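Context:

This function is used by Ptera Software, an open-source solver for flapping-wing aerodynamics. According to the profile output, it is by far the largest contributor to Ptera Software's run time.

Currently, Ptera Software takes just over 3 minutes to run a typical case, and my goal is to get this below 1 minute.

The function takes in a group of points, origins, terminations, and strengths. At every point, it finds the velocity induced by the line vortices, which are characterized by the groups of origins, terminations, and strengths. If collapse is true, the output is the cumulative velocity induced at each point by all the vortices. If it is false, the function outputs each vortex's contribution to the velocity at each point.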
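To make the input and output shapes concrete, here is a minimal, deliberately slow per-pair sketch of the same formula. The helper name `reference_induced_velocities` is hypothetical (it is not part of Ptera Software), and it assumes the same 3.0e-16 cutoff as the function above. It is meant only as a check for faster implementations, not as a replacement.

```python
import numpy as np


def reference_induced_velocities(points, origins, terminations, strengths,
                                 collapse=True, radius=3.0e-16):
    """Slow per-pair reference (hypothetical helper, for validation only)."""
    n, m = len(points), len(origins)
    velocities = np.zeros((n, m, 3))
    for i in range(n):
        for j in range(m):
            r_1 = points[i] - origins[j]
            r_2 = points[i] - terminations[j]
            r_0 = r_1 - r_2
            cross = np.cross(r_1, r_2)
            cross_sq = cross @ cross  # squared magnitude of r_1 x r_2
            len_1 = np.linalg.norm(r_1)
            len_2 = np.linalg.norm(r_2)
            # Singular pairs contribute zero velocity, as in the original.
            if cross_sq < radius or len_1 < radius or len_2 < radius:
                continue
            k = strengths[j] / (4.0 * np.pi * cross_sq) * (
                r_0 @ r_1 / len_1 - r_0 @ r_2 / len_2
            )
            velocities[i, j] = k * cross
    # (N x 3) if collapsed, otherwise (N x M x 3).
    return velocities.sum(axis=1) if collapse else velocities


# One vortex of unit strength from (-1, 0, 0) to (1, 0, 0), one point at (0, 1, 0).
points = np.array([[0.0, 1.0, 0.0]])
origins = np.array([[-1.0, 0.0, 0.0]])
terminations = np.array([[1.0, 0.0, 0.0]])
strengths = np.array([1.0])

per_vortex = reference_induced_velocities(
    points, origins, terminations, strengths, collapse=False
)
total = reference_induced_velocities(points, origins, terminations, strengths)
print(per_vortex.shape, total.shape)  # (1, 1, 3) (1, 3)
```

For this single-segment case the result can be checked by hand: the induced velocity points along z with magnitude 1 / (2 * pi * sqrt(2)), which is what the finite-segment Biot-Savart formula gives for a point at unit distance from the midpoint of a segment of length 2.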

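During a typical run, the velocity function is called approximately 2000 times. At first, the calls involve relatively small input arguments (around 200 points, origins, terminations, and strengths). Later calls involve large input arguments (around 400 points and around 6,000 origins, terminations, and strengths). An ideal solution would be fast for all input sizes, but speeding up the large-input calls is more important.

For testing, I recommend running the following script with your own implementation of the function: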

import timeit

import matplotlib.pyplot as plt
import numpy as np

n_repeat = 2
n_execute = 10 ** 3
min_oom = 0
max_oom = 3

times_py = []

for i in range(max_oom - min_oom + 1):
    n_elem = 10 ** i
    n_elem_pretty = np.format_float_scientific(n_elem, 0)
    print("Number of elements: " + n_elem_pretty)

    # Benchmark Python.
    print("\tBenchmarking Python...")
    setup = '''
import numpy as np

these_points = np.random.random((''' + str(n_elem) + ''', 3))
these_origins = np.random.random((''' + str(n_elem) + ''', 3))
these_terminations = np.random.random((''' + str(n_elem) + ''', 3))
these_strengths = np.random.random(''' + str(n_elem) + ''')

def calculate_velocity_induced_by_line_vortices(points, origins, terminations,
                                                strengths, collapse=True):
    pass
    '''
    statement = '''
results_orig = calculate_velocity_induced_by_line_vortices(these_points, these_origins,
                                                           these_terminations,
                                                           these_strengths)
    '''
    
    times = timeit.repeat(repeat=n_repeat, stmt=statement, setup=setup, number=n_execute)
    time_py = min(times)/n_execute
    time_py_pretty = np.format_float_scientific(time_py, 2)
    print("\t\tAverage Time per Loop: " + time_py_pretty + " s")

    # Record the times.
    times_py.append(time_py)

sizes = [10 ** i for i in range(max_oom - min_oom + 1)]

fig, ax = plt.subplots()

ax.plot(sizes, times_py, label='Python')
ax.set_xscale("log")
ax.set_xlabel("Size of List or Array (elements)")
ax.set_ylabel("Average Time per Loop (s)")
ax.set_title(
    "Comparison of Different Optimization Methods\nBest of "
    + str(n_repeat)
    + " Runs, each with "
    + str(n_execute)
    + " Loops"
)
ax.legend()
plt.show()

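Previous attempts:

My previous attempts at speeding up this function involved vectorizing it (which worked well, so I kept those changes) and trying out Numba's JIT compiler. I had mixed results with Numba. When I tried to use Numba on a modified version of the entire velocity function, it was much slower than before. However, I found that Numba significantly sped up the cross-product and norm functions, which I implemented above.

Updates:

Update 1:

Based on Mercury's comment (which has since been deleted), I replaced

points = np.expand_dims(points, axis=1)
r_1 = points - origins
r_2 = points - terminations

with two calls to the following function: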

@njit
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c

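This reduced the run time from 227 seconds to 220 seconds. That is better, but it is still not fast enough.

I have also tried setting the njit fastmath flag to true, and using a Numba function instead of calling np.einsum. Neither increased the speed.

Update 2:

With Jérôme Richard's answer, the run time is now 156 seconds, a decrease of 29%. I am satisfied enough to accept this answer, but feel free to make other suggestions if you think you can improve on their work!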

Answer 1

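First of all, Numba can perform parallel computations, resulting in faster code, if you request it manually, mainly with parallel=True and prange. This is useful for big arrays (but not for small ones).

Moreover, your computation is mainly memory bound. Thus, you should avoid creating big arrays when they are not reused multiple times, or more generally when they can be recomputed on the fly (in a relatively cheap way). This is the case for r_0, for example.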
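As a rough illustration of why the computation is memory bound, the size of one pairwise intermediate array can be estimated from the large-call sizes quoted in the question (about 400 points and about 6,000 vortices), assuming float64 data:

```python
n_points, n_vortices = 400, 6000  # approximate sizes quoted in the question
bytes_per_float64 = 8

# One (N x M x 3) array such as r_1, r_2, r_0, or the cross product.
pairwise_vector_bytes = n_points * n_vortices * 3 * bytes_per_float64
# One (N x M) array such as a length or a dot product.
pairwise_scalar_bytes = n_points * n_vortices * bytes_per_float64

print(pairwise_vector_bytes / 1e6, "MB")  # 57.6 MB
print(pairwise_scalar_bytes / 1e6, "MB")  # 19.2 MB
```

Each such array is far larger than a typical CPU cache, so every extra intermediate costs a full round trip through RAM.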

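Furthermore, the memory access pattern matters: vectorization is more efficient when accesses are contiguous in memory, and the cache and RAM are used more efficiently. Consequently, arr[0, :, :] = 0 should be faster than arr[:, :, 0] = 0. Similarly, arr[:, :, 0] = arr[:, :, 1] = 0 should be slower than arr[:, :, 0:2] = 0, since the former makes two non-contiguous passes over memory while the latter makes a single, more contiguous pass. Sometimes it can be worth transposing your data so that the subsequent calculations are faster.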
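The difference between the two slicing patterns can be seen directly from the strides NumPy reports. This is a small sketch on a default C-ordered float64 array, with a shape chosen only for illustration:

```python
import numpy as np

arr = np.zeros((4, 5, 3))  # C-ordered float64 array
print(arr.strides)  # (120, 24, 8): bytes to step along each axis

first_plane = arr[0, :, :]  # one contiguous block of 5 * 3 values
component = arr[:, :, 0]    # every third value across the whole array

print(first_plane.flags["C_CONTIGUOUS"])  # True
print(component.flags["C_CONTIGUOUS"])    # False
print(component.strides)                  # (120, 24)
```

Writing to `component` touches one value out of every three, so most of each cache line that is loaded goes unused.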

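In addition, NumPy tends to create many temporary arrays, which are costly to allocate. This is a big problem when the input arrays are small. The Numba JIT can avoid this in most cases.

Finally, regarding your computation, it can be a good idea to use a GPU for big arrays (definitely not for small ones). You can take a look at cupy or clpy to do that quite easily.

Here is an optimized implementation that runs on the CPU: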

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in prange(c.shape[0]):
        for j in range(c.shape[1]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c

@njit(parallel=True)
def nb_2d_explicit_norm(vectors):
    res = np.empty((vectors.shape[0], vectors.shape[1]))
    for i in prange(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sqrt(vectors[i, j, 0] ** 2 + vectors[i, j, 1] ** 2 + vectors[i, j, 2] ** 2)
    return res

# NOTE: better memory access pattern
@njit(parallel=True)
def nb_2d_explicit_cross(a, b):
    e = np.empty(a.shape)
    for i in prange(e.shape[0]):
        for j in range(e.shape[1]):
            e[i, j, 0] = a[i, j, 1] * b[i, j, 2] - a[i, j, 2] * b[i, j, 1]
            e[i, j, 1] = a[i, j, 2] * b[i, j, 0] - a[i, j, 0] * b[i, j, 2]
            e[i, j, 2] = a[i, j, 0] * b[i, j, 1] - a[i, j, 1] * b[i, j, 0]
    return e

# NOTE: avoid the slow building of temporary arrays
@njit(parallel=True)
def cross_absolute_magnitude(cross):
    return cross[:, :, 0] ** 2 + cross[:, :, 1] ** 2 + cross[:, :, 2] ** 2

# NOTE: again avoid building temporary arrays and making multiple passes over memory
# Warning: this works in-place
@njit(parallel=True)
def discard_singularities(arr):
    for i in prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(3):
                if np.isinf(arr[i, j, k]) or np.isnan(arr[i, j, k]):
                    arr[i, j, k] = 0.0

@njit(parallel=True)
def compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length):
    return (strengths
        / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
        * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
    )

@njit(parallel=True)
def rDotProducts(b, c):
    assert b.shape == c.shape and b.shape[2] == 3
    n, m = b.shape[0], b.shape[1]
    ab = np.empty((n, m))
    ac = np.empty((n, m))
    for i in prange(n):
        for j in range(m):
            ab[i, j] = 0.0
            ac[i, j] = 0.0
            for k in range(3):
                a = b[i, j, k] - c[i, j, k]
                ab[i, j] += a * b[i, j, k]
                ac[i, j] += a * c[i, j, k]
    return (ab, ac)

# Compute `np.sum(arr, axis=1)` in parallel.
@njit(parallel=True)
def collapseArr(arr):
    assert arr.shape[2] == 3
    n, m = arr.shape[0], arr.shape[1]
    res = np.empty((n, 3))
    for i in prange(n):
        res[i, 0] = np.sum(arr[i, :, 0])
        res[i, 1] = np.sum(arr[i, :, 1])
        res[i, 2] = np.sum(arr[i, :, 2])
    return res

def calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths, collapse=True):
    r_1 = subtract(points, origins)
    r_2 = subtract(points, terminations)
    # NOTE: r_0 is computed on the fly by rDotProducts

    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)

    r_1_cross_r_2_absolute_magnitude = cross_absolute_magnitude(r_1_cross_r_2)

    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)

    radius = 3.0e-16
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0

    r_0_dot_r_1, r_0_dot_r_2 = rDotProducts(r_1, r_2)

    with np.errstate(divide="ignore", invalid="ignore"):
        k = compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length)
        k = np.expand_dims(k, axis=2)
        induced_velocities = k * r_1_cross_r_2

    discard_singularities(induced_velocities)

    if collapse:
        induced_velocities = collapseArr(induced_velocities)

    return induced_velocities

On my machine, this code is 2.5 times faster than the initial implementation for arrays of size 10**3. It also uses a bit less memory.
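The point about avoiding large intermediate arrays can be pushed one step further. The following is a sketch of one possible direction, not part of the implementation above and not benchmarked: when collapse is true, the whole calculation can be fused into a single loop that never materializes an (N x M x 3) array. The function name `collapsed_velocities_fused` is hypothetical, and the plain-Python fallback is only there so the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch runs (slowly) without Numba
    prange = range

    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@njit(parallel=True)
def collapsed_velocities_fused(points, origins, terminations, strengths):
    radius = 3.0e-16  # same cutoff as in the question
    n = points.shape[0]
    m = origins.shape[0]
    out = np.zeros((n, 3))
    for i in prange(n):  # each thread owns whole rows of `out`
        for j in range(m):
            # a = r_1, b = r_2 for this point/vortex pair.
            ax = points[i, 0] - origins[j, 0]
            ay = points[i, 1] - origins[j, 1]
            az = points[i, 2] - origins[j, 2]
            bx = points[i, 0] - terminations[j, 0]
            by = points[i, 1] - terminations[j, 1]
            bz = points[i, 2] - terminations[j, 2]
            # c = r_1 x r_2
            cx = ay * bz - az * by
            cy = az * bx - ax * bz
            cz = ax * by - ay * bx
            cross_sq = cx * cx + cy * cy + cz * cz
            len_a = np.sqrt(ax * ax + ay * ay + az * az)
            len_b = np.sqrt(bx * bx + by * by + bz * bz)
            # Singular pairs contribute nothing, which mirrors zeroing the
            # inf/nan entries afterwards.
            if cross_sq < radius or len_a < radius or len_b < radius:
                continue
            # d = r_0 = r_1 - r_2
            dx = ax - bx
            dy = ay - by
            dz = az - bz
            dot_a = dx * ax + dy * ay + dz * az
            dot_b = dx * bx + dy * by + dz * bz
            k = strengths[j] / (4.0 * np.pi * cross_sq) * (
                dot_a / len_a - dot_b / len_b
            )
            out[i, 0] += k * cx
            out[i, 1] += k * cy
            out[i, 2] += k * cz
    return out


# Small usage example: two points, one unit-strength vortex along the x axis.
demo_points = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
demo_origins = np.array([[-1.0, 0.0, 0.0]])
demo_terminations = np.array([[1.0, 0.0, 0.0]])
demo_strengths = np.array([1.0])
demo_result = collapsed_velocities_fused(
    demo_points, demo_origins, demo_terminations, demo_strengths
)
print(demo_result.shape)  # (2, 3)
```

This version covers only the collapse=True case, and skipping singular pairs should match the original behaviour for well-behaved inputs. Whether it is actually faster would need to be measured with the benchmark script from the question.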

Answer 1
4 Accepted 2021-03-23 03:51:00
