Can I speed up this aerodynamics calculation with Numba, vectorization, or multiprocessing?

Question

Problem:

I am trying to increase the speed of an aerodynamics function in Python.

Function Set:

import numpy as np
from numba import njit

def calculate_velocity_induced_by_line_vortices(
    points, origins, terminations, strengths, collapse=True
):

    # Expand the dimensionality of the points input. It is now of shape (N x 1 x 3).
    # This will allow NumPy to broadcast the upcoming subtractions.
    points = np.expand_dims(points, axis=1)
    
    # Define the vectors from the vortex to the points. r_1 and r_2 now both are of
    # shape (N x M x 3). Each row/column pair holds the vector associated with each
    # point/vortex pair.
    r_1 = points - origins
    r_2 = points - terminations
    
    r_0 = r_1 - r_2
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = (
        r_1_cross_r_2[:, :, 0] ** 2
        + r_1_cross_r_2[:, :, 1] ** 2
        + r_1_cross_r_2[:, :, 2] ** 2
    )
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)
    
    # Define the radius of the line vortices. This is used to get rid of any
    # singularities.
    radius = 3.0e-16
    
    # Set the lengths and the absolute magnitudes to zero, at the places where the
    # lengths and absolute magnitudes are less than the vortex radius.
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
    
    # Calculate the vector dot products.
    r_0_dot_r_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
    r_0_dot_r_2 = np.einsum("ijk,ijk->ij", r_0, r_2)
    
    # Calculate k and then the induced velocity, ignoring any divide-by-zero or nan
    # errors. k is of shape (N x M)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = (
            strengths
            / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
            * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
        )
    
        # Set the shape of k to be (N x M x 1) to support numpy broadcasting in the
        # subsequent multiplication.
        k = np.expand_dims(k, axis=2)
    
        induced_velocities = k * r_1_cross_r_2
    
    # Set the values of the induced velocity to zero where there are singularities.
    induced_velocities[np.isinf(induced_velocities)] = 0
    induced_velocities[np.isnan(induced_velocities)] = 0

    if collapse:
        induced_velocities = np.sum(induced_velocities, axis=1)

    return induced_velocities


@njit    
def nb_2d_explicit_norm(vectors):
    return np.sqrt(
        (vectors[:, :, 0]) ** 2 + (vectors[:, :, 1]) ** 2 + (vectors[:, :, 2]) ** 2
    )


@njit
def nb_2d_explicit_cross(a, b):
    e = np.zeros_like(a)
    e[:, :, 0] = a[:, :, 1] * b[:, :, 2] - a[:, :, 2] * b[:, :, 1]
    e[:, :, 1] = a[:, :, 2] * b[:, :, 0] - a[:, :, 0] * b[:, :, 2]
    e[:, :, 2] = a[:, :, 0] * b[:, :, 1] - a[:, :, 1] * b[:, :, 0]
    return e

Context:

This function is used by Ptera Software, an open-source solver for flapping wing aerodynamics. According to the profiler output, it is by far the largest contributor to Ptera Software's run time.

Currently, Ptera Software takes just over 3 minutes to run a typical case, and my goal is to get this below 1 minute.

The function takes in a group of points, origins, terminations, and strengths. At every point, it finds the induced velocity due to the line vortices, which are characterized by the groups of origins, terminations, and strengths. If collapse is true, the output is the cumulative velocity induced at each point by all of the vortices. If it is false, the function outputs each vortex's contribution to the velocity at each point.
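For reference, the per-pair formula can be read directly off the code in the Function Set. The symbols below are labels chosen here, not names from Ptera Software: p is a point, a and b are the origin and termination of one line vortex, and Γ is its strength. This is a transcription of what the code computes, not an independent derivation.

```latex
\[
\mathbf{r}_1 = \mathbf{p} - \mathbf{a}, \qquad
\mathbf{r}_2 = \mathbf{p} - \mathbf{b}, \qquad
\mathbf{r}_0 = \mathbf{r}_1 - \mathbf{r}_2
\]
\[
\mathbf{v} = \frac{\Gamma}{4\pi}\,
\frac{\mathbf{r}_1 \times \mathbf{r}_2}{\lvert \mathbf{r}_1 \times \mathbf{r}_2 \rvert^{2}}
\left(
  \frac{\mathbf{r}_0 \cdot \mathbf{r}_1}{\lvert \mathbf{r}_1 \rvert}
  - \frac{\mathbf{r}_0 \cdot \mathbf{r}_2}{\lvert \mathbf{r}_2 \rvert}
\right)
\]
```

Note that the variable r_1_cross_r_2_absolute_magnitude in the code holds the squared magnitude of the cross product, which is the denominator above. With collapse set to true, the result at each point is the sum of v over all vortices.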

During a typical run, the velocity function is called approximately 2000 times. At first, the calls involve relatively small input arguments (around 200 points, origins, terminations, and strengths). Later calls involve large input arguments (around 400 points and around 6,000 origins, terminations, and strengths). An ideal solution would be fast for inputs of all sizes, but increasing the speed of the large-input calls is more important.

For testing, I recommend running the following script with your own implementation of the function:

import timeit

import matplotlib.pyplot as plt
import numpy as np

n_repeat = 2
n_execute = 10 ** 3
min_oom = 0
max_oom = 3

times_py = []

for i in range(max_oom - min_oom + 1):
    n_elem = 10 ** i
    n_elem_pretty = np.format_float_scientific(n_elem, 0)
    print("Number of elements: " + n_elem_pretty)

    # Benchmark Python.
    print("\tBenchmarking Python...")
    setup = '''
import numpy as np

these_points = np.random.random((''' + str(n_elem) + ''', 3))
these_origins = np.random.random((''' + str(n_elem) + ''', 3))
these_terminations = np.random.random((''' + str(n_elem) + ''', 3))
these_strengths = np.random.random(''' + str(n_elem) + ''')

def calculate_velocity_induced_by_line_vortices(points, origins, terminations,
                                                strengths, collapse=True):
    pass
    '''
    statement = '''
results_orig = calculate_velocity_induced_by_line_vortices(these_points, these_origins,
                                                           these_terminations,
                                                           these_strengths)
    '''
    
    times = timeit.repeat(repeat=n_repeat, stmt=statement, setup=setup, number=n_execute)
    time_py = min(times)/n_execute
    time_py_pretty = np.format_float_scientific(time_py, 2)
    print("\t\tAverage Time per Loop: " + time_py_pretty + " s")

    # Record the times.
    times_py.append(time_py)

sizes = [10 ** i for i in range(max_oom - min_oom + 1)]

fig, ax = plt.subplots()

ax.plot(sizes, times_py, label='Python')
ax.set_xscale("log")
ax.set_xlabel("Size of List or Array (elements)")
ax.set_ylabel("Average Time per Loop (s)")
ax.set_title(
    "Comparison of Different Optimization Methods\nBest of "
    + str(n_repeat)
    + " Runs, each with "
    + str(n_execute)
    + " Loops"
)
ax.legend()
plt.show()

Previous Attempts:

My prior attempts at speeding up this function involved vectorizing it (which worked great, so I kept those changes) and trying out Numba's JIT compiler. I had mixed results with Numba. When I tried to use Numba on a modified version of the entire velocity function, my results were much slower than before. However, I found that Numba significantly sped up the cross-product and norm functions, which I implemented above.

Updates:

Update 1:

Based on Mercury's comment (which has since been deleted), I replaced

points = np.expand_dims(points, axis=1)
r_1 = points - origins
r_2 = points - terminations

with two calls to the following function:

@njit
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c

This reduced the run time from 227 s to 220 s. This is better; however, it is still not fast enough.

I have also tried setting the njit fastmath flag to true, and using a Numba function instead of the calls to np.einsum. Neither increased the speed.

Update 2:

With Jérôme Richard's answer, the run time is now 156 s, which is a decrease of 29%. I'm satisfied enough to accept this answer, but feel free to make other suggestions if you think you can improve on their work!

Answer 1

First of all, Numba can run computations in parallel, which results in faster code, if you explicitly request it, mainly with parallel=True and prange. This is useful for big arrays (but not for small ones).

Moreover, your computation is mainly memory bound. Thus, you should avoid creating big arrays when they are not reused multiple times or, more generally, when they can be recomputed on the fly relatively cheaply. This is the case for r_0, for example.
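To put a rough number on this, here is a back-of-the-envelope estimate using the sizes quoted in the question for the later calls. It only counts one float64 array of shape (N x M x 3) and ignores any temporaries:

```python
n_points = 400  # approximate size of the later calls, from the question
n_vortices = 6000
bytes_per_float64 = 8

# One array of shape (n_points, n_vortices, 3), such as r_1, r_2, or r_0.
bytes_per_array = n_points * n_vortices * 3 * bytes_per_float64
megabytes_per_array = bytes_per_array / 1e6
print(megabytes_per_array)  # 57.6
```

The original function builds several arrays of this shape (r_1, r_2, r_0, the cross product, and the induced velocities). At roughly 57.6 MB each, they are probably larger than the CPU caches on most machines (an assumption about the hardware), so every pass over them has to stream data from RAM.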

In addition, the memory access pattern matters: vectorization is more efficient when accesses are contiguous in memory, and the cache and RAM are then used more efficiently. Consequently, arr[0, :, :] = 0 should be faster than arr[:, :, 0] = 0. Similarly, arr[:, :, 0] = arr[:, :, 1] = 0 should be much slower than arr[:, :, 0:2] = 0, since the former performs two non-contiguous passes over memory while the latter performs only one, more contiguous, pass. Sometimes it can be beneficial to transpose your data so that the subsequent calculations are much faster.
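A small sketch for checking this on your own machine. The array shape is borrowed from the sizes in the question, and the timings are printed rather than asserted because they depend on the hardware:

```python
import timeit

import numpy as np

arr = np.ones((400, 6000, 3))  # C-ordered: the last axis varies fastest in memory

# A slice along the first axis is one contiguous block of memory...
first_axis_contiguous = arr[0, :, :].flags["C_CONTIGUOUS"]
# ...while a slice along the last axis touches every third element.
last_axis_contiguous = arr[:, :, 0].flags["C_CONTIGUOUS"]
print(first_axis_contiguous, last_axis_contiguous)


def two_strided_passes():
    arr[:, :, 0] = 0
    arr[:, :, 1] = 0


def one_pass():
    arr[:, :, 0:2] = 0


print("two strided passes:", timeit.timeit(two_strided_passes, number=10))
print("one pass:", timeit.timeit(one_pass, number=10))
```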

Moreover, NumPy tends to create many temporary arrays, which are costly to allocate. This is a huge problem when the input arrays are small. The Numba JIT can avoid that in most cases.
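To make the temporaries visible, here is a NumPy-only illustration. It is a variation for comparison, not what this answer uses, and whether it is actually faster has to be measured. The first expression allocates a new array for each `**` and `+`, while the second accumulates into a single output array and takes the square root in place:

```python
import numpy as np

vectors = np.random.random((400, 600, 3))

# Each ** and + below allocates a new (400, 600) array before the final sqrt.
norm_with_temporaries = np.sqrt(
    vectors[:, :, 0] ** 2 + vectors[:, :, 1] ** 2 + vectors[:, :, 2] ** 2
)

# einsum accumulates the sum of squares directly into one output array,
# and the sqrt then overwrites that same array in place.
norm_fewer_temporaries = np.einsum("ijk,ijk->ij", vectors, vectors)
np.sqrt(norm_fewer_temporaries, out=norm_fewer_temporaries)
```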

Finally, regarding your computation, it may be a good idea to use GPUs for big arrays (definitely not for small ones). You can take a look at cupy or clpy to do that quite easily.

Here is an optimized implementation working on the CPU:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in prange(c.shape[0]):
        for j in range(c.shape[1]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c

@njit(parallel=True)
def nb_2d_explicit_norm(vectors):
    res = np.empty((vectors.shape[0], vectors.shape[1]))
    for i in prange(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sqrt(vectors[i, j, 0] ** 2 + vectors[i, j, 1] ** 2 + vectors[i, j, 2] ** 2)
    return res

# NOTE: better memory access pattern
@njit(parallel=True)
def nb_2d_explicit_cross(a, b):
    e = np.empty(a.shape)
    for i in prange(e.shape[0]):
        for j in range(e.shape[1]):
            e[i, j, 0] = a[i, j, 1] * b[i, j, 2] - a[i, j, 2] * b[i, j, 1]
            e[i, j, 1] = a[i, j, 2] * b[i, j, 0] - a[i, j, 0] * b[i, j, 2]
            e[i, j, 2] = a[i, j, 0] * b[i, j, 1] - a[i, j, 1] * b[i, j, 0]
    return e

# NOTE: avoid the slow building of temporary arrays
@njit(parallel=True)
def cross_absolute_magnitude(cross):
    return cross[:, :, 0] ** 2 + cross[:, :, 1] ** 2 + cross[:, :, 2] ** 2

# NOTE: again avoids the slow building of temporary arrays and multiple passes over memory
# Warning: this modifies the array in place
@njit(parallel=True)
def discard_singularities(arr):
    for i in prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(3):
                if np.isinf(arr[i, j, k]) or np.isnan(arr[i, j, k]):
                    arr[i, j, k] = 0.0

@njit(parallel=True)
def compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length):
    return (strengths
        / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
        * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
    )

@njit(parallel=True)
def rDotProducts(b, c):
    assert b.shape == c.shape and b.shape[2] == 3
    n, m = b.shape[0], b.shape[1]
    ab = np.empty((n, m))
    ac = np.empty((n, m))
    for i in prange(n):
        for j in range(m):
            ab[i, j] = 0.0
            ac[i, j] = 0.0
            for k in range(3):
                a = b[i, j, k] - c[i, j, k]
                ab[i, j] += a * b[i, j, k]
                ac[i, j] += a * c[i, j, k]
    return (ab, ac)

# Compute `np.sum(arr, axis=1)` in parallel.
@njit(parallel=True)
def collapseArr(arr):
    assert arr.shape[2] == 3
    n, m = arr.shape[0], arr.shape[1]
    res = np.empty((n, 3))
    for i in prange(n):
        res[i, 0] = np.sum(arr[i, :, 0])
        res[i, 1] = np.sum(arr[i, :, 1])
        res[i, 2] = np.sum(arr[i, :, 2])
    return res

def calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths, collapse=True):
    r_1 = subtract(points, origins)
    r_2 = subtract(points, terminations)
    # NOTE: r_0 is computed on the fly by rDotProducts

    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)

    r_1_cross_r_2_absolute_magnitude = cross_absolute_magnitude(r_1_cross_r_2)

    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)

    radius = 3.0e-16
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0

    r_0_dot_r_1, r_0_dot_r_2 = rDotProducts(r_1, r_2)

    with np.errstate(divide="ignore", invalid="ignore"):
        k = compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length)
        k = np.expand_dims(k, axis=2)
        induced_velocities = k * r_1_cross_r_2

    discard_singularities(induced_velocities)

    if collapse:
        induced_velocities = collapseArr(induced_velocities)

    return induced_velocities

On my machine, this code is 2.5 times faster than the initial implementation on arrays of size 10**3. It also uses a bit less memory.

Answer 1: 4, accepted, 2021-03-23 03:51:00
