
Performance differences when looping over arrays of structs

I have recently started using Julia to speed up some code previously written in Python. I only have prior experience with Python, so this is my first time caring about performance, and I have found some strange behavior when looping over an array of structs. I am defining a new struct Gaussian, which represents a 2D Gaussian function, and a function intensity(), which calculates the amplitude of the function at a given position:

struct Gaussian{T<:Float32}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end

function intensity(
    model::Gaussian,
    x::Float32,
    y::Float32
    )
    gaussian_value::Float32 = model.A*exp(
        -( 
            model.a * (x - model.x0)^2 +
            2 * model.b * (x - model.x0) * (y - model.y0) +
            model.c * (y - model.y0)^2
        )
    )
    return gaussian_value
end
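For example, constructing a single Gaussian and evaluating it at a point looks like this (a minimal sketch; the parameter values are arbitrary):

g = Gaussian(0.0f0, 0.0f0, 1.0f0, 1.0f0, 0.0f0, 1.0f0)  # arbitrary example parameters
intensity(g, 0.5f0, 0.5f0)                               # amplitude at (0.5, 0.5)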

Then, I make an array of 2000 random instances of Gaussian:

function build_array()
    length = 2000
    random_pos = [rand(Float32, (1, 2)) for i in 1:length]
    random_A = rand(Float32, (length, 1))
    random_a = rand(Float32, (length, 1))
    random_b = rand(Float32, (length, 1))
    random_c = rand(Float32, (length, 1));

    gaussians::Array{Gaussian} = []

    for (pos, A, a, b, c) in zip(
        random_pos,
        random_A,
        random_a,
        random_b,
        random_c
        )
        new_gaussian = Gaussian(pos..., A, a, b, c)
        push!(gaussians, new_gaussian)
    end
    
    return gaussians
end

gaussians = build_array()

When I benchmark a single call to the intensity() function, it takes ~100 ns with 1 allocation (makes sense). I would expect that looping over the array of Gaussians should then take 2000 * 100 ns = 200 μs. However, it actually takes about twice as long:

function total_intensity1(gaussian_list::Array{Gaussian})
    total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))
end

function total_intensity2(gaussian_list::Array{Gaussian})
    total::Float32 = 0.
    for gaussian in gaussian_list
        total += intensity(gaussian, Float32(0.75), Float32(0.11))
    end
    return total 
end

@btime sum(intensity.(gaussians, Float32(0.75), Float32(0.11)))
@btime begin
    total::Float32 = 0.
    for gauss in gaussians
        total += intensity(gauss, Float32(0.75), Float32(0.11))
    end
    total
end
@btime total_intensity1(gaussians)
@btime total_intensity2(gaussians)
  397.700 μs (16004 allocations: 258.02 KiB)
  285.800 μs (8980 allocations: 234.06 KiB)
  396.100 μs (16002 allocations: 257.95 KiB)
  396.700 μs (16001 allocations: 250.02 KiB)

The number of allocations is also much larger than I would expect, and there is a difference between the second and fourth method even though the code is pretty much the same. My questions:

  1. Where do these differences come from?
  2. How can I improve the performance of the code?

EDIT: For reference, I ended up changing my code to the following:

struct Gaussian
    x0::Float32
    y0::Float32
    A::Float32
    a::Float32
    b::Float32
    c::Float32
end

function build_array()
    N = 2000
    random_pos = [rand(Float32, (1, 2)) for i in 1:N]
    random_A = rand(Float32, N)
    random_a = rand(Float32, N)
    random_b = rand(Float32, N)
    random_c = rand(Float32, N);

    gaussians = Gaussian[]

    for (pos, A, a, b, c) in zip(
        random_pos,
        random_A,
        random_a,
        random_b,
        random_c
        )
        new_gaussian = Gaussian(pos..., A, a, b, c)
        push!(gaussians, new_gaussian)
    end
    
    return gaussians
end

gaussians = build_array()

function intensity(
    model::Gaussian,
    x,
    y
    )
    (;x0, y0, A, a, b, c) = model
    A*exp(-(a * (x - x0)^2 + 2 * b * (x - x0) * (y - y0) + c * (y - y0)^2))
end

function total_intensity(gaussian_list::Vector{<:Gaussian})
    total = sum(g->intensity(g, Float32(0.75), Float32(0.11)), gaussian_list)
end

@btime total_intensity($gaussians)

Which runs much faster:

10.900 μs (0 allocations: 0 bytes)

Thank you to Nils Gudat and DNF for their suggestions!

TLDR version: Vector{Gaussian} should be Vector{Gaussian{Float32}}.

Your struct definition Gaussian{T<:Float32} is somewhat nonsensical. Float32 cannot have any subtypes, so T can only ever be Float32. Therefore, either remove the restriction, replace it with something else (e.g. Real), or take away the type parameter entirely.
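To see why, a quick REPL check (only Base is involved): Float32 is a concrete type, so nothing other than Float32 itself can satisfy T<:Float32.

julia> Float64 <: Float32   # Float32 has no proper subtypes
false

julia> Float32 <: Float32   # so T<:Float32 pins T to exactly Float32
true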

This is bad:

gaussians::Array{Gaussian} = []

It creates a Vector{Any}, which is then converted to a Vector{Gaussian}. Worse, Vector{Gaussian} is not a Vector{Gaussian{Float32}}. So either remove the whole type parameter, or make sure to use it:

# good:
gaussians = Vector{Gaussian{Float32}}()
gaussians = Gaussian{Float32}[] # same as above

# bad
gaussians = Vector{Gaussian}()
# very bad, don't use this style, put types on the right hand side when constructing.
gaussians::Vector{Gaussian} = []
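The practical difference: Gaussian without its parameter is not a concrete type, so a Vector{Gaussian} has to store boxed elements, whereas Gaussian{Float32} is concrete and isbits, so a Vector{Gaussian{Float32}} stores the structs inline. A quick check, assuming the struct from the question:

julia> isconcretetype(Gaussian), isconcretetype(Gaussian{Float32})
(false, true)

julia> isbitstype(Gaussian{Float32})   # an immutable struct of Float32 fields stores inline
true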

Same here, bad style:

total::Float32 = 0.

Do this instead:

total = Float32(0.0)
# or use Float32 literal
total = 0.0f0
# or the generic way
total = zero(Float32) 

In dynamic languages, types belong to values, not to variables.

BTW, you'll have to modify some of your function definitions:

total_intensity1(gaussian_list::Array{Gaussian})

should be

total_intensity1(gaussian_list::Array{<:Gaussian})
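The <: is needed because Julia's type parameters are invariant: even though Gaussian{Float32} is a Gaussian, a Vector{Gaussian{Float32}} is not a Vector{Gaussian}. A quick check, assuming the struct above:

julia> Gaussian{Float32} <: Gaussian
true

julia> Vector{Gaussian{Float32}} <: Vector{Gaussian}     # invariance: this does not hold
false

julia> Vector{Gaussian{Float32}} <: Vector{<:Gaussian}   # this is what the annotation matches
true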

There's more, but this is a start.

Edit: OK, a few more things:

  1. rand(Float32, (length, 1)): length is a super important function in Base, so it's normally good not to shadow it like this. And, make vectors instead of matrices:

    rand(Float32, (N, 1)) # this is an Nx1 matrix

    rand(Float32, N) # this is a length-N vector

  2. push!(gaussians, new_gaussian): This iteratively resizes the vector over and over. When you know the size of the vector in advance, as in your case, it is better to pre-allocate (see the sketch after this list):

    gaussians = Vector{Gaussian{Float32}}(undef, 2000)

  3. You can avoid an unnecessary allocation here:

    total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))

    like this:

    total = sum(g->intensity(g, 0.75f0, 0.11f0), gaussian_list)

Explanation: sum(f.(x)) first creates the array f.(x), then sums it, while sum(f, x) just applies f to each element before adding it to the sum.
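For point 2, a minimal sketch of a pre-allocated build loop (build_array_preallocated is a made-up name; it assumes the Gaussian{Float32} struct and just uses random parameters like the original):

function build_array_preallocated(N=2000)
    gaussians = Vector{Gaussian{Float32}}(undef, N)   # one allocation up front
    for i in 1:N
        # write into slot i instead of push!-ing and repeatedly resizing
        gaussians[i] = Gaussian(ntuple(_ -> rand(Float32), 6)...)
    end
    return gaussians
end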

Here's an implementation with benchmarks:

struct Gaussian{T<:Real}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end
Gaussian(x::Real...) = Gaussian(promote(x...)...)

function intensity(model::Gaussian, x::Real, y::Real)
    val = model.A * exp(
        -( 
            model.a * (x - model.x0)^2 +
            2 * model.b * (x - model.x0) * (y - model.y0) +
            model.c * (y - model.y0)^2
        )
    )
    return val
end

function build_array(N=2000)
    return [Gaussian(ntuple(_->rand(Float32), 6)...) for _ in 1:N]
end

Benchmarks (remember to interpolate variables, and avoid global scope):

julia> gaussians = build_array(2000);

julia> @btime sum(intensity.($gaussians, Float32(0.75), Float32(0.11)))
  14.600 μs (1 allocation: 7.94 KiB)
947.5305f0

julia> @btime sum(g->intensity(g, 0.75f0, 0.11f0), $gaussians)
  12.600 μs (0 allocations: 0 bytes)
947.5309f0

There's a slight difference in the final sums, since sum of a vector uses a numerically superior method called pairwise summation.
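To illustrate the effect (a minimal sketch; the point is only that the two accumulation orders round differently, not any particular digits):

v = rand(Float32, 2_000_000)

sum(v)           # pairwise summation over the vector
foldl(+, v)      # strict left-to-right accumulation; rounding error grows faster
sum(Float64, v)  # higher-precision reference to compare both against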

Final bonus: Try Tullio.jl, which also uses multithreading. It doesn't make any difference for 2000 elements, but it does for larger arrays (using 12 threads here):

julia> using Tullio, LoopVectorization

julia> gaussians = build_array(200_000);

julia> @btime sum(g->intensity(g, 0.75f0, 0.11f0), $gaussians)
  1.228 ms (0 allocations: 0 bytes)
92722.7f0

julia> @btime @tullio s := intensity($gaussians[i], 0.75f0, 0.11f0)
  330.100 μs (197 allocations: 11.53 KiB)
92722.79f0

I don't have time to figure this one out in detail unfortunately, but the first thing I'd say is to check whether this is a benchmarking artifact: gaussians is a global variable, which should be interpolated into the benchmark using $.

As to your function, the type annotations are not doing anything for performance here, and they will make your function less composable (e.g. you won't be able to autodiff through it, given you're restricting everything to Float32).

Here's how I would write it:

function intensity(m, x, y)
    (; x0, y0, A, a, b, c) = m # destructure the input by field name
    A * exp(-(a * (x - x0)^2 + 2b * (x - x0) * (y - y0) + c * (y - y0)^2))
end
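Because nothing is annotated, the same function also composes with other numeric types; for instance, Float64 coordinates simply promote (a small sketch using one element of the array built earlier):

g = gaussians[1]
intensity(g, 0.75f0, 0.11f0)   # Float32 inputs, Float32 result
intensity(g, 0.75, 0.11)       # Float64 coordinates also work; the result promotes to Float64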

With that I'm getting:

  231.100 μs (12001 allocations: 195.44 KiB)
  231.500 μs (12001 allocations: 195.44 KiB)
  229.200 μs (12000 allocations: 187.50 KiB)
  229.300 μs (12000 allocations: 187.50 KiB)

which is about 100 μs faster than your original version on my machine.
