Most efficient way to compute the exponential of each element of a matrix

Question

I'm migrating from Matlab to C + GSL, and I would like to know the most efficient way to calculate the matrix B for which:

B[i][j] = exp(A[i][j])

where i in [0, Ny] and j in [0, Nx].

Note that this is different from the matrix exponential:

B = exp(A)

which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
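
For reference, the two operations can be written side by side. The element-wise form acts on each entry independently, while the matrix exponential is a power series in A and is only defined for square matrices:

```latex
B_{ij} = e^{A_{ij}} \quad \text{(element-wise)}, \qquad
e^{A} = \sum_{k=0}^{\infty} \frac{A^{k}}{k!} \quad \text{(matrix exponential)}
```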

So far I've only found the brute-force solution (a couple of 'for' loops), but is there a smarter way to do it?

EDIT

Results from the solution posted by Drew Hall

All the results are from a 1024x1024 for(for) loop in which two double values (a complex number) are assigned in each iteration. The time is the average over 100 executions.

Results when taking into account the {Row,Column}-Major mode used to store the matrix:
- 226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
- 223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
- 224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).

Source code for case 1:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 2:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 3:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}

Answer 1

There's no way to avoid iterating over all the elements and calling exp() or an equivalent on each one. But there are faster and slower ways to iterate.

In particular, your goal should be to minimize cache misses. Find out whether your data is stored in row-major or column-major order, and arrange your loops so that the inner loop iterates over elements stored contiguously in memory while the outer loop takes the big stride to the next row (if row-major) or column (if column-major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).

Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M and N bounds) to a single loop iterating over the underlying data (M*N bound). You'll need to get a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.

Finally, throw in some loop unrolling (that is, process 8 or 16 elements in each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).

Answer 2

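No. Unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to iterate through the elements with two for loops.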

Answer 3

If you just want to apply exp to an array of numbers, there's really no shortcut. You have to call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.

However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.

Answer 4

Since the contents of the loop (the bit that calculates c_value) haven't been shown, we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.

If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hurt throughput because data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.

If you're limited by the CPU, there are a few more options available to you. Using SIMD is one option, as is hand-coding the floating-point code (C/C++ compilers aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allowed for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. In each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer, so that each iteration of the loop handles five values. Then I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).

4 answers

Answer 1: 5, accepted, 2010-07-24 02:48:27
Answer 2: 3, 2010-07-23 21:38:58
Answer 3: 2, 2010-07-24 02:04:45
Answer 4: 0, 2010-07-25 23:12:54
