
Analytical way of speeding up exp(A*x) in MATLAB

I need to calculate f(x)=exp(A*x) repeatedly for a tiny, variable column vector x and a huge, constant matrix A (many rows, few columns). In other words, the x are few, but the A*x are many. My problem dimensions are such that A*x takes about as much runtime as the exp() part.

I have tried a Taylor expansion and a precalculated table of values exp(y) (assuming the range y of the values of A*x is known), but neither gave a considerable speedup over what MATLAB does on its own while maintaining accuracy. I am therefore thinking about restating the problem analytically so that some values can be precalculated.

For example, I find that exp(A*x)_i = exp(\sum_j A_ij x_j) = \prod_j exp(A_ij x_j) = \prod_j exp(A_ij)^x_j

This would allow me to precalculate exp(A) once, but the exponentiation required in the loop is as costly as the original exp() call, and the multiplications (\prod) have to be carried out in addition.
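The identity can be checked numerically on a small stand-in problem. This is only a sketch: the sizes, the random data, and the variable names (E, viaProduct) are assumptions for illustration, not the real A.

```matlab
% Small stand-in for the real problem (sizes are illustrative assumptions).
rng(0);
A = sprandn(1000, 8, 0.05);     % sparse, many rows, few columns
x = randn(8, 1);

direct = exp(A * x);            % reference: element-wise exp of the product

% Precompute once. exp(0) = 1, so E is dense even though A is sparse.
E = exp(full(A));

% Per-x work: one power per entry of E, then a product along each row.
viaProduct = prod(E .^ (x.'), 2);   % implicit expansion, R2016b or later

relErr = max(abs(viaProduct - direct) ./ direct);
fprintf('max relative error: %.2e\n', relErr);
```

The per-x step needs one power for each of the rows-times-columns entries of E, compared with one exp() per row in the direct form, which is consistent with the observation that the restated problem is not cheaper.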

Is there any other idea that I could follow, or a solution within MATLAB that I may have missed?

Edit: some more details

A is 26873856 by 81 in size (yes, it's that huge), so x is 81 by 1. nnz(A) / numel(A) is 0.0012, nnz(A*x) / numel(A*x) is 0.0075. I already use a sparse matrix to represent A; however, exp() of a sparse matrix is no longer sparse. So in fact, I store x non-sparse and calculate exp(full(A*x)), which turned out to be as fast/slow as full(exp(A*x)). (I think A*x is non-sparse anyway, since x is non-sparse.) exp(full(A*sparse(x))) is a way to get a sparse A*x, but it is slower. Even slower variants are exp(A*sparse(x)) (with doubled memory impact for a non-sparse matrix of type sparse) and full(exp(A*sparse(x))) (which again yields a non-sparse result).

sx = sparse(x);
tic, for i = 1 : 10, exp(full(A*x)); end, toc
tic, for i = 1 : 10, full(exp(A*x)); end, toc
tic, for i = 1 : 10, exp(full(A*sx)); end, toc
tic, for i = 1 : 10, exp(A*sx); end, toc
tic, for i = 1 : 10, full(exp(A*sx)); end, toc

Elapsed time is 1.485935 seconds.
Elapsed time is 1.511304 seconds.
Elapsed time is 2.060104 seconds.
Elapsed time is 3.194711 seconds.
Elapsed time is 4.534749 seconds.

Yes, I do calculate the element-wise exp; I have updated the above equation to reflect that.

One more edit: I tried to be smart, with little success:

tic, for i = 1 : 10, B = exp(A*x); end, toc
tic, for i = 1 : 10, C = 1 + full(spfun(@(x) exp(x) - 1, A * sx)); end, toc
tic, for i = 1 : 10, D = 1 + full(spfun(@(x) exp(x) - 1, A * x)); end, toc
tic, for i = 1 : 10, E = 1 + full(spfun(@(x) exp(x) - 1, sparse(A * x))); end, toc
tic, for i = 1 : 10, F = 1 + spfun(@(x) exp(x) - 1, A * sx); end, toc
tic, for i = 1 : 10, G = 1 + spfun(@(x) exp(x) - 1, A * x); end, toc
tic, for i = 1 : 10, H = 1 + spfun(@(x) exp(x) - 1, sparse(A * x)); end, toc

Elapsed time is 1.490776 seconds.
Elapsed time is 2.031305 seconds.
Elapsed time is 2.743365 seconds.
Elapsed time is 2.818630 seconds.
Elapsed time is 2.176082 seconds.
Elapsed time is 2.779800 seconds.
Elapsed time is 2.900107 seconds.
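A further variant in the same spirit skips spfun and its anonymous function and fills in the nonzero positions directly. This is a sketch on made-up data (sizes and names are assumptions); since the timings above show that A*sx is itself slower than A*x, it may well not pay off on the real A.

```matlab
rng(0);
A  = sprandn(2000, 8, 0.01);    % illustrative sizes, not the real A
x  = randn(8, 1);
sx = sparse(x);

y = A * sx;                     % sparse * sparse stays sparse
[idx, ~, vals] = find(y);       % row indices and values of the nonzeros

f = ones(size(y, 1), 1);        % exp(0) = 1 for every zero entry of A*x
f(idx) = exp(vals);             % call exp only where A*x is nonzero

ref = exp(A * x);               % dense reference for comparison
maxErr = max(abs(f - ref));
fprintf('max abs difference: %.2e\n', maxErr);
```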

Computers don't really do exponents. You would think they do, but what they actually do is high-accuracy polynomial approximation.

References:

The last reference looked quite nice. Perhaps it should have been first.

Since you are working on images, you likely have a discrete number of intensity levels (typically 256 for 8-bit data). Depending on the nature of "A", this can allow reduced sampling or lookups. One way to check is to do something like the following for a sufficiently representative group of values of "x":

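y = A*x;          % A*x for one representative x
cdfplot(y(:))     % empirical CDF of the values (Statistics and Machine Learning Toolbox)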

If you were able to pre-segment your images into "more interesting" and "not as interesting" regions (for example, in an x-ray, trimming out all the "outside the human body" locations and clamping them to zero to pre-sparsify your data), that could reduce your number of unique values. You might consider the previous check for each unique "mode" inside the data.

My approaches would include:

  • look at alternate formulations of exp(x) that are lower accuracy but higher speed
  • consider table lookups if you have few enough levels of "x"
  • consider a combination of interpolation and table lookups if you have "slightly too many" levels to do a plain table lookup
  • consider a single lookup (or alternate formulation) based on segmented mode. If you know it is a bone and you are looking for a vein, then maybe it should get less high-cost data processing applied.
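The interpolation-plus-lookup idea from the list could look roughly like the following. It is a sketch under assumptions: the value range [-5, 5], the table size, and the names yGrid and expLUT are all invented for illustration, and the random y stands in for A*x.

```matlab
% Assumed range of A*x; in practice take it from representative x values.
yMin = -5;
yMax = 5;
nGrid = 4097;                              % table size, a tunable assumption

yGrid  = linspace(yMin, yMax, nGrid);
expLUT = griddedInterpolant(yGrid, exp(yGrid), 'linear');   % built once

rng(0);
y = yMin + (yMax - yMin) * rand(1e5, 1);   % stand-in for the values of A*x
approx = expLUT(y);                        % lookup with linear interpolation

relErr = max(abs(approx - exp(y)) ./ exp(y));
fprintf('max relative error: %.2e\n', relErr);
```

For linear interpolation of exp on a grid with spacing h, the relative error is roughly h^2/8, so the table size sets the accuracy directly. Whether the lookup beats the built-in exp() in speed has to be measured on the real data.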

Now I have to ask myself why you would be living in so many iterations of exp(A*x)*x, and I think you might be switching back and forth between the frequency/wavenumber domain and the time/space domain. You also might be dealing with probabilities using exp(x) as a basis, and doing some Bayesian fun. I don't know that exp(x) is a good conjugate prior, so I'm going to go with the Fourier material.

Other options: consider using fft, fft2, or fftn given your matrices. They are fast and might do part of what you are looking for.

I am sure there is a Fourier-domain variation on the following:

You might be able to mix the lookup with a compute using the Woodbury matrix identity. I would have to think about that some to be sure, though. (link) At one point I knew that everything that mattered (CFD, FEA, FFT) was all about the matrix inversion, but I have since forgotten the particular details.

Now, if you are living in MATLAB then you might consider using MATLAB Coder, which converts MATLAB code to C code. No matter how much fun an interpreter may be, a good C compiler can be a lot faster. The mnemonic (hopefully not too ambitious) that I use is shown here: link, starting around 13:49. It is really simple, but it shows the difference between a canonical interpreted language (Python) and a compiled version of the same (Cython/C).

I'm sure that if I had some more specifics, and was asked to, I could give a more specifically relevant answer.

You might not have a good way to do it on conventional hardware, but you might consider something like a GPGPU. CUDA and its peers have massively parallel operations that allow substantial speedup for the cost of a few video cards. You can have thousands of "cores" (overglorified pipelines) doing the work of a few ALUs, and if the job is properly parallelizable (as this one looks to be) then it can get done a LOT faster.
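Within MATLAB, one way to try this is gpuArray from the Parallel Computing Toolbox. The following is a sketch, assuming that toolbox and a supported GPU are available; it falls back to the CPU otherwise, and the sizes and names are illustrative.

```matlab
rng(0);
A = sprandn(5000, 8, 0.01);     % illustrative sizes, not the real A
x = randn(8, 1);

% Detect a usable GPU without failing when the toolbox is missing.
try
    useGpu = gpuDeviceCount > 0;
catch
    useGpu = false;
end

if useGpu
    Ag = gpuArray(A);                       % move the constant matrix once
    f  = gather(exp(Ag * gpuArray(x)));     % multiply and exp on the device
else
    f  = exp(A * x);                        % CPU fallback
end
f = full(f);
```

With tens of millions of rows, copying the result back with gather on every call may cost more than the computation itself, so this is most attractive when the downstream steps can also stay on the GPU.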

EDIT:

I was thinking about Eureqa. One option that I would consider, if I had some "big iron" for development but not production, would be to use their Eureqa product to come up with a fast enough, accurate enough approximation.

If you performed a 'quick' singular value decomposition of your "A" matrix, you would find that the dominant behavior is governed by 81 eigenvectors (the right singular vectors of A, which are the eigenvectors of A'*A). I would look at the eigenvalues and see whether only a few of those 81 eigenvectors provide the majority of the information. If that is the case, then you can clamp the others to zero and construct a simple transformation.
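Because A has only 81 columns, that spectrum can be inspected cheaply through the small Gram matrix A'*A instead of a full SVD. A sketch on stand-in data follows; the sizes and the 99% threshold are assumptions.

```matlab
rng(0);
A = sprandn(5000, 8, 0.05);         % illustrative stand-in for the real A

G = full(A' * A);                   % 8-by-8 Gram matrix, cheap to form
G = (G + G') / 2;                   % symmetrize against round-off
s = sqrt(max(eig(G), 0));           % singular values of A
s = sort(s, 'descend');

energy = cumsum(s.^2) / sum(s.^2);  % fraction captured by the leading k values
k = find(energy >= 0.99, 1);        % smallest k reaching 99% (assumed threshold)
fprintf('%d of %d singular values carry 99%% of the energy\n', k, numel(s));
```

Forming A'*A squares the condition number, so very small singular values come out less accurately this way; svds(A, k) is the alternative if that matters.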

Now, if it were me, I would want to get "A" out of the exponent. I'm wondering if you can look at the 81x81 eigenvector matrix and "x", think a little about linear algebra, and consider what space you are projecting your vectors into. Is there any way that you can make a function that looks like the following:

f(x) = B2 * exp( B1 * x )

such that B1 * x is of much smaller rank than your current A*x.


 