Efficient way to compute the Vandermonde matrix
I'm calculating the Vandermonde matrix for a fairly large 1D array. The natural and clean way to do this is np.vander(). However, I found that it is approximately 2.5x slower than a list-comprehension-based approach.
In [43]: x = np.arange(5000)
In [44]: N = 4
In [45]: %timeit np.vander(x, N, increasing=True)
155 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# one of the listed approaches from the documentation
In [46]: %timeit np.flip(np.column_stack([x**(N-1-i) for i in range(N)]), axis=1)
65.3 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [47]: np.all(np.vander(x, N, increasing=True) == np.flip(np.column_stack([x**(N-1-i) for i in range(N)]), axis=1))
Out[47]: True
I'm trying to understand where the bottleneck is and why the native np.vander() implementation is about 2.5x slower.
Efficiency matters for my implementation, so even faster alternatives are welcome!
How about broadcasted exponentiation?
%timeit (x ** np.arange(N)[:, None]).T
43 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Sanity check:
np.all((x ** np.arange(N)[:, None]).T == np.vander(x, N, increasing=True))
True
The caveat is that this speedup is possible only if your input array x has an integer dtype. As @Warren Weckesser pointed out in a comment, broadcasted exponentiation slows down for floating-point arrays.
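To check the dtype caveat on your own machine, a minimal sketch along these lines may help. The helper name `vander_bc` is hypothetical (it is not part of NumPy), and no timings are quoted here because they depend on hardware and NumPy version.

```python
import numpy as np
from timeit import timeit


def vander_bc(x, N):
    """Increasing-power Vandermonde matrix via broadcasted exponentiation.

    Hypothetical helper for illustration only; not part of NumPy.
    """
    return (x ** np.arange(N)[:, None]).T


N = 4
x_int = np.arange(5000)
x_float = x_int.astype(float)

# Both dtypes agree with np.vander for this small N.
assert np.array_equal(vander_bc(x_int, N), np.vander(x_int, N, increasing=True))
assert np.allclose(vander_bc(x_float, N), np.vander(x_float, N, increasing=True))

# Time both approaches per dtype; the numbers will differ between setups.
for label, x in [("int", x_int), ("float", x_float)]:
    t_bc = timeit(lambda: vander_bc(x, N), number=200)
    t_np = timeit(lambda: np.vander(x, N, increasing=True), number=200)
    print(f"{label:5s} broadcast: {t_bc:.4f} s   np.vander: {t_np:.4f} s")
```

If the float row shows the broadcast version falling behind, that matches the comment quoted above.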
As for why np.vander is slow, take a look at the source code:
x = asarray(x)
if x.ndim != 1:
    raise ValueError("x must be a one-dimensional array or sequence.")
if N is None:
    N = len(x)
v = empty((len(x), N), dtype=promote_types(x.dtype, int))
tmp = v[:, ::-1] if not increasing else v
if N > 0:
    tmp[:, 0] = 1
if N > 1:
    tmp[:, 1:] = x[:, None]
    multiply.accumulate(tmp[:, 1:], out=tmp[:, 1:], axis=1)
return v
The function has to cater to many more use cases than yours, so it uses a more general method of computation that is reliable but slower (I'm specifically pointing to multiply.accumulate).
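To see what the multiply.accumulate step in that source does, here is a small worked sketch on a three-element input. It mirrors the lines quoted above but fixes the dtype to int64 and the `increasing=True` orientation for simplicity; those choices are assumptions made for the illustration.

```python
import numpy as np

x = np.array([2, 3, 4])
N = 4

v = np.empty((len(x), N), dtype=np.int64)
v[:, 0] = 1              # first column is x**0
v[:, 1:] = x[:, None]    # every remaining column starts as a copy of x
# Running product along each row turns x, x, x into x, x*x, x*x*x.
np.multiply.accumulate(v[:, 1:], out=v[:, 1:], axis=1)

print(v)
assert np.array_equal(v, np.vander(x, N, increasing=True))
```

Each row is filled by repeated multiplication rather than by calling a power function per column.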
As a matter of interest, I found another way of computing the Vandermonde matrix:
%timeit x[:, None] ** np.arange(N)
150 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It does the same thing, but is much slower. The answer lies in the fact that the operations are broadcast, but inefficiently.
On the flip side, for float arrays, this actually ends up performing the best.
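The two broadcast forms compute the same values but produce them in different memory layouts, which can be inspected directly. This sketch only shows the layout difference; it does not by itself establish why one form is faster.

```python
import numpy as np

x = np.arange(5000)
N = 4

a = (x ** np.arange(N)[:, None]).T   # computed as (N, len(x)), then transposed
b = x[:, None] ** np.arange(N)       # computed directly as (len(x), N)

assert np.array_equal(a, b)

# a is a transposed view, so it is Fortran-ordered rather than C-ordered.
print("a:", a.flags.c_contiguous, a.flags.f_contiguous)
print("b:", b.flags.c_contiguous, b.flags.f_contiguous)
```

If downstream code needs a C-contiguous result, `np.ascontiguousarray(a)` makes a copy in that layout, at extra cost.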
Here are some more methods, some of which are quite a bit faster (on my computer) than what has been posted so far.
The most important observation, I think, is that it really depends a lot on how many degrees you want. Exponentiation (which I believe is special-cased for small integer exponents) only makes sense for small exponent ranges. The more exponents, the better the multiplication-based approaches fare.
I'd like to highlight a multiply.accumulate-based method (ma), which is similar to numpy's builtin approach but faster (and not because I skimped on checks; nnc, numpy-no-checks, demonstrates this). For all but the smallest exponent ranges it is actually the fastest for me.
For reasons I do not understand, the numpy implementation does three things that are, to the best of my knowledge, slow and unnecessary: (1) It makes quite a few copies of the base vector. (2) It makes them non-contiguous. (3) It does the accumulation in-place, which I believe forces buffering.
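As a rough illustration of how the copies in point (1) can be avoided, the sketch below follows the same idea as the ma function in the code listing: np.broadcast_to gives a zero-stride view of the base vector, and the accumulation writes into a separate output. The int64 dtype and the tiny input are assumptions chosen to keep the example easy to verify.

```python
import numpy as np

b = np.array([2, 3, 4], dtype=np.int64)
n_e = 4  # number of columns, i.e. exponents 0 .. n_e-1

# Read-only view that repeats b along axis 0 without copying it.
view = np.broadcast_to(b, (n_e - 1, len(b)))
assert view.strides[0] == 0
assert np.shares_memory(view, b)

out = np.empty((len(b), n_e), dtype=np.int64)
out[:, 0] = 1
# Accumulate down the rows of the view and write the running products
# into the transposed remaining columns of out (not in-place on the input).
np.multiply.accumulate(view, axis=0, out=out[:, 1:].T)

assert np.array_equal(out, np.vander(b, n_e, increasing=True))
```

The input here is never materialized as a (len(b), n_e-1) block, which is the difference from the builtin approach described above.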
Another thing I'd like to mention is that the fastest method for small ranges of ints (out_e_1, essentially a manual version of ma) is slowed down by a factor of more than two by the simple precaution of promoting to a larger dtype (safe_e_1, arguably a bit of a misnomer).
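The reason promotion matters can be shown with a small sketch: int32 arithmetic wraps around silently once a power no longer fits. The values 3 and 99 are picked for illustration. This is presumably also why the non-promoting variants are reported as "apparently failed" for int32 with 8 exponents in the timings, since the test bases go up to 99 and np.vander itself promotes.

```python
import numpy as np

b32 = np.array([3, 99], dtype=np.int32)
exact = [3 ** 7, 99 ** 7]              # Python ints do not overflow

wrapped = b32 ** 7                     # stays int32; 99**7 does not fit
promoted = b32.astype(np.int64) ** 7   # promote first, then exponentiate

print(wrapped.dtype, wrapped.tolist())
print(promoted.dtype, promoted.tolist())
print("exact:", exact)
```

Note that promoting only after the first multiplication, as safe_e_1 does with (b*b).astype(...), would not protect against an overflow in b*b itself, which may be what "misnomer" alludes to.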
The broadcasting-based methods are called bc_*, where * indicates the broadcast axis (b for base, e for exp). 'cheat' means the result is noncontiguous.
Timings (best of 3):
rep=100 n_b=5000 n_e=4 b_tp=<class 'numpy.int32'> e_tp=<class 'numpy.int32'>
vander 0.16699657 ms
bc_b 0.09595204 ms
bc_e 0.07959786 ms
ma 0.10755240 ms
nnc 0.16459018 ms
out_e_1 0.02037535 ms
out_e_2 0.02656622 ms
safe_e_1 0.04652272 ms
safe_e_2 0.04081079 ms
cheat bc_e_cheat 0.04668466 ms
rep=100 n_b=5000 n_e=8 b_tp=<class 'numpy.int32'> e_tp=<class 'numpy.int32'>
vander 0.25086462 ms
bc_b apparently failed
bc_e apparently failed
ma 0.15843041 ms
nnc 0.24713077 ms
out_e_1 apparently failed
out_e_2 apparently failed
safe_e_1 0.15970622 ms
safe_e_2 0.19672418 ms
bc_e_cheat apparently failed
rep=100 n_b=5000 n_e=4 b_tp=<class 'float'> e_tp=<class 'numpy.int32'>
vander 0.16225773 ms
bc_b 0.53315020 ms
bc_e 0.56200830 ms
ma 0.07626799 ms
nnc 0.16059748 ms
out_e_1 0.03653416 ms
out_e_2 0.04043702 ms
safe_e_1 0.04060494 ms
safe_e_2 0.04104209 ms
cheat bc_e_cheat 0.52966076 ms
rep=100 n_b=5000 n_e=8 b_tp=<class 'float'> e_tp=<class 'numpy.int32'>
vander 0.24542852 ms
bc_b 2.03353578 ms
bc_e 2.04281270 ms
ma 0.11075758 ms
nnc 0.24212880 ms
out_e_1 0.14809043 ms
out_e_2 0.19261359 ms
safe_e_1 0.15206112 ms
safe_e_2 0.19308420 ms
cheat bc_e_cheat 1.99176601 ms
Code:
import numpy as np
import types
from timeit import repeat
prom={np.dtype(np.int32): np.dtype(np.int64), np.dtype(float): np.dtype(float)}
def RI(k, N, dt, top=100):
    return np.random.randint(0, top if top else N, (k, N)).astype(dt)
def RA(k, N, dt, top=None):
    return np.add.outer(np.zeros((k,), int), np.arange(N)%(top if top else N)).astype(dt)
def RU(k, N, dt, top=100):
    return (np.random.random((k, N))*(top if top else N)).astype(dt)
def data(k, N_b, N_e, dt_b, dt_e, b_fun=RI, e_fun=RA):
    b = list(b_fun(k, N_b, dt_b))
    e = list(e_fun(k, N_e, dt_e))
    return b, e
def f_vander(b, e):
    return np.vander(b, len(e), increasing=True)
def f_bc_b(b, e):
    return b[:, None]**e
def f_bc_e(b, e):
    return np.ascontiguousarray((b**e[:, None]).T)
def f_ma(b, e):
    out = np.empty((len(b), len(e)), prom[b.dtype])
    out[:, 0] = 1
    np.multiply.accumulate(np.broadcast_to(b, (len(e)-1, len(b))), axis=0, out=out[:, 1:].T)
    return out
def f_nnc(b, e):
    out = np.empty((len(b), len(e)), prom[b.dtype])
    out[:, 0] = 1
    out[:, 1:] = b[:, None]
    np.multiply.accumulate(out[:, 1:], out=out[:, 1:], axis=1)
    return out
def f_out_e_1(b, e):
    out = np.empty((len(b), len(e)), b.dtype)
    out[:, 0] = 1
    out[:, 1] = b
    out[:, 2] = c = b*b
    for i in range(3, len(e)):
        c*=b
        out[:, i] = c
    return out
def f_out_e_2(b, e):
    out = np.empty((len(b), len(e)), b.dtype)
    out[:, 0] = 1
    out[:, 1] = b
    out[:, 2] = b*b
    for i in range(3, len(e)):
        out[:, i] = out[:, i-1] * b
    return out
def f_safe_e_1(b, e):
    out = np.empty((len(b), len(e)), prom[b.dtype])
    out[:, 0] = 1
    out[:, 1] = b
    out[:, 2] = c = (b*b).astype(prom[b.dtype])
    for i in range(3, len(e)):
        c*=b
        out[:, i] = c
    return out
def f_safe_e_2(b, e):
    out = np.empty((len(b), len(e)), prom[b.dtype])
    out[:, 0] = 1
    out[:, 1] = b
    out[:, 2] = b*b
    for i in range(3, len(e)):
        out[:, i] = out[:, i-1] * b
    return out
def f_bc_e_cheat(b, e):
    return (b**e[:, None]).T
for params in [(100, 5000, 4, np.int32, np.int32),
               (100, 5000, 8, np.int32, np.int32),
               (100, 5000, 4, float, np.int32),
               (100, 5000, 8, float, np.int32)]:
    k = params[0]
    dat = data(*params)
    ref = f_vander(dat[0][0], dat[1][0])
    print('rep={} n_b={} n_e={} b_tp={} e_tp={}'.format(*params))
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.allclose(ref, func(dat[0][0], dat[1][0]))
            if not func(dat[0][0], dat[1][0]).flags.c_contiguous:
                print('cheat', end=' ')
            print("{:16s}{:16.8f} ms".format(name[2:], np.min(repeat(
                'f(b.pop(), e.pop())', setup='b, e = data(*p)', globals={'f':func, 'data':data, 'p':params}, number=k)) * 1000 / k))
        except:
            print("{:16s} apparently failed".format(name[2:]))