
Why is np.linalg.norm(..., axis=1) slower than writing out the formula for vector norms?

To normalize the rows of a matrix X to unit length, I usually use:

X /= np.linalg.norm(X, axis=1, keepdims=True)

Trying to optimize this operation for an algorithm, I was quite surprised to see that writing out the normalization is about 40% faster on my machine:

X /= np.sqrt(X[:,0]**2+X[:,1]**2+X[:,2]**2)[:,np.newaxis]
X /= np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
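
(Note that the two lines above are alternatives; only one of them should be applied. As a quick sanity check, not in the original question, all three expressions compute the same row norms:)

```python
import numpy as np

X = np.random.randn(10000, 3)

# All three expressions should yield identical row norms.
norms_builtin = np.linalg.norm(X, axis=1, keepdims=True)
norms_formula = np.sqrt(X[:, 0]**2 + X[:, 1]**2 + X[:, 2]**2)[:, np.newaxis]
norms_loop = np.sqrt(sum(X[:, i]**2 for i in range(X.shape[1])))[:, np.newaxis]

assert np.allclose(norms_builtin, norms_formula)
assert np.allclose(norms_builtin, norms_loop)
```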

How come? Where is the performance lost in np.linalg.norm() ?

import numpy as np
X = np.random.randn(10000,3)

%timeit X/np.linalg.norm(X,axis=1, keepdims=True)
# 276 µs ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit X/np.sqrt(X[:,0]**2+X[:,1]**2+X[:,2]**2)[:,np.newaxis]
# 169 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit X/np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
# 185 µs ± 4.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I observe this for (1) python3.6 + numpy v1.17.2 and (2) python3.9 + numpy v1.19.3 on a MacBook Pro 2015 with OpenBLAS support.

I don't think this is a duplicate of this post , which addresses matrix norms, while this question is about the L2 norm of vectors.

The source code for the row-wise L2 norm boils down to the following lines of code:

def norm(x, keepdims=False):
    # Simplified np.linalg.norm(x, axis=1): square, sum each row, take the root.
    x = np.asarray(x)
    s = x**2
    return np.sqrt(s.sum(axis=1, keepdims=keepdims))

The simplified code assumes real-valued x and uses the fact that np.add.reduce(s, ...) is equivalent to s.sum(...) .
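
A minimal check of that equivalence, on a small random array:

```python
import numpy as np

s = np.random.randn(1000, 3) ** 2

# ndarray.sum dispatches to the add ufunc's reduction along the given axis.
assert np.allclose(np.add.reduce(s, axis=1), s.sum(axis=1))
```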

The OP's question is therefore the same as asking why np.sum(x, axis=1) is slower than sum(x[:,i] for i in range(x.shape[1])) :

%timeit X.sum(axis=1, keepdims=False)
# 131 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sum(X[:,i] for i in range(X.shape[1]))
# 36.7 µs ± 91.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This question has already been answered here . In short, the reduction ( .sum(axis=1) ) comes with overhead costs that generally pay off in floating-point precision and speed (e.g. cache mechanics, parallelism), but don't in the special case of a reduction over just three columns. In that case, the overhead is large relative to the actual computation.

The situation changes if X has more columns. The numpy-based normalization is now substantially faster than the reduction using a Python for-loop:

X = np.random.randn(10000,100)
%timeit X/np.linalg.norm(X,axis=1, keepdims=True)
# 3.36 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit X/np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
# 5.92 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
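
As an aside (not part of the original answer), np.einsum offers yet another way to compute the squared row norms in a single pass, without materializing the intermediate x**2 array; it is often competitive with or faster than both variants above:

```python
import numpy as np

X = np.random.randn(10000, 3)

# Squared row norms in one pass: sum_j X[i,j]*X[i,j] for each row i.
sq_norms = np.einsum('ij,ij->i', X, X)
X_normalized = X / np.sqrt(sq_norms)[:, np.newaxis]

# Every row of the result should have unit length.
assert np.allclose(np.linalg.norm(X_normalized, axis=1), 1.0)
```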

Another related SO thread is found here: numpy ufuncs vs. for loop .

The question remains why numpy does not explicitly handle common special cases of reductions (such as summation over the columns or rows of a matrix with a small axis dimension). It may be because the benefit of such optimizations often depends strongly on the target machine and would increase code complexity considerably.
