
Why is looping through pytorch tensors so slow (compared to Numpy)?

I've been working with image transformations recently and came across a situation where I have a large array (shape 100,000 x 3), where each row represents a point in 3D space like:

pnt = [x y z]

All I'm trying to do is iterate through each point and matrix-multiply it with a matrix called T (shape = 3 x 3).

Test with Numpy:

import numpy as np

def transform(pnt_cloud, T):
    # preallocate the output (one depth value per point);
    # the original snippet used arr without defining it
    arr = np.zeros(pnt_cloud.shape[0])

    i = 0
    for pnt in pnt_cloud:
        xyz_pnt = np.dot(T, pnt)

        if xyz_pnt[0] > 0:
            arr[i] = xyz_pnt[0]

        i += 1

    return arr

Calling this function and timing it with %time gives the output:
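(The exact call isn't shown in the post; presumably something like:)

%time arr = transform(pnt_cloud, T)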

Out[190]: CPU times: user 670 ms, sys: 7.91 ms, total: 678 ms
Wall time: 674 ms

Test with Pytorch Tensor:

import numpy as np
import torch

tensor_cld = torch.tensor(pnt_cloud)
tensor_T   = torch.tensor(T)

def transform(pnt_cloud, T):
    # preallocate the output as a tensor
    depth_array = torch.tensor(np.zeros(pnt_cloud.shape[0]))

    i = 0
    for pnt in pnt_cloud:
        xyz_pnt = torch.matmul(T, pnt)

        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]

        i += 1

    return depth_array

Calling this function and timing it with %time gives the output:
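(Again, the exact call isn't shown; presumably something like:)

%time depth_array = transform(tensor_cld, tensor_T)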

Out[199]: CPU times: user 6.15 s, sys: 28.1 ms, total: 6.18 s
Wall time: 6.09 s

NOTE: Doing the same with torch.jit only shaves off about 2 s.

I would have thought that PyTorch tensor computations would be much faster due to the way PyTorch breaks its code down in the compiling stage. What am I missing here?

Would there be any faster way to do this other than using Numba?

Why are you using a for loop??
Why do you compute a 3x3 dot product and only use the first element of the result??

You can do all the math in a single matmul:

with torch.no_grad():
  depth_array = torch.matmul(pnt_cloud, T[:1, :].T)  # nx3 dot 3x1 -> nx1
  # since you only want non-negative results
  # (torch.clamp accepts a scalar bound, unlike torch.maximum)
  depth_array = torch.clamp(depth_array, min=0)
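
A quick sanity check that the single matmul reproduces the loop from the question (a sketch with synthetic data; the random arrays are illustrative, not from the post):

import numpy as np
import torch

pnt_cloud = np.random.randn(100_000, 3)
T = np.random.randn(3, 3)

tensor_cld = torch.tensor(pnt_cloud)
tensor_T   = torch.tensor(T)

# loop version from the question vs. the vectorized one-liner
loop_out = transform(tensor_cld, tensor_T)
with torch.no_grad():
    vec_out = torch.clamp(tensor_cld @ tensor_T[:1, :].T, min=0).squeeze(1)

print(torch.allclose(loop_out, vec_out))  # expect True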

Since you want to compare the runtime to numpy, you should disable gradient tracking (hence the torch.no_grad() above).

For the speed, I got this reply from the PyTorch forums:

  1. operations on 1-3 elements are generally rather expensive in PyTorch, as the overhead of Tensor creation becomes significant (this includes setting single elements); I think this is the main thing here. This is also the reason why the JIT doesn't help a whole lot (it only takes away the Python overhead) and why Numba shines (where e.g. the assignment to depth_array[i] is just a memory write).

  2. the matmul itself might differ in speed if you have different BLAS backends for it in PyTorch vs. NumPy.
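
To put the first point in numbers, a rough micro-benchmark sketch (timings are machine-dependent and purely illustrative):

import time
import torch

x = torch.randn(100_000, 3)
T = torch.randn(3, 3)

# one tiny matmul and one single-element write per iteration
t0 = time.perf_counter()
out = torch.empty(100_000)
for i in range(x.shape[0]):
    out[i] = torch.matmul(T, x[i])[0]
t1 = time.perf_counter()

# a single large matmul over the whole cloud
t2 = time.perf_counter()
out_vec = torch.matmul(x, T[:1, :].T).squeeze(1)
t3 = time.perf_counter()

print(f"loop: {t1 - t0:.3f} s   vectorized: {t3 - t2:.5f} s")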
