Why is looping through PyTorch tensors so slow (compared to NumPy)?

Question

I've been working with image transformations recently and came to a situation where I have a large array (shape of 100,000 x 3) where each row represents a point in 3D space like:

pnt = [x y z]

All I'm trying to do is iterate through the points and multiply each one by a matrix called T (shape = 3 x 3).

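Test with NumPy:

import numpy as np

def transform(pnt_cloud, T):
    # One output slot per point; entries stay 0 where the transformed x is not positive
    arr = np.zeros(pnt_cloud.shape[0])

    for i, pnt in enumerate(pnt_cloud):
        xyz_pnt = np.dot(T, pnt)  # (3, 3) dot (3,) -> (3,)

        if xyz_pnt[0] > 0:
            arr[i] = xyz_pnt[0]

    return arr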

Calling this function and timing it with %time gives:

Out[190]: CPU times: user 670 ms, sys: 7.91 ms, total: 678 ms
Wall time: 674 ms

Test with PyTorch Tensor:

import torch

tensor_cld = torch.tensor(pnt_cloud)
tensor_T   = torch.tensor(T)

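def transform(pnt_cloud, T):
    # float64 zeros, same dtype as torch.tensor(np.zeros(...)) would give
    depth_array = torch.zeros(pnt_cloud.shape[0], dtype=torch.float64)

    for i, pnt in enumerate(pnt_cloud):
        xyz_pnt = torch.matmul(T, pnt)  # (3, 3) matmul (3,) -> (3,)

        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]

    return depth_array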

Calling this function and timing it with %time gives:

Out[199]: CPU times: user 6.15 s, sys: 28.1 ms, total: 6.18 s
Wall time: 6.09 s

NOTE: Doing the same with torch.jit only reduces the runtime by about 2 s.

I would have thought that PyTorch tensor computations would be much faster due to the way PyTorch breaks its code down in the compiling stage. What am I missing here?

Would there be any faster way to do this other than using Numba?

Answer 1

Why are you using a for loop?
Why do you compute a full 3x3 matrix-vector product and only use the first element of the result?

You can do all the math in a single matmul:

with torch.no_grad():
  depth_array = torch.matmul(pnt_cloud, T[:1, :].T)  # nx3 dot 3x1 -> nx1
  # since you only want positive results, clamp everything else to 0
  # (torch.maximum needs two tensors, so a plain 0 would raise an error)
  depth_array = torch.clamp(depth_array, min=0)

Since you want to compare runtime to NumPy, you should disable gradient tracking, which is what torch.no_grad() does.
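The same idea carries over to the NumPy version from the question. Below is a minimal sketch, assuming the loop is meant to return the first component of T @ pnt where it is positive and 0 elsewhere. The names transform_loop and transform_vectorized, and the random test data, are made up for illustration; they are not from the question.

```python
import numpy as np

def transform_loop(pnt_cloud, T):
    # Reference: the per-point loop from the question.
    arr = np.zeros(pnt_cloud.shape[0])
    for i, pnt in enumerate(pnt_cloud):
        x = np.dot(T, pnt)[0]
        if x > 0:
            arr[i] = x
    return arr

def transform_vectorized(pnt_cloud, T):
    # First component of T @ pnt for every point at once: (n, 3) @ (3,) -> (n,)
    x = pnt_cloud @ T[0]
    # Keep positive values, write 0 elsewhere (mirrors the `if` in the loop).
    return np.where(x > 0, x, 0.0)

# Hypothetical test data, much smaller than the 100,000 points in the question.
rng = np.random.default_rng(0)
pnt_cloud = rng.standard_normal((1000, 3))
T = rng.standard_normal((3, 3))

same = np.allclose(transform_loop(pnt_cloud, T), transform_vectorized(pnt_cloud, T))
print("loop and vectorized results match:", same)
```

Only the first row of T is used, so the other two rows never need to be multiplied at all.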

Answer 2

For the speed, I got this reply from the PyTorch forums:

Operations on 1-3 elements are generally rather expensive in PyTorch, as the overhead of Tensor creation becomes significant (this includes setting single elements). I think this is the main thing here. This is also the reason why the JIT doesn't help a whole lot (it only takes away the Python overhead) and Numba shines (where, e.g., the assignment to depth_array[i] is just a memory write).
The matmul itself might differ in speed if you have different BLAS backends for it in PyTorch vs. NumPy.
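One rough way to see the per-element overhead described in that reply is to time single-element assignment on a NumPy array and on a tensor, next to a single bulk operation. This is only a sketch: the array size, the repeat count, and the helper names are arbitrary choices for illustration, and the absolute numbers will depend on the machine and the library versions.

```python
import timeit

import numpy as np
import torch

n = 10_000
np_arr = np.zeros(n)
t_arr = torch.zeros(n, dtype=torch.float64)

def fill_numpy():
    # One Python-level indexed write per element.
    for i in range(n):
        np_arr[i] = 1.0

def fill_torch():
    # Same loop, but each write goes through PyTorch's tensor indexing machinery.
    for i in range(n):
        t_arr[i] = 1.0

def fill_torch_bulk():
    # A single call that writes every element inside the library.
    t_arr.fill_(1.0)

for fn in (fill_numpy, fill_torch, fill_torch_bulk):
    secs = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {secs * 1e3:.2f} ms per call")
```

All three functions leave the same values in memory; the difference is how many times Python has to cross into the library to get there.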
