Python loop faster than Pandas

Question

The code below shows a plain Python loop running faster than pandas, which is the opposite of what I expected before testing. Am I using pandas incorrectly for this operation? In my run, the pandas solution is about 7 times slower:

Pandas time  0.0008931159973144531
Loop time    0.0001239776611328125

Code:

import pandas as pd
import numpy as np
import time
import torch

batch_size = 5
classes = 4
raw_target = torch.from_numpy(np.array([1, 0, 3, 2, 0]))
rows = np.array(range(batch_size))

t0 = time.time()
zeros = pd.DataFrame(0, index=range(batch_size), columns=range(classes))
zeros.iloc[[rows, raw_target.numpy()]] = 1
t1 = time.time()

print("Pandas time ", t1-t0)
t0 = time.time()

target = raw_target.numpy()
zeros = np.zeros((batch_size, classes), dtype=np.float64)
for zero, target in zip(zeros, target):
    zero[target] = 1

t1 = time.time()
print("Loop  time  ", t1-t0)

The code uses PyTorch because the actual code where the problem occurs uses PyTorch. What would be a better or optimal solution for this example? The resulting matrix is:

[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]

Answer 1

Depending on your use case, running everything through PyTorch could be advantageous (e.g. to keep all computations on the GPU).

The PyTorch-only solution follows the numpy indexing syntax (i.e. zeros[rows, raw_target] = 1.):

import numpy as np
import torch

batch_size = 5
classes = 4
raw_target = torch.from_numpy(np.array([1, 0, 3, 2, 0]))
# torch.arange replaces the deprecated torch.range; its end point is exclusive
rows = torch.arange(batch_size, dtype=torch.int64)

x = torch.zeros((batch_size, classes), dtype=torch.float64)
x[rows, raw_target] = 1.

print(x.detach())
# tensor([[ 0.,  1.,  0.,  0.],
#         [ 1.,  0.,  0.,  0.],
#         [ 0.,  0.,  0.,  1.],
#         [ 0.,  0.,  1.,  0.],
#         [ 1.,  0.,  0.,  0.]], dtype=torch.float64)
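If the goal is simply a one-hot matrix, torch.nn.functional.one_hot may be worth a look as an alternative to manual indexing. The following is a minimal sketch, assuming the targets are int64 class indices as in the question; the cast to float64 is only there to match the dtype used above:

```python
import torch
import torch.nn.functional as F

classes = 4
# torch.tensor on Python ints yields int64, which one_hot expects
raw_target = torch.tensor([1, 0, 3, 2, 0])

# one_hot returns an int64 tensor; cast it to match the float64 matrix above
one_hot = F.one_hot(raw_target, num_classes=classes).to(torch.float64)
print(one_hot)
```

Passing num_classes explicitly keeps the width fixed at 4 even when a batch happens not to contain the highest class.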

Answer 2

You should indeed expect pandas code that works on large data to be faster than iterating over it and zipping in Python. One reason is that pandas/numpy can operate directly on the underlying contiguous data, whereas the for loop pays an overhead for creating Python objects for each element. You are not seeing that in your profiling because your example data is too small, so the measurements mostly reflect the setup code.

When doing time profiling you need to take care that you are measuring exactly what you are interested in, and that your measurements are repeatable (not drowned in noise).

Here you have very little data (only a 5x4 matrix), whereas your actual data is probably much larger.

A couple of tips:

Don't measure setup code (like the creation of the pandas object, which is probably only done once).
Measure with IPython's %timeit to get statistics over many runs rather than a single noisy measurement.
Measure with data large enough to see a difference in what you are measuring. There is no need to optimize operations on a 5x4 matrix.
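The tips above can be sketched outside IPython with the standard-library timeit module. This is only an illustration under assumed sizes: the batch size of 10,000, the random targets, and the helper names fill_loop and fill_indexed are made up for the example, and the output arrays are allocated once up front so that setup is not timed:

```python
import timeit
import numpy as np

# Hypothetical sizes, chosen only so the difference becomes visible
batch_size, classes = 10_000, 4
rng = np.random.default_rng(0)
target = rng.integers(0, classes, size=batch_size)
rows = np.arange(batch_size)

# Setup done once, outside the timed region
out_loop = np.zeros((batch_size, classes))
out_indexed = np.zeros((batch_size, classes))

def fill_loop(out, target):
    # Python-level loop: one iteration per row
    for row, t in zip(out, target):
        row[t] = 1

def fill_indexed(out, rows, target):
    # Single vectorized assignment via integer array indexing
    out[rows, target] = 1

# Repeat the measurement and keep the best run to reduce noise
loop_time = min(timeit.repeat(lambda: fill_loop(out_loop, target), number=5, repeat=3))
indexed_time = min(timeit.repeat(lambda: fill_indexed(out_indexed, rows, target), number=5, repeat=3))

print("Loop time    ", loop_time)
print("Indexed time ", indexed_time)
```

The actual numbers depend on the machine and library versions, so none are quoted here; the point is that both variants are timed on the same preallocated data and over several repetitions.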

As for a practical solution to your problem, pandas uses numpy underneath to represent the data anyway, so you can skip pandas and go directly to numpy:

zeros[rows, target] = 1
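For completeness, here is that line in a self-contained form on the data from the question. It is a small sketch rather than a drop-in replacement for the real code; the np.eye variant at the end is a commonly used alternative that is not part of the original answer:

```python
import numpy as np

batch_size = 5
classes = 4
target = np.array([1, 0, 3, 2, 0])
rows = np.arange(batch_size)

zeros = np.zeros((batch_size, classes), dtype=np.float64)
# Pairs rows[i] with target[i], so exactly one entry per row is set
zeros[rows, target] = 1

print(zeros)
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 1. 0.]
#  [1. 0. 0. 0.]]

# Alternative: select rows of the identity matrix by class index
one_hot = np.eye(classes)[target]
```

Both forms produce the matrix shown in the question.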

2 answers

solution1
2 2018-06-11 10:22:02

solution2
1 2018-06-11 10:21:33