Testing performance using %%timeit: is loop speed or total time elapsed more important?

I'm iterating over CSV files with 10k-100k rows millions of times, so performance is of crucial importance. I'm going line by line, optimizing each one to squeeze out every fraction of a second. The operation here is simple: just cloning one DataFrame series. I'm currently confused by these initial results (each %%timeit block below was run in its own cell):

    import numba

    @numba.jit()
    def test(x):
        return x

    %%timeit
    df['source'] = df['Close']
    # 79.5 µs ± 3.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    %%timeit
    df['source2'] = test(df['Close'].to_numpy())
    # 88.1 µs ± 683 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The overall elapsed time is shorter for the first, but the per-loop time is faster for the second. If the per-loop time is faster, I would expect it to be faster overall as well.

Does this mean that much more time is being spent in the back end? Can someone explain this to me?

Should I give more weight to the total elapsed time or to the per-loop time?

Note: I'm using Jupyter Notebook on Anaconda.
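
A minimal sketch of how both views of the numbers can be captured programmatically, assuming IPython's -o flag; the timed expressions are only stand-ins for the two cells above, and res1/res2 are illustrative names:

    # Capture the TimeitResult objects so per-loop and total times can be compared directly.
    res1 = %timeit -o df['Close'].copy()
    res2 = %timeit -o test(df['Close'].to_numpy())

    # average and stdev are per-loop times; all_runs holds each run's total wall time.
    print(res1.average, sum(res1.all_runs))
    print(res2.average, sum(res2.all_runs))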

I can think of only one thing that differs between total time and per-iteration time: garbage collection. Python's garbage collector periodically checks which objects are no longer in use and frees their memory. If you run a single loop, you will probably never see it at work, but if the code runs for a long time, it is most likely triggered at some point and starts freeing memory, which takes time. So the more memory is allocated during the loop, the more time the garbage collector needs to free it.
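
If garbage collection is the suspect, one way to test it is to take manual control of the collector around the hot loop. A minimal sketch; the loop body here is only a placeholder for your per-file work:

    import gc

    gc.disable()                      # stop automatic collection during the hot loop
    try:
        for _ in range(1_000_000):    # stand-in for the big loop over CSV files
            pass                      # per-file work goes here
    finally:
        gc.enable()
        gc.collect()                  # pay the collection cost once, at a point you choose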

This line of thinking leads to one possible improvement to your code:
Be mindful of how much memory you allocate.

df['source'] = df['Close'] may copy your data from one column to another. Try to reuse data as much as possible. For example, do col1 = df['Close'] and then use col1 for further manipulations, as in the sketch below. This way the data is not copied, which is faster (assuming you do not actually need the source column and it is only used temporarily).
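
A minimal sketch of the idea, using a toy DataFrame in place of your loaded CSV and a hypothetical follow-up calculation:

    import pandas as pd

    df = pd.DataFrame({'Close': [1.0, 2.0, 3.0]})   # toy stand-in for one loaded CSV

    # Bind the column once and work with that reference; nothing is written
    # back into the DataFrame, so no extra column is created.
    close = df['Close']
    shifted = close - close.iloc[0]                 # hypothetical follow-up calculation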

There is another way to speed things up by avoiding constant memory allocation and deallocation. With Numba, it is faster to iterate over all rows and do the calculations in a single pass than to sweep over the same data several times with vectorized NumPy/Pandas formulas. Not only do you save on iterating over the data multiple times, you can also use stack variables instead of the heap (terminology that only applies to compiled code; plain Python stores everything on the heap). By keeping intermediate values in stack variables inside a Numba function, you essentially stop allocating and deallocating memory constantly; see the sketch below.
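
A minimal sketch of the single-pass idea; span is a hypothetical example calculation, not something taken from your code:

    import numba
    import numpy as np

    @numba.njit
    def span(close):
        # One pass over the array: min and max are tracked in plain scalars
        # (stack variables in the compiled code) instead of two separate
        # vectorized passes such as close.max() - close.min().
        lo = close[0]
        hi = close[0]
        for x in close:
            if x < lo:
                lo = x
            if x > hi:
                hi = x
        return hi - lo

    prices = np.array([3.0, 1.0, 4.0, 1.5])   # toy stand-in for df['Close'].to_numpy()
    print(span(prices))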

Another option is to preallocate variables before the big loop and reuse them on every iteration, as in the sketch below. But this only helps if the size stays stable across iterations (meaning the number of rows in each CSV file, in your case).
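
A minimal sketch of preallocation, assuming a fixed row count per file; the random array is only a stand-in for one loaded Close column:

    import numpy as np

    n_rows = 100_000                          # assumed fixed row count per CSV file
    out = np.empty(n_rows)                    # buffer allocated once, before the big loop

    for _ in range(1000):                     # stand-in for the loop over CSV files
        close = np.random.rand(n_rows)        # stand-in for one loaded 'Close' column
        np.multiply(close, 2.0, out=out)      # result written into the reused buffer
        # ... use out here before the next iteration overwrites it ...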
