
Testing performance using %%timeit: is loop speed or total time elapsed more important?

I'm iterating over CSV files with 10k-100k rows millions of times, so performance is of crucial importance. I'm going line by line, optimizing every statement to squeeze out every fraction of a second. This one is simple: just cloning one DataFrame series. I'm currently confused by these initial results.

    import numba

    @numba.jit()
    def test(x):
        # identity function, just to measure the overhead of going through Numba
        return x

    #%%timeit
    df['source'] = df['Close']
    #79.5 µs ± 3.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    #%%timeit
    df['source2'] = test(df['Close'].to_numpy())
    #88.1 µs ± 683 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The overall elapsed time is faster for the first, but the per-loop time is faster for the second. If the per-loop time is faster, I would expect it to be faster overall.

Does this mean that much more time is being spent in the back end? Can someone explain this to me?

Should I give more importance to the total elapsed time or to the per-loop time?

Note: I'm using Jupyter Notebook on Anaconda.

I can think of only one thing that differs between total time and per-iteration time: garbage collection. Python's garbage collector periodically checks which objects are no longer used and frees their memory. If you run a single loop you will probably never see it at work. But if the code runs for a long time, the collector is most likely triggered and starts freeing memory, which takes time. Therefore, the more memory that was allocated during the loop, the more time the garbage collector needs to free it.
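If you want to check how much garbage collection contributes, a minimal sketch using the timeit module (which disables the collector by default) is shown below; the DataFrame here is a throwaway stand-in for your real data:

    import timeit

    setup = ("import pandas as pd, numpy as np; "
             "df = pd.DataFrame({'Close': np.random.rand(10_000)})")
    stmt = "df['source'] = df['Close']"

    # timeit turns the garbage collector off while timing by default;
    # re-enabling it in setup includes GC pauses in the measurement.
    no_gc = timeit.timeit(stmt, setup=setup, number=10_000)
    with_gc = timeit.timeit(stmt, setup="import gc; gc.enable(); " + setup, number=10_000)

    print(f"GC disabled (timeit default): {no_gc:.3f} s")
    print(f"GC enabled:                   {with_gc:.3f} s")

If the two numbers are close, garbage collection is not what is eating your wall-clock time and you should look elsewhere.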

This line of thinking leads to one possible improvement to your code: be mindful of how much memory you allocate.

df['source'] = df['Close'] could mean that you copy your data from one column to another. Try to reuse data as much as possible. For example, do col1 = df['Close'] and then use col1 for further manipulations. This way the data is not copied and it is faster (assuming you do not actually need the source column and it is only used temporarily), as sketched below.
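A minimal sketch of that idea; the follow-up calculation is just a placeholder:

    close = df['Close']          # reference to the existing Series, no new column is created
    result = close * 1.01        # placeholder follow-up calculation working on the Series directly

    # versus writing it back into the DataFrame first:
    # df['source'] = df['Close']     # materialises another column that has to be managed
    # result = df['source'] * 1.01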

There is also another possibility to speed things up by not allocating and deallocating memory. When using Numba, it is faster to iterate over all rows and do the calculations in one pass, instead of iterating over the same data multiple times as you do with vectorised NumPy/Pandas formulas. You not only save on iterating over the data several times, but can even use stack variables instead of the heap (terminology only relevant to compiled code; Python itself stores everything on the heap). By using stack variables in Numba you essentially stop allocating and deallocating memory constantly. A sketch of this style follows below.
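A sketch of the single-pass style, assuming a simple difference calculation as a stand-in for your real formula (scaled_diff is an illustrative name, not something from the question):

    import numba
    import numpy as np

    @numba.njit
    def scaled_diff(close, factor):
        # One pass over the rows; the scalars here live on the stack in the
        # compiled code, so no intermediate arrays are allocated per step.
        out = np.empty(close.shape[0])
        out[0] = 0.0
        for i in range(1, close.shape[0]):
            out[i] = (close[i] - close[i - 1]) * factor
        return out

    # df['diff'] = scaled_diff(df['Close'].to_numpy(), 2.0)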

Another option is to preallocate variables before the big loop and reuse them in every iteration of the loop. But this only helps if the variable size is stable across iterations (meaning, in your case, the number of rows in each CSV file); see the sketch below.
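A sketch of the preallocation idea, assuming every CSV has the same number of rows; csv_paths, load_close_column and consume are placeholder names, not real APIs:

    import numpy as np

    N_ROWS = 100_000                        # assumed fixed row count per file
    buffer = np.empty(N_ROWS)               # allocated once, before the big loop

    for path in csv_paths:                  # placeholder iterable of file paths
        close = load_close_column(path)     # placeholder loader returning an ndarray of N_ROWS values
        np.multiply(close, 2.0, out=buffer) # results written into the reused buffer, no new array per file
        consume(buffer)                     # placeholder downstream use of the results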
