
Best practices to benchmark deep models on CPU (and not GPU) in PyTorch?

I am a little uncertain about how to measure the execution time of deep models on CPU in PyTorch, ONLY FOR INFERENCE. I list some approaches here, but they may be inaccurate; please correct them and mention any others if required. I am running PyTorch version 1.3.1 on an Intel Xeon with 64GB RAM, a 3.5GHz processor and 8 cores.

  1. Should we use time.time()?

    • I know that for GPU this is a very bad idea. For GPU I do as follows:
with torch.no_grad():
    wTime = 0
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    out = model(input)  # JUST FOR WARMUP

    start.record()
    for i in range(200):
        input = torch.rand(1, 1, 2048, 2048).to(device)

#        beg = time.time()  # DO NOT USE FOR GPU

        out = model(input)

#        wTime += time.time() - beg  # DO NOT USE FOR GPU
    end.record()
    torch.cuda.synchronize()

    print('execution time in MILLISECONDS: {}'.format(start.elapsed_time(end) / 200))

This code was executed on GPU. If I have to run it on CPU, what changes should be made? Will time.time() do?

  2. Should we use volatile?

    • I think the use of volatile is discouraged after v0.3. But will it still help if I use eval mode and no_grad()?

input = Variable(torch.randn(1, 3, 227, 227), volatile=True)
model(input)
  3. Should the page cache be cleared?

    • One way of doing this that I know of is using sudo sh -c "/bin/echo 1 > /proc/sys/vm/drop_caches"

  4. Should I remove nn.Sequential() and directly put the modules in the forward part?

All the methods using copy_ take some time to execute, especially on CPU this might be slow. Also the nn.Sequential() modules are slower than just executing them in the forward pass. I think this is due to some overhead that needs to be created when executing the Sequential module.

Another thing which I do not understand from the same link is:

If you are running into performance issues with these small numbers, you might try to use torch.set_flush_denormal(True) to disable denormal floating point numbers on the CPU.

  5. Should torch.set_num_threads(int) be used? If yes, can a demo code be provided?

  6. What does "These context managers are thread local, so they won't work if you send work to another thread using the :module:`threading` module, etc." mean, as given in the documentation?

Please list any more issues for calculating execution time on CPU. Thank you.

  1. Should we use time.time()?

Yes, it's fine for CPU.
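
For example, a minimal CPU timing sketch (the stand-in model, input shape and iteration count are assumptions; time.perf_counter() is used here since it has a higher resolution than time.time()):

import time

import torch

model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)  # hypothetical stand-in model
model.eval()

with torch.no_grad():
    inp = torch.rand(1, 1, 2048, 2048)
    model(inp)  # warm-up pass

    start = time.perf_counter()
    for _ in range(200):
        model(inp)
    elapsed = time.perf_counter() - start

print('average execution time in MILLISECONDS: {}'.format(elapsed * 1000 / 200))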

  2. Should we use volatile?

As you said, it's deprecated. Since 0.4.0, torch.Tensor was merged with torch.Variable (which is deprecated as well) and the torch.no_grad context manager should be used instead.
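
A minimal sketch of the replacement (the model here is a hypothetical stand-in):

import torch

model = torch.nn.Conv2d(3, 8, kernel_size=3)  # hypothetical stand-in model
model.eval()  # put layers such as dropout/batchnorm into inference mode

with torch.no_grad():  # modern replacement for volatile=True
    input = torch.randn(1, 3, 227, 227)
    output = model(input)

print(output.requires_grad)  # False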

  3. Should the page cache be cleared?

I don't think so, unless you know it's a problem.

  4. Should I remove nn.Sequential() and directly put the modules in the forward part?

No, torch.nn.Sequential should put no or negligible performance burden on your model. Its forward is only:

def forward(self, input):
    for module in self:
        input = module(input)
    return input

If you are running into performance issues with these small numbers, you might try to use torch.set_flush_denormal(True) to disable denormal floating point numbers on the CPU.

Flushing denormal numbers (numbers which underflow) means replacing them strictly with 0.0, which might help your performance if you have a lot of really small numbers. Example given by the PyTorch docs:

>>> torch.set_flush_denormal(True)
True
>>> torch.tensor([1e-323], dtype=torch.float64)
tensor([ 0.], dtype=torch.float64)
>>> torch.set_flush_denormal(False)
True
>>> torch.tensor([1e-323], dtype=torch.float64)
tensor(9.88131e-324 *
       [ 1.0000], dtype=torch.float64)

  5. Should torch.set_num_threads(int) be used? If yes, can a demo code be provided?

According to this document, it might help if you don't allocate too many threads (probably at most as many as the cores in your CPU, so you might try 8).

So this piece at the beginning of your code might help:

torch.set_num_threads(8)

You may want to check the numbers and see whether and how much each value helps.
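
A small demo sketch (with a hypothetical stand-in model) that times the same inference loop under several thread counts so you can pick the best value empirically. Note that on some builds, changing the thread count after parallel work has already run may not take effect, in which case set it once per process instead:

import time

import torch

model = torch.nn.Conv2d(1, 16, kernel_size=3)  # hypothetical stand-in model
model.eval()
inp = torch.rand(1, 1, 512, 512)

with torch.no_grad():
    for num_threads in (1, 2, 4, 8):
        torch.set_num_threads(num_threads)
        model(inp)  # warm-up at this thread count
        start = time.perf_counter()
        for _ in range(50):
            model(inp)
        elapsed = time.perf_counter() - start
        print('{} threads: {:.2f} ms per pass'.format(num_threads, elapsed * 1000 / 50))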

  6. What does "These context managers are thread local, so they won't work if you send work to another thread using the :module:`threading` module, etc." mean, as given in the documentation?

If you use a module like torch.multiprocessing and run torch.multiprocessing.spawn (or similar) and one of your processes doesn't enter the context-manager block, gradients won't be turned off in it (in the case of torch.no_grad). Also, if you use Python's threading, only the threads in which the block was entered will have gradients turned off (or on, depending on the context manager).

This code will make it clear for you:

import threading

import torch


def myfunc(i, tensor):
    if i % 2 == 0:
        with torch.no_grad():  # gradient tracking disabled only inside this thread
            z = tensor * 2
    else:
        z = tensor * 2  # default behaviour: __main__'s no_grad does not reach here
    print(i, z.requires_grad)


if __name__ == "__main__":
    tensor = torch.randn(5, requires_grad=True)
    with torch.no_grad():
        for i in range(10):
            t = threading.Thread(target=myfunc, args=(i, tensor))
            t.start()

Which outputs (order may vary):

0 False
1 True
2 False
3 True
4 False
6 False
5 True
7 True
8 False
9 True

Also notice that torch.no_grad() in __main__ has no effect on the spawned threads (and neither would torch.enable_grad).

Please list any more issues for calculating execution time on CPU.

Converting your model to torchscript (see here) might help, as might building PyTorch from source targeted at your architecture and its capabilities, plus tons of other things; this question is too broad.
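
For example, a minimal TorchScript tracing sketch (the stand-in model and input shape are assumptions; whether tracing actually speeds up inference depends on the model):

import torch

model = torch.nn.Conv2d(1, 8, kernel_size=3)  # hypothetical stand-in model
model.eval()

example = torch.rand(1, 1, 224, 224)
with torch.no_grad():
    traced = torch.jit.trace(model, example)  # records the ops as a static graph
    out = traced(torch.rand(1, 1, 224, 224))  # used like a regular module

traced.save('model_traced.pt')  # reload later with torch.jit.load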
