Best practices to benchmark deep models on CPU (and not GPU) in PyTorch?
I am a little uncertain about how to measure the execution time of deep models on CPU in PyTorch, FOR INFERENCE ONLY. I list some approaches here, but they may be inaccurate. Please correct them if required, and mention more if needed. I am running PyTorch version 1.3.1 on an Intel Xeon with 64GB RAM, a 3.5GHz processor, and 8 cores.
Should we use time.time()?
with torch.no_grad():
    wTime = 0
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    out = model(input)  # JUST FOR WARMUP
    start.record()
    for i in range(200):
        input = torch.rand(1, 1, 2048, 2048).to(device)
        # beg = time.time()  DO NOT USE FOR GPU
        got = net_amplifier(low, for_amplifier)
        # wTime += time.time() - beg  DO NOT USE FOR GPU
    end.record()
    torch.cuda.synchronize()
print('execution time in MILLISECONDS: {}'.format(start.elapsed_time(end) / 200))
This code was executed on the GPU. If I have to run it on the CPU, what changes should be made? Will time.time() do?
Should we use volatile?

input = Variable(torch.randn(1, 3, 227, 227), volatile=True)
model(input)
Should the page cache be cleared?

sudo sh -c "/bin/echo 1 > /proc/sys/vm/drop_caches"
Should I remove nn.Sequential() and directly put the modules in the forward part?
All the methods using copy_ take some time to execute; on CPU especially, this might be slow. Also, nn.Sequential() modules are slower than just executing the same modules in the forward pass. I think this is due to some overhead that is incurred when executing the Sequential module.
Another thing which I do not understand at the same link is:

If you are running into performance issues with these small numbers, you might try to use torch.set_flush_denormal(True) to disable denormal floating point numbers on the CPU.
Should torch.set_num_threads(int) be used? If yes, can a demo code be provided?
What does "These context managers are thread local, so they won't work if you send work to another thread using the :module:`threading` module, etc." mean, as given in the documentation?
Please list any more issues for calculating execution time on CPU. Thank you.
- Should we use time.time()?

Yes, it's fine for CPU.
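A minimal sketch of how such a CPU timing loop could look. The tiny Sequential model and the `time_cpu_inference` helper are illustrative only, not from the question; `time.perf_counter()` is used instead of `time.time()` since it is monotonic and intended for measuring intervals:

```python
import time
import torch

# A tiny stand-in model, just for illustration -- substitute your own.
model = torch.nn.Sequential(torch.nn.Conv2d(1, 4, 3), torch.nn.ReLU())
model.eval()

def time_cpu_inference(model, input_shape, n_runs=20, n_warmup=3):
    """Return the average per-run inference time in milliseconds on CPU."""
    with torch.no_grad():
        x = torch.rand(*input_shape)
        for _ in range(n_warmup):          # warm-up runs, excluded from timing
            model(x)
        start = time.perf_counter()        # monotonic clock for intervals
        for _ in range(n_runs):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / n_runs * 1000.0

print(time_cpu_inference(model, (1, 1, 64, 64)))
```

Unlike the GPU case, no synchronization calls are needed: CPU ops run synchronously, so wall-clock timing around the loop is enough.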
- Should we use volatile?

As you said, it's deprecated. Since 0.4.0, torch.Tensor was merged with torch.Variable (which is deprecated as well), and the torch.no_grad context manager should be used instead.
- Should the page cache be cleared?

I don't think so, unless you know it's a problem.
- Should I remove nn.Sequential() and directly put the modules in the forward part?

No, torch.nn.Sequential should impose no or negligible performance burden on your model. Its forward is only:
def forward(self, input):
    for module in self:
        input = module(input)
    return input
If you are running into performance issues with these small numbers, you might try to use torch.set_flush_denormal(True) to disable denormal floating point numbers on the CPU.

Flushing denormal numbers (numbers which underflow) means replacing them strictly by 0.0, which might help your performance if you have a lot of really small numbers. Example given by the PyTorch docs:
>>> torch.set_flush_denormal(True)
True
>>> torch.tensor([1e-323], dtype=torch.float64)
tensor([ 0.], dtype=torch.float64)
>>> torch.set_flush_denormal(False)
True
>>> torch.tensor([1e-323], dtype=torch.float64)
tensor(9.88131e-324 *
[ 1.0000], dtype=torch.float64)
- Should torch.set_num_threads(int) be used? If yes, can a demo code be provided?
According to this document, it might help if you don't allocate too many threads (probably at most as many as there are cores in your CPU, so you might try 8). So this line at the beginning of your code might help:

torch.set_num_threads(8)

You may want to try different numbers and see whether and how much each value helps.
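A small sketch of how you might compare thread counts; the matrix sizes and the `bench` helper are illustrative choices, not anything prescribed by PyTorch:

```python
import time
import torch

# Illustrative workload; sizes are arbitrary.
a = torch.rand(512, 512)
b = torch.rand(512, 512)

def bench(n_threads, n_runs=10):
    """Average seconds per matmul with the given intra-op thread count."""
    torch.set_num_threads(n_threads)
    start = time.perf_counter()
    for _ in range(n_runs):
        torch.mm(a, b)
    return (time.perf_counter() - start) / n_runs

for n in (1, 2, 4, 8):
    print(n, "threads:", bench(n))
```

On an 8-core machine the sweet spot is often at or below the physical core count; past that, extra threads tend to add contention rather than speed.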
- What does "These context managers are thread local, so they won't work if you send work to another thread using the :module:`threading` module, etc." mean, as given in the documentation?
If you use a module like torch.multiprocessing and run torch.multiprocessing.spawn (or the like), and one of your processes does not enter the context manager block, the gradient won't be turned off (in the case of torch.no_grad). Also, if you use Python's threading, only the threads where the block was entered will have gradients turned off (or on, depending on the manager).
This code will make it clear for you:

import threading
import torch

def myfunc(i, tensor):
    if i % 2 == 0:
        with torch.no_grad():
            z = tensor * 2
    else:
        z = tensor * 2
    print(i, z.requires_grad)

if __name__ == "__main__":
    tensor = torch.randn(5, requires_grad=True)
    with torch.no_grad():
        for i in range(10):
            t = threading.Thread(target=myfunc, args=(i, tensor))
            t.start()
Which outputs (order may vary):
0 False
1 True
2 False
3 True
4 False
6 False
5 True
7 True
8 False
9 True
Also notice that torch.no_grad() in __main__ has no effect on the spawned threads (and neither would torch.enable_grad).
- Please list any more issues for calculating execution time on CPU.
Converting to torchscript (see here) might help, as might building PyTorch from source targeted at your architecture and its capabilities, and tons of other things; this question is too broad.
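As a rough sketch of the torchscript route: the small Linear model below is hypothetical, and the claim is only the general one that a traced module can skip some Python-side nn.Module dispatch overhead, which may or may not matter for your workload:

```python
import torch

# A hypothetical small model -- substitute your own.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
model.eval()

example = torch.rand(1, 16)
# torch.jit.trace records the ops executed on the example input and
# produces a ScriptModule runnable with less Python overhead.
traced = torch.jit.trace(model, example)

with torch.no_grad():
    out = traced(torch.rand(1, 16))
print(out.shape)
```

Note that tracing only records the path taken for the example input, so models with data-dependent control flow need torch.jit.script instead.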