
Tensorflow first epoch is extremely slow (maybe related to pool_allocator)

I am training a model built with TF. During the first epoch, TF is slower than the following epochs by a factor of ~100, and I am seeing messages like:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 958 to 1053

As suggested here, I tried to use tcmalloc by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", but it didn't help.
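For reference, a minimal sketch of how LD_PRELOAD is typically applied to a single training run (the library path is the one from the question; adjust it to wherever libtcmalloc.so is installed on your system, and `train.py` is a hypothetical training script):

```shell
# Preload tcmalloc for this process only, without exporting it shell-wide.
# /usr/lib/libtcmalloc.so is the path from the question; train.py is a
# placeholder for your own training entry point.
LD_PRELOAD="/usr/lib/libtcmalloc.so" python train.py

# To confirm the variable actually reaches the child process:
LD_PRELOAD="/usr/lib/libtcmalloc.so" sh -c 'echo "$LD_PRELOAD"'
```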

Any idea on how to make the first epoch run faster?

It seems that it is a hardware issue. During the first epoch, TF (like other DL libraries, e.g. PyTorch, as discussed here) caches information about the data, as discussed here by @ppwwyyxx:

If each input has a different size, TF can spend a large amount of time running cuDNN benchmarks for each size and storing the results in its cache.
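One common workaround implied by the quote above is to make every input the same size, so cuDNN's per-shape autotuning runs only once instead of once per distinct input shape. A minimal NumPy-only sketch (the `pad_to_fixed` helper and the 64x64 target shape are illustrative assumptions, not part of any TF API):

```python
import numpy as np

def pad_to_fixed(batch, target_h, target_w):
    """Zero-pad variable-sized HWC images to one fixed shape.

    With a single fixed input shape, cuDNN benchmarks the convolution
    algorithms once and reuses the cached choice for every batch,
    instead of re-benchmarking for each new input size.
    (Hypothetical helper for illustration.)
    """
    channels = batch[0].shape[-1]
    out = np.zeros((len(batch), target_h, target_w, channels),
                   dtype=batch[0].dtype)
    for i, img in enumerate(batch):
        h, w = img.shape[:2]
        out[i, :h, :w, :] = img  # place each image in the top-left corner
    return out

# Example: two images of different sizes padded to a common 64x64 shape.
imgs = [np.ones((32, 48, 3), np.float32), np.ones((50, 20, 3), np.float32)]
fixed = pad_to_fixed(imgs, 64, 64)
print(fixed.shape)  # (2, 64, 64, 3)
```

The same idea applies to bucketing sequences by length: the fewer distinct shapes the model sees, the less time the first epoch spends benchmarking.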
