

How to train a TF model that is larger than GPU memory?

I want to train a large object detection model using TF2, preferably the EfficientDet D7 network. With my Tesla P100 card, which has 16 GB of memory, I am running into an "out of memory" exception, i.e. not enough memory can be allocated on the graphics card.

So I am wondering what my options are in this case. Is it correct that if I had multiple GPUs, the TF model would be split so that it fills the memory of both cards? So in my case, with a second Tesla card providing another 16 GB, I would have 32 GB in total during training? If that is the case, would the same hold for a cloud provider where I could utilize multiple GPUs?

Moreover, if I am wrong and splitting a model across multiple GPUs during training does not work, what other approach would let me train a large network that does not fit into my GPU memory?

PS: I know that I could reduce the batch_size to 1, but unfortunately that still does not solve my issue for the really large models...

You can use multiple GPUs in GCP (Google Cloud Platform) at least; I'm not too sure about other cloud providers. And yes, once you do that, you can train with a larger batch size (the exact number would depend on the GPU, its memory, and how many GPUs you have running in your VM).
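As a rough sketch of what that multi-GPU setup looks like in TF2, the snippet below uses tf.distribute.MirroredStrategy to replicate the model on every visible GPU and split each batch between them (data parallelism). The small stand-in model and the `train_dataset` placeholder are assumptions for illustration, not part of the original question or answer.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and splits
# each global batch across them -- so two 16 GB cards let you use roughly
# twice the global batch size, not a model twice as large.
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

GLOBAL_BATCH_SIZE = 8 * strategy.num_replicas_in_sync  # placeholder value

with strategy.scope():
    # Any Keras model works here; a tiny CNN stands in for the detector.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(512, 512, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# `train_dataset` is assumed to be a tf.data.Dataset of (image, label) pairs
# batched with GLOBAL_BATCH_SIZE; model.fit handles the per-replica split.
# model.fit(train_dataset, epochs=10)
```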

You can check this link for the list of all GPUs available in GCP.

If you're using the Object Detection API, you can check this post regarding training with multiple GPUs.

Alternatively, if you want to go with a single GPU, one clever trick would be to use gradient accumulation, where you can virtually increase your batch size without using much extra GPU memory; this is discussed in this post.
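A minimal sketch of gradient accumulation in a TF2 custom training loop, assuming a generic Keras model, optimizer, loss function, and dataset (all placeholder names, not taken from the linked post): gradients from several small batches are summed, and the weights are only updated once per accumulation cycle, so the effective batch size is `accum_steps` times larger while peak memory stays close to that of a single small batch.

```python
import tensorflow as tf

def train_with_gradient_accumulation(model, optimizer, loss_fn, dataset, accum_steps=4):
    """Accumulate gradients over `accum_steps` small batches before applying them,
    mimicking a batch size `accum_steps` times larger without the extra memory."""
    # One zero-initialized accumulator per trainable variable.
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

    for step, (images, labels) in enumerate(dataset):
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)
            # Scale the loss so the summed gradient matches one large batch.
            loss = loss_fn(labels, predictions) / accum_steps

        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [acc + g for acc, g in zip(accum_grads, grads)]

        # Apply and reset once every `accum_steps` mini-batches.
        if (step + 1) % accum_steps == 0:
            optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
            accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
```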
