Tensorflow/WSL2 GPU out of memory, not using all available?

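I'm trying to fine-tune a medium-sized model on a TITAN RTX (24 GB) in WSL2, but it seems to run out of memory. The small model fits. If I boot the same computer from a live Ubuntu image, I can train both the medium and the large model without issue.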

2020-09-23 13:19:36.310992: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23b7a0000 next 260 of size 4194304
2020-09-23 13:19:36.310995: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23bba0000 next 266 of size 16777216
2020-09-23 13:19:36.310998: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23cba0000 next 268 of size 16777216
2020-09-23 13:19:36.311001: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23dba0000 next 270 of size 12582912
2020-09-23 13:19:36.311004: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23e7a0000 next 272 of size 4194304
2020-09-23 13:19:36.311006: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23eba0000 next 278 of size 16777216
2020-09-23 13:19:36.311009: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23fba0000 next 280 of size 16777216
2020-09-23 13:19:36.311012: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x240ba0000 next 282 of size 12582912
2020-09-23 13:19:36.311015: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2417a0000 next 284 of size 4194304
2020-09-23 13:19:36.311020: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x241ba0000 next 290 of size 16777216
2020-09-23 13:19:36.311023: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x242ba0000 next 18446744073709551615 of size 29360128
2020-09-23 13:19:36.311026: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 130543104
2020-09-23 13:19:36.311029: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2447a0000 next 294 of size 12582912
2020-09-23 13:19:36.311032: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2453a0000 next 296 of size 4194304
2020-09-23 13:19:36.311035: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2457a0000 next 302 of size 16777216
2020-09-23 13:19:36.311037: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2467a0000 next 304 of size 16777216
2020-09-23 13:19:36.311040: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2477a0000 next 306 of size 12582912
2020-09-23 13:19:36.311043: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2483a0000 next 308 of size 4194304
2020-09-23 13:19:36.311046: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2487a0000 next 314 of size 16777216
2020-09-23 13:19:36.311049: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2497a0000 next 316 of size 16777216
2020-09-23 13:19:36.311052: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24a7a0000 next 318 of size 12582912
2020-09-23 13:19:36.311055: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24b3a0000 next 320 of size 4194304
2020-09-23 13:19:36.311058: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x24b7a0000 next 18446744073709551615 of size 13102592
2020-09-23 13:19:36.311061: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size: 
2020-09-23 13:19:36.311065: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 98 Chunks of size 256 totalling 24.5KiB
2020-09-23 13:19:36.311069: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 113 Chunks of size 4096 totalling 452.0KiB
2020-09-23 13:19:36.311073: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 12288 totalling 228.0KiB
2020-09-23 13:19:36.311076: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 16384 totalling 288.0KiB
2020-09-23 13:19:36.311079: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 32256 totalling 31.5KiB
2020-09-23 13:19:36.311083: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 4194304 totalling 76.00MiB
2020-09-23 13:19:36.311086: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 12582912 totalling 216.00MiB
2020-09-23 13:19:36.311089: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 36 Chunks of size 16777216 totalling 576.00MiB
2020-09-23 13:19:36.311093: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29360128 totalling 28.00MiB
2020-09-23 13:19:36.311096: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 268435456 totalling 256.00MiB
2020-09-23 13:19:36.311099: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 1.13GiB
2020-09-23 13:19:36.311102: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 1222110720 memory_limit_: 68719476736 available bytes: 67497366016 curr_region_allocation_bytes_: 2147483648
2020-09-23 13:19:36.311108: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 68719476736
InUse:                  1209008128
MaxInUse:               1209008128
NumAllocs:                     762
MaxAllocSize:            268435456

Not sure what to do from here..
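
For reference, the byte counts in the Stats block are easier to compare against the card's 24 GB once converted to GiB. This is only a minimal sketch: `to_gib` is a hypothetical helper, and the two constants are copied from the `Limit` and `InUse` lines of the log above.

```python
GIB = 2 ** 30  # bytes per GiB


def to_gib(n_bytes):
    """Convert a byte count to GiB (hypothetical helper, not a TensorFlow API)."""
    return n_bytes / GIB


# Values copied from the "Stats:" block of the allocator dump.
limit_bytes = 68719476736   # Limit
in_use_bytes = 1209008128   # InUse

print(f"Limit: {to_gib(limit_bytes):.2f} GiB")   # Limit: 64.00 GiB
print(f"InUse: {to_gib(in_use_bytes):.2f} GiB")  # InUse: 1.13 GiB
```

The limit the allocator reports (64 GiB) is larger than the card's physical memory, while only about 1.13 GiB is in use when the dump is written. The log alone does not establish why the allocation fails at that point.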

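An OOM error can have several causes. Here are some common ones, with workarounds.

  • Make sure you are not running evaluation and training on the same GPU. Doing so can stall the process and lead to OOM errors. Try running evaluation on a different GPU.
  • Reduce the batch size. This slows training down, but it avoids the OOM error.
  • If your data is large and consists of images, try reducing the image size, or feed it through tf.data.Dataset to reduce memory consumption.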
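
The last two suggestions can be sketched together as below. This is an illustration under assumptions, not the asker's actual pipeline: `BATCH_SIZE`, `IMAGE_SIZE`, `preprocess`, and `make_dataset` are hypothetical names and values, and the input is assumed to be an in-memory array of images with integer labels.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 4            # hypothetical; lower it until a training step fits in memory
IMAGE_SIZE = (128, 128)   # hypothetical target height and width


def preprocess(image, label):
    """Shrink one image and scale its pixel values to [0, 1]."""
    image = tf.image.resize(image, IMAGE_SIZE)   # returns float32
    image = image / 255.0
    return image, label


def make_dataset(images, labels, batch_size=BATCH_SIZE):
    """Build a tf.data pipeline that resizes images and yields small batches."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.map(preprocess)      # applied per element, before batching
    ds = ds.batch(batch_size)    # smaller batches mean smaller activations per step
    return ds.prefetch(1)        # keep one batch ready while the previous one trains
```

The resulting dataset can be passed directly to `model.fit(ds, ...)`. For data that does not fit in host memory, the same `map`/`batch` steps would follow a source that reads from disk instead of `from_tensor_slices`.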
