
Successfully trained with Model.fit, but model.predict raises a resource error on Colab

I am using Google Colab to train a 3D autoencoder. I trained the model successfully with model.fit, and the model summary is as follows:

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 128, 128, 128, 1)  0         
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 64, 64, 64, 64)    1792      
_________________________________________________________________
conv3d_2 (Conv3D)            (None, 32, 32, 32, 128)   221312    
_________________________________________________________________
conv3d_3 (Conv3D)            (None, 16, 16, 16, 256)   884992    
_________________________________________________________________
conv3d_4 (Conv3D)            (None, 8, 8, 8, 256)      1769728   
_________________________________________________________________
conv3d_5 (Conv3D)            (None, 8, 8, 8, 256)      1769728   
_________________________________________________________________
up_sampling3d_1 (UpSampling3 (None, 16, 16, 16, 256)   0         
_________________________________________________________________
conv3d_6 (Conv3D)            (None, 16, 16, 16, 256)   1769728   
_________________________________________________________________
up_sampling3d_2 (UpSampling3 (None, 32, 32, 32, 256)   0         
_________________________________________________________________
conv3d_7 (Conv3D)            (None, 32, 32, 32, 128)   884864    
_________________________________________________________________
up_sampling3d_3 (UpSampling3 (None, 64, 64, 64, 128)   0         
_________________________________________________________________
conv3d_8 (Conv3D)            (None, 64, 64, 64, 64)    221248    
_________________________________________________________________
up_sampling3d_4 (UpSampling3 (None, 128, 128, 128, 64) 0         
_________________________________________________________________
conv3d_9 (Conv3D)            (None, 128, 128, 128, 1)  1729      
=================================================================
Total params: 7,525,121
Trainable params: 7,525,121
Non-trainable params: 0

Training succeeded, and I saved the model as model.h5.

I ran a separate cell in the same project to test the model with the following code:

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras import backend as K

K.clear_session()
model = load_model('model.h5')

# x_train is still defined from the earlier training cell; x_test is loaded here
x_test = np.load('test.npy')
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 128, 128, 128, 1))
x_test = np.reshape(x_test, (len(x_test), 128, 128, 128, 1))

decoded_imgs = model.predict(x_test)

It gave me the following error:

ResourceExhaustedError: OOM when allocating tensor with shape [25,128,128,64,64] on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[node model_1/up_sampling3d_4/concat_1 (defined at :47)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. [Op:__inference_predict_function_63155] Function call stack: predict_function

Why can I train the model on this system but not run model.predict? Does anyone have an answer, please :(

I am using Google Colab Pro with the following GPU specs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    48W / 250W |  15559MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

TensorFlow errors are a real pain to parse. I see them all the time and I don't know what this one means either. I have some suggestions to try. First, check whether it really is a resource error by predicting on just a single example:

x_test_single = np.reshape(x_test[0], (1, 128, 128, 128, 1))
model.predict(x_test_single)

Second, if you want to test on the whole set, you would normally use model.evaluate, so see if that works. Lastly, if you really are having resource allocation issues (i.e. you can't fit as much as you want into GPU memory, since your 3D sets are super memory hungry), then use the data API, make datasets, and feed them in batches, as sketched below.
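A rough sketch of that last suggestion, reusing model and x_test from the code above; the batch size of 2 is an assumption chosen to keep only a couple of 128×128×128 volumes on the GPU at once:

import tensorflow as tf

# Wrap the test array in a tf.data.Dataset and feed it in small batches,
# so only a few volumes are resident on the GPU at a time.
# A batch size of 2 is an assumption; tune it to your memory budget.
test_ds = tf.data.Dataset.from_tensor_slices(x_test).batch(2)

# For an autoencoder the target equals the input, so pair each batch with
# itself for evaluate; predict only needs the inputs.
loss = model.evaluate(test_ds.map(lambda x: (x, x)))
decoded_imgs = model.predict(test_ds)

If you don't want to build a dataset, model.predict(x_test, batch_size=2) achieves the same batching.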

One more suggestion: if you are going to save the model, I have found it works better to use the TensorFlow SavedModel format. https://www.tensorflow.org/tutorials/keras/save_and_load#savedmodel_format
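For what it's worth, a minimal sketch of saving and reloading in the SavedModel format; the directory name my_model is an assumption, and passing a path without a .h5 extension is what makes tf.keras (TF 2.x) pick SavedModel:

import tensorflow as tf

# Save as a SavedModel directory instead of a single HDF5 file.
model.save('my_model')

# Reload later; architecture, weights and optimizer state are restored.
model = tf.keras.models.load_model('my_model')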

