
Model trains successfully with model.fit but model.predict raises a resource error on Colab

I am using Google Colab to train a 3D autoencoder. I successfully trained the model with model.fit; here is the model.summary() output:

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 128, 128, 128, 1)  0         
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 64, 64, 64, 64)    1792      
_________________________________________________________________
conv3d_2 (Conv3D)            (None, 32, 32, 32, 128)   221312    
_________________________________________________________________
conv3d_3 (Conv3D)            (None, 16, 16, 16, 256)   884992    
_________________________________________________________________
conv3d_4 (Conv3D)            (None, 8, 8, 8, 256)      1769728   
_________________________________________________________________
conv3d_5 (Conv3D)            (None, 8, 8, 8, 256)      1769728   
_________________________________________________________________
up_sampling3d_1 (UpSampling3 (None, 16, 16, 16, 256)   0         
_________________________________________________________________
conv3d_6 (Conv3D)            (None, 16, 16, 16, 256)   1769728   
_________________________________________________________________
up_sampling3d_2 (UpSampling3 (None, 32, 32, 32, 256)   0         
_________________________________________________________________
conv3d_7 (Conv3D)            (None, 32, 32, 32, 128)   884864    
_________________________________________________________________
up_sampling3d_3 (UpSampling3 (None, 64, 64, 64, 128)   0         
_________________________________________________________________
conv3d_8 (Conv3D)            (None, 64, 64, 64, 64)    221248    
_________________________________________________________________
up_sampling3d_4 (UpSampling3 (None, 128, 128, 128, 64) 0         
_________________________________________________________________
conv3d_9 (Conv3D)            (None, 128, 128, 128, 1)  1729      
=================================================================
Total params: 7,525,121
Trainable params: 7,525,121
Non-trainable params: 0

The training completed successfully, and I saved the model as model.h5.

I then ran a separate cell within the same notebook to test the model with the following code:

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras import backend as K  # use the tf.keras backend, not standalone keras

K.clear_session()
model = load_model('model.h5')

x_test = np.load('test.npy')
x_test = x_test.astype('float32') / 255.
x_test = np.reshape(x_test, (len(x_test), 128, 128, 128, 1))
decoded_imgs = model.predict(x_test)

and it raises the following error:

ResourceExhaustedError: OOM when allocating tensor with shape[25,128,128,64,64] and type float
on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model_1/up_sampling3d_4/concat_1 (defined at :47) ]]
Hint: If you want to see a list of allocated tensors when OOM happens,
add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_predict_function_63155]
Function call stack: predict_function

Why am I able to train the model on this system but not able to run model.predict? Does anyone have an answer? :(

I am using Google Colab Pro with the following GPU specs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    48W / 250W |  15559MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

TensorFlow errors can be really painful to parse. I see them all the time, and even so this one took some puzzling over. I have a couple of suggestions on things to try. First, check whether it really is a resource error by feeding just one example to predict:

x_test_single = np.reshape(x_test[0], (1, 128, 128, 128, 1))
model.predict(x_test_single)
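To see why the shape in the OOM message matters: predict was likely given all 25 test samples in a single batch, and a single float32 activation of shape [25, 128, 128, 64, 64] is already huge. A back-of-the-envelope calculation (pure Python, no TensorFlow needed) illustrates the scale; the forward pass materializes several tensors of this order, which plausibly exceeds the ~16 GB on the P100:

```python
# Size of one float32 activation tensor of shape [25, 128, 128, 64, 64],
# the tensor named in the OOM message.
elements = 25 * 128 * 128 * 64 * 64
bytes_needed = elements * 4        # float32 = 4 bytes per element
gib = bytes_needed / 2**30
print(gib)                         # 6.25 GiB for this one tensor alone
```

Predicting one sample at a time shrinks that leading dimension from 25 to 1, which is why the single-example test above is a useful diagnostic.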

Secondly, if you want to test an entire set, you would typically use model.evaluate; see if that works. Lastly, if you really are having resource-allocation issues (i.e. you can't fit as much as you want in GPU memory, since 3D volumes are extremely memory hungry), then use the tf.data API to build a dataset and feed in batches.

Just one more suggestion: I find it better to use the TensorFlow SavedModel format when saving. https://www.tensorflow.org/tutorials/keras/save_and_load#savedmodel_format
