
Keras with tensorflow-gpu totally freezes PC

I have a fairly simple LSTM network. After one or two epochs my PC freezes completely; I can't even move the mouse:

Layer (type)                 Output Shape              Param #   
=================================================================
lstm_4 (LSTM)                (None, 128)               116224    
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 98)                12642     
=================================================================
Total params: 128,866
Trainable params: 128,866
Non-trainable params: 0

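    # The same problem occurs with 2 LSTM layers, dropout, and the Adam optimizer.
    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense
    from keras.optimizers import RMSprop

    SEQUENCE_LENGTH = 3  # chars is built elsewhere in the full code; len(chars) == 98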
    model = Sequential()
    model.add(LSTM(128, input_shape = (SEQUENCE_LENGTH, len(chars))))
    #model.add(Dropout(0.15))
    #model.add(LSTM(128))
    model.add(Dropout(0.10))
    model.add(Dense(len(chars), activation = 'softmax'))

    model.compile(loss = 'categorical_crossentropy', optimizer = RMSprop(lr=0.01), metrics=['accuracy'])

This is how I train:

history = model.fit(X, y, validation_split=0.20, batch_size=128, epochs=10, shuffle=True,verbose=2).history

The network needs about 5 minutes to finish one epoch. A larger batch size does not make the freeze occur sooner. A more complex model can train for longer, but it reaches almost the same accuracy, about 0.46 (full code here).

I am running the latest, fully updated Linux Mint on a GTX 1070 Ti with 8 GB of memory and 32 GB of RAM.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   35C    P8    10W / 180W |    303MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Libraries:

Keras==2.2.0
Keras-Applications==1.0.2
Keras-Preprocessing==1.0.1
keras-sequential-ascii==0.1.1
keras-tqdm==2.0.1
tensorboard==1.8.0
tensorflow==1.0.1
tensorflow-gpu==1.8.0

I have tried limiting GPU memory usage, but that cannot be the problem here, because training uses only about 1 GB of GPU memory:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
# Cap this process at 90% of GPU memory and allocate it on demand.
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))
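To double-check the 1 GB figure, and to watch the GPU temperature in the minutes before a freeze, nvidia-smi can be polled from a second terminal while training runs. This is only a minimal sketch: it assumes nvidia-smi is on the PATH and that every queried field comes back as a plain integer, and the helper names parse_gpu_stats and read_gpu_stats are hypothetical.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; units are MiB and degrees Celsius.
QUERY_FIELDS = ["memory.used", "memory.total", "temperature.gpu"]


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        used_mib, total_mib, temp_c = (int(field.strip()) for field in row)
        stats.append(
            {"used_mib": used_mib, "total_mib": total_mib, "temp_c": temp_c}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed readings."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_gpu_stats(output)
```

Calling read_gpu_stats() in a loop with a short sleep and appending each reading to a file leaves a record on disk that survives a hard freeze.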

What is wrong here? How can I fix the problem?

I had this exact problem. The computer died after about 15 minutes of training. I found that it was a memory SIMM card that died when it got warm / hot. If you have more than one SIMM card, you can take one out at a time and see if it is the culprit.

  • Please remove the CPU version, tensorflow==1.0.1, first. Then try installing tensorflow-gpu==1.8.0 by building TensorFlow from source, as mentioned here

or

  • Replace LSTM with CuDNNLSTM while training the model on the GPU. Later, load the trained weights into the same architecture with an LSTM layer to use the model on the CPU. (Make sure to use recurrent_activation='sigmoid' in the LSTM layer when re-loading CuDNNLSTM weights!)
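For the first suggestion, it can help to confirm whether both the CPU and GPU packages are really installed in the same environment. The following is a rough sketch, assuming Python 3.8 or newer for importlib.metadata (on older interpreters, pip list shows the same information); the function names are hypothetical.

```python
from importlib import metadata  # standard library from Python 3.8

TF_PACKAGES = ("tensorflow", "tensorflow-gpu")


def installed_tensorflow_packages():
    """Return {package name: version} for the TensorFlow packages found."""
    found = {}
    for dist in metadata.distributions():
        # Normalise the name so that "tensorflow_gpu" also matches.
        name = (dist.metadata.get("Name") or "").lower().replace("_", "-")
        if name in TF_PACKAGES:
            found[name] = dist.version
    return found


def has_conflict(found):
    """True when both the CPU and the GPU package are installed together."""
    return all(name in found for name in TF_PACKAGES)


if __name__ == "__main__":
    found = installed_tensorflow_packages()
    print(found)
    if has_conflict(found):
        print("Both packages are present; uninstall the CPU-only one.")
```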

This seems a bit strange to me, but the problem was related to my new AMD CPU, which was released in April 2018. Having an up-to-date Linux kernel was crucial: following this guide https://itsfoss.com/upgrade-linux-kernel-ubuntu/ I updated the kernel from 4.13 to 4.17, and now everything works.

UPD: The motherboard was crashing the system as well. I have replaced it, and now everything works well.
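To see whether a machine is still on a kernel older than the 4.17 mentioned above, the running release can be compared from Python. This is a small sketch; the 4.17 threshold is simply the version that worked in this case, not a documented minimum, and the function names are hypothetical.

```python
import platform
import re

# Kernel version that resolved the freezes in this answer (an assumption
# for other hardware, not an official requirement).
MIN_KERNEL = (4, 17)


def parse_kernel_release(release):
    """Extract (major, minor) from a string such as '4.13.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def kernel_is_at_least(release, minimum=MIN_KERNEL):
    """Compare the parsed (major, minor) pair against the minimum."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    status = "OK" if kernel_is_at_least(release) else "older than 4.17"
    print(release, status)
```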

