
Keras with tensorflow-gpu totally freezes PC

I have a fairly simple LSTM network. After 1-2 epochs my PC freezes completely; I can't even move the mouse:

Layer (type)                 Output Shape              Param #   
=================================================================
lstm_4 (LSTM)                (None, 128)               116224    
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 98)                12642     
=================================================================
Total params: 128,866
Trainable params: 128,866
Non-trainable params: 0

    # The same problem occurs with 2 LSTM layers, dropout, and the Adam optimizer
    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense
    from keras.optimizers import RMSprop

    SEQUENCE_LENGTH = 3
    # chars is the character vocabulary built earlier; len(chars) == 98
    model = Sequential()
    model.add(LSTM(128, input_shape=(SEQUENCE_LENGTH, len(chars))))
    #model.add(Dropout(0.15))
    #model.add(LSTM(128))
    model.add(Dropout(0.10))
    model.add(Dense(len(chars), activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01), metrics=['accuracy'])

This is how I train:

history = model.fit(X, y, validation_split=0.20, batch_size=128, epochs=10, shuffle=True,verbose=2).history

The network needs about 5 minutes to finish 1 epoch. A larger batch size does not make the freeze happen sooner. A more complex model can train for longer, but it reaches almost the same accuracy, about 0.46 (full code here).

I am running the latest Linux Mint on a machine with a GTX 1070 Ti (8 GB) and 32 GB of RAM.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   35C    P8    10W / 180W |    303MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Libraries:

Keras==2.2.0
Keras-Applications==1.0.2
Keras-Preprocessing==1.0.1
keras-sequential-ascii==0.1.1
keras-tqdm==2.0.1
tensorboard==1.8.0
tensorflow==1.0.1
tensorflow-gpu==1.8.0

I have tried limiting GPU memory usage, but that should not be the problem here, because training uses only about 1 GB of GPU memory:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
# Cap this process at 90% of GPU memory and allocate it on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

What is wrong here? How can I fix the problem?

I had this exact problem. The computer died after about 15 minutes of training. I found that it was a memory SIMM card that failed when it got warm. If you have more than one SIMM card, you can take them out one at a time and see which one is the culprit.

  • Please remove the CPU version, tensorflow==1.0.1, first. Then try installing tensorflow-gpu==1.8.0 by building TensorFlow from source, as mentioned here

or

  • Replace LSTM with CuDNNLSTM while training the model on the GPU. Later, load the trained model weights into the same model architecture built with an LSTM layer to use the model on the CPU. (Make sure to use recurrent_activation='sigmoid' in the LSTM layer when reloading CuDNNLSTM model weights!)
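
The second option could look roughly like the sketch below. It assumes Keras 2.2 with the TensorFlow backend, as listed in the question, and copies the layer sizes from the question's model. The helper name build_model and the file name weights.h5 are made up for illustration. Keras is imported inside the function so that the CuDNN layer is only touched when it is actually requested.

```python
def build_model(sequence_length, n_chars, use_cudnn):
    """Build the question's network with either a CuDNNLSTM or a plain LSTM layer.

    build_model is a hypothetical helper, not part of Keras.
    """
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    model = Sequential()
    if use_cudnn:
        # GPU-only layer; requires the TensorFlow backend and a CUDA device
        from keras.layers import CuDNNLSTM
        model.add(CuDNNLSTM(128, input_shape=(sequence_length, n_chars)))
    else:
        from keras.layers import LSTM
        # recurrent_activation must be 'sigmoid' to match CuDNNLSTM weights
        model.add(LSTM(128, recurrent_activation='sigmoid',
                       input_shape=(sequence_length, n_chars)))
    model.add(Dropout(0.10))
    model.add(Dense(n_chars, activation='softmax'))
    return model


# Usage sketch (X, y and the file name are placeholders):
#   gpu_model = build_model(3, 98, use_cudnn=True)
#   gpu_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
#   gpu_model.fit(X, y, batch_size=128, epochs=10)
#   gpu_model.save_weights('weights.h5')
#
#   cpu_model = build_model(3, 98, use_cudnn=False)
#   cpu_model.load_weights('weights.h5')
```

Both variants share the same Dropout and Dense layers, so only the recurrent layer differs between the GPU and CPU models.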

This is a bit strange to me, but the problem was related to my new AMD CPU, which was released in April 2018. Having an up-to-date Linux kernel was crucial: following this guide https://itsfoss.com/upgrade-linux-kernel-ubuntu/ I updated the kernel from 4.13 to 4.17, and now everything works.
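
To see which kernel a machine is running before and after such an upgrade, a small check like the following may help. It is only a sketch: the function names are made up, and the 4.17 threshold is just the version that worked in this case, not a documented minimum for any particular CPU.

```python
import platform
import re


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '4.13.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=(4, 17)):
    """True if the release is at or above the minimum (major, minor) version."""
    return parse_kernel_version(release) >= minimum


if __name__ == "__main__":
    release = platform.release()  # kernel release string on Linux
    try:
        status = "ok" if kernel_at_least(release) else "older than 4.17"
    except ValueError:
        status = "not a recognised kernel version"
    print(release, status)
```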

UPD: The motherboard was crashing the system as well. I have replaced it, and now everything works well.
