
How to allocate memory properly in Keras [Memory allocation error]

So I'm trying to solve Kaggle's melanoma competition, and when I try to run a simple Keras conv model I keep getting this error:

Resource exhausted: OOM when allocating tensor with shape[20,128,1022,1022] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Initially I tried to use 33k images and about 6 layers (I had no idea this kind of error even existed). Then I figured that, since all these images are 1024x1024, I would reduce the number of layers and internal units to ease the computation, but the problem persisted and I couldn't even get through the first epoch.

Then I decided to create a new directory with only 600 images for training and 200 for validation (how could the problem possibly persist??). Well, it did, and I realized the problem might be my machine's configuration. I'm on Ubuntu 20, and I checked that my GPU is actually being used; in fact, every time I run the code the terminal prints at startup:

Using TensorFlow backend.
2020-06-03 13:48:35.960461: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-03 13:48:35.994757: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-03 13:48:35.995140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.56GHz coreCount: 16 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 119.24GiB/s
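
For what it's worth, a quick way to confirm that TensorFlow sees the GPU, and to let it allocate GPU memory on demand instead of grabbing it all at once, is the tf.config API (a minimal sketch, assuming TF 2.x; note that memory growth only changes when memory is allocated, it cannot make an oversized tensor fit):

import tensorflow as tf

# List the GPUs TensorFlow can see (should show the GTX 1650)
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)

# Allocate GPU memory incrementally; must be set before any op runs
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)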

I don't know whether I've misconfigured something; I had no problems with simpler projects like Digit Recognizer and Cats vs Dogs using 10k+ images (those images were 28*28 and 256*256)...

So far my code looks like this (after changing the model to ease the computation):

from keras.layers import Input, Dense, Conv2D, Flatten, MaxPool2D
from keras import models

# Layers
input_layer = Input(shape=(1024, 1024, 3), dtype='float32')
conv1 = Conv2D(128, (3, 3), activation='relu')(input_layer)
maxpool1 = MaxPool2D((2, 2))(conv1)
conv2 = Conv2D(128, (3, 3), activation='relu', dtype='float32')(maxpool1)
# All layers below, until the next comment, are commented out to lighten the load
# maxpool2 = MaxPool2D((2, 2))(conv2)
# conv3 = Conv2D(128, (3, 3), activation='relu', dtype='float32')(maxpool2)
# maxpool3 = MaxPool2D((2, 2))(conv3)
# conv4 = Conv2D(256, (3, 3), activation='relu', dtype='float32')(maxpool3)
# maxpool4 = MaxPool2D((2, 2))(conv4)
# conv5 = Conv2D(256, (3, 3), activation='relu', dtype='float32')(maxpool4)
# maxpool5 = MaxPool2D((2, 2))(conv5)
# conv6 = Conv2D(256, (3, 3), activation='relu', dtype='float32')(maxpool5)
# The commented-out lines end here
flatten = Flatten()(conv2)
dense1 = Dense(64, activation='relu')(flatten)
output_layer = Dense(1, activation='sigmoid')(dense1)

# Generating model

model = models.Model(inputs = input_layer, outputs = output_layer)

from keras import optimizers

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])

from keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to 1024x1024
        target_size=(1024, 1024),
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        valid_dir,
        target_size=(1024, 1024),
        batch_size=20,
        class_mode='binary')


history = model.fit_generator(
      train_generator,
      steps_per_epoch=30,
      epochs=30,
      validation_data=validation_generator,
      validation_steps=10)

Any ideas or suggestions are welcome, and thank you very much for your time!

You can try using a TPU: https://www.kaggle.com/product-feedback/129828

import tensorflow as tf

# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# instantiate a distribution strategy
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

# instantiating the model in the strategy scope creates the model on the TPU
with tpu_strategy.scope():
    model = tf.keras.Sequential( … ) # define your model normally
    model.compile( … )

# train model normally
model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)
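
Note that on a Kaggle TPU the input pipeline generally has to be a tf.data.Dataset (typically read from GCS), not a Keras ImageDataGenerator, so the flow_from_directory generators above would need to be replaced.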

But Kaggle's 1024x1024 image size is too large anyway.
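
The arithmetic backs this up: the tensor in the error message, shape [20, 128, 1022, 1022] in float32, takes 20 × 128 × 1022 × 1022 × 4 bytes ≈ 10.7 GB for a single activation, which can never fit in the 3.82 GiB of a GTX 1650 no matter how small the dataset is. A minimal way to stay on that GPU is to downscale the images and shrink the batch in the generators; the sketch below reuses the train_dir and valid_dir from the question, and the 256x256 target size and batch of 8 are illustrative assumptions, not tuned values:

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

# Downscale the 1024x1024 images and use a smaller batch so the
# convolution activations fit in 3.82 GiB of GPU memory
train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=(256, 256),  # illustrative; tune to your memory budget
        batch_size=8,            # illustrative; smaller batches need less memory
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        valid_dir,
        target_size=(256, 256),
        batch_size=8,
        class_mode='binary')

With 256x256 inputs, the same first-layer activation shrinks to 8 × 128 × 254 × 254 × 4 bytes ≈ 264 MB, well within the card's memory.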
