
Tensorflow distributed learning is not working, core is dumped?

I am trying to get my machine to run a GRU model with two GPUs, but it keeps telling me the core is dumped. I have installed the drivers and cuDNN, and I can see that TensorFlow recognizes both GPUs. Can anyone suggest what to do? So far I have tried setting custom options, but I am not sure how to solve this.
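For reference, I confirmed that both GPUs are visible with a check along these lines:

import tensorflow as tf

# Both cards should show up, as /physical_device:GPU:0 and /physical_device:GPU:1.
print(tf.config.list_physical_devices('GPU'))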

Here is my model:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dropout
from tensorflow.keras import layers

#wrap data in Dataset objects for distributed learning
X_train = tf.data.Dataset.from_tensor_slices((X_train))
y_train = tf.data.Dataset.from_tensor_slices((y_train))

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
with strategy.scope():
    #init model
    model = Sequential()

    #utilize gpu
    model.add(GRU(units=50, return_sequences=True,
                  input_shape=(X_train.shape[1], X_train.shape[2]),
                  activation='tanh', recurrent_activation='sigmoid'))
    model.add(Dropout(.5))
    model.add(GRU(units=50, return_sequences=True,
                  activation='tanh', recurrent_activation='sigmoid'))
    #output layer
    model.add(layers.Dense(units=1, activation='sigmoid'))
    #compile model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics='accuracy')

#fit model
batch_size = 32
X_train = X_train.batch(batch_size)
y_train = y_train.batch(batch_size)

# Disable AutoShard.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
X_train = X_train.with_options(options)
y_train = y_train.with_options(options)

model.fit(X_train, y_train, epochs=300)

Here is the error I get:

2021-01-08 00:08:27.507289: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-08 00:08:27.507968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-08 00:08:27.564131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.564900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 980 computeCapability: 5.2
coreClock: 1.367GHz coreCount: 16 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 208.91GiB/s
2021-01-08 00:08:27.565007: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.565748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:06:00.0 name: GeForce GTX 980 computeCapability: 5.2
coreClock: 1.2785GHz coreCount: 16 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 208.91GiB/s
2021-01-08 00:08:27.565804: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-08 00:08:27.567571: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-08 00:08:27.567708: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-08 00:08:27.568535: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-08 00:08:27.568800: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-08 00:08:27.570596: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-08 00:08:27.571225: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-08 00:08:27.571389: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-08 00:08:27.571524: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.572357: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.573130: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.573899: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.574619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2021-01-08 00:08:27.575207: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-08 00:08:27.673899: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.674411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 980 computeCapability: 5.2
coreClock: 1.367GHz coreCount: 16 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 208.91GiB/s
2021-01-08 00:08:27.674549: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.675013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:06:00.0 name: GeForce GTX 980 computeCapability: 5.2
coreClock: 1.2785GHz coreCount: 16 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 208.91GiB/s
2021-01-08 00:08:27.675092: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-08 00:08:27.675135: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-08 00:08:27.675168: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-08 00:08:27.675205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-08 00:08:27.675249: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-08 00:08:27.675271: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-08 00:08:27.675293: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-08 00:08:27.675314: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-08 00:08:27.675417: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.676106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.676787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.677421: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:27.677969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2021-01-08 00:08:27.678077: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-08 00:08:28.411566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-08 00:08:28.411610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1 
2021-01-08 00:08:28.411617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y 
2021-01-08 00:08:28.411621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N 
2021-01-08 00:08:28.411922: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:28.412587: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:28.413095: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:28.413572: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:28.414026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3113 MB memory) -> physical GPU (device: 0, name: GeForce GTX 980, pci bus id: 0000:01:00.0, compute capability: 5.2)
2021-01-08 00:08:28.414449: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:28.415017: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 00:08:28.415479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 3552 MB memory) -> physical GPU (device: 1, name: GeForce GTX 980, pci bus id: 0000:06:00.0, compute capability: 5.2)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-01-08 00:08:30.314997: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_9"
op: "FlatMapDataset"
input: "PrefetchDataset/_8"
attr {
  key: "Targuments"
  value {
    list {
    }
  }
}
attr {
  key: "f"
  value {
    func {
      name: "__inference_Dataset_flat_map_slice_batch_indices_1224"
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: -1
        }
      }
    }
  }
}
attr {
  key: "output_types"
  value {
    list {
      type: DT_INT64
    }
  }
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
2021-01-08 00:08:30.327035: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-08 00:08:30.347472: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3693415000 Hz
Epoch 1/300
INFO:tensorflow:batch_all_reduce: 17 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 17 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2021-01-08 00:08:43.479925: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-08 00:08:43.658681: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-08 00:08:54.655804: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated
2021-01-08 00:08:54.655885: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Aborted (core dumped)

This is not a solution. I am just pasting the OP's code below with some small modifications so that it actually runs. For anyone visiting this page: the OP's code ran without errors using the container tensorflow/tensorflow:2.9.1-gpu-jupyter.

import tensorflow as tf
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

batch = 32
myshape = (128,128)
X_train = np.random.rand(batch*5,128,128)
y_train = np.random.rand(batch*5,1)

#wrap data in Dataset objects for distributed learning
dataset = tf.data.Dataset.from_tensor_slices((X_train,y_train))
    
strategy = tf.distribute.MirroredStrategy()
# devices=["/gpu:0", "/gpu:1"]

with strategy.scope():
    #init model
    model = keras.Sequential()

    #utilize gpu
    model.add(layers.GRU(units=50, return_sequences=True, input_shape=myshape,
                         activation='tanh', recurrent_activation='sigmoid'))
    model.add(layers.Dropout(.5))
    model.add(layers.GRU(units=50, return_sequences=True,
                         activation='tanh', recurrent_activation='sigmoid'))
    #output layer
    model.add(layers.Dense(units=1, activation='sigmoid'))
    #compile model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics='accuracy')

#fit model
dataset = dataset.batch(batch)

# Disable AutoShard.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
dataset = dataset.with_options(options)

model.fit(dataset, epochs =300)
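For reference, a container like that can typically be started with something along the lines of docker run --gpus all -it tensorflow/tensorflow:2.9.1-gpu-jupyter (this assumes the NVIDIA container toolkit is installed; the exact flags depend on your setup).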

Edit:

I was also getting Aborted (core dumped) when using tf.distribute.MirroredStrategy, so the code block above is a good first step for a differential diagnosis: first establish that it is not Tensorflow itself, so the problem must be in the underlying libraries (unlikely if you are using docker), or, more likely, in the dataset generator or the model.
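For example, a minimal sanity check along those lines (a sketch, assuming X_train, y_train, and dataset are built as in the block above) is to first iterate the input pipeline by itself, then fit a comparable model outside any strategy scope, before re-introducing MirroredStrategy:

# Step 1: rule out the input pipeline by iterating it directly.
for features, labels in dataset.take(2):
    print(features.shape, labels.shape)

# Step 2: rule out the model by building and fitting it *outside*
# strategy.scope(), so it runs on a single device; if this also works,
# the strategy/NCCL layer becomes the main suspect.
single_model = keras.Sequential([
    layers.GRU(50, return_sequences=True, input_shape=myshape),
    layers.Dense(1, activation='sigmoid'),
])
single_model.compile(optimizer='adam', loss='binary_crossentropy')
single_model.fit(dataset, epochs=1)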

It turns out that if you use a custom train_step/test_step, you will need to implement a distributed_train_step/distributed_test_step. Sample code is pasted below; see tutorial 0 and tutorial 1 for more details.

def train_step(inputs):
  features, labels = inputs

  with tf.GradientTape() as tape:
    predictions = model(features, training=True)
    loss = compute_loss(labels, predictions)

  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  return loss

@tf.function
def distributed_train_step(dist_inputs):
  # run the per-replica step on every GPU, then sum the per-replica losses
  per_replica_losses = strategy.run(train_step, args=(dist_inputs,))
  return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

@tf.function
def distributed_test_step(dataset_inputs):
  # test_step is assumed to be defined analogously to train_step
  return strategy.run(test_step, args=(dataset_inputs,))
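For completeness, here is a minimal sketch of how these step functions are typically driven, following the pattern in those tutorials (strategy, dataset, and distributed_train_step are assumed to be defined as above; the epoch count is just for illustration):

EPOCHS = 10  # hypothetical value for illustration

# Shard the batched dataset across the replicas managed by the strategy.
dist_dataset = strategy.experimental_distribute_dataset(dataset)

for epoch in range(EPOCHS):
    total_loss, num_batches = 0.0, 0
    for dist_inputs in dist_dataset:
        total_loss += distributed_train_step(dist_inputs)
        num_batches += 1
    print(f"epoch {epoch}: mean loss {total_loss / num_batches:.4f}")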
