I want to run distributed prediction on my GPU cluster using TF 2.0. I trained a Keras CNN with MirroredStrategy and saved it. I can load the model and call .predict() on it, but I was wondering: does this automatically run distributed prediction across the available GPUs? If not, how can I run distributed prediction to speed up inference and make use of all available GPU memory?
At the moment, when running many large predictions, I exceed the 12 GB memory of one of my GPUs (the job needs about 17 GB) and inference fails with an out-of-memory error:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.12GiB
but I have multiple GPUs and would like to use their memory as well. Thanks.
I was able to piece together single-worker, multi-GPU prediction as follows (consider it a sketch - it uses plumbing code that's not generally applicable, but it should give you a template to work from):
# https://github.com/tensorflow/tensorflow/issues/37686
# https://www.tensorflow.org/tutorials/distribute/custom_training
def compute_and_write_ious_multi_gpu(path: str, filename_csv: str, include_sampled: bool):
    strategy = tf.distribute.MirroredStrategy()
    util.log('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    (ds, s, n) = dataset(path, shuffle=False, repeat=False, mask_as_input=True)
    dist_ds = strategy.experimental_distribute_dataset(ds)

    def predict_step(inputs):
        images, labels = inputs
        return model(images, training=False)

    @tf.function
    def distributed_predict_step(dataset_inputs):
        per_replica_predictions = strategy.run(predict_step, args=(dataset_inputs,))
        return per_replica_predictions  # unwrapped by the caller below

    # https://stackoverflow.com/questions/57549448/how-to-convert-perreplica-to-tensor
    def unwrap(per_replica):  # -> list of numpy arrays
        if strategy.num_replicas_in_sync > 1:
            out = per_replica.values
        else:
            out = (per_replica,)
        return [x.numpy() for x in out]

    with strategy.scope():
        model = wrap_model()

    util.log(f'Starting distributed prediction for {filename_csv}')
    ious = [unwrap(distributed_predict_step(x)) for x in dist_ds]
    # flatten the list of per-step lists into one flat list
    # https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
    ious = [item for sublist in ious for item in sublist]
    util.log(f'Distributed prediction done for {filename_csv}')

    ious = np.concatenate(ious).ravel().tolist()
    ious = round_ious(ious)
    ious = list(zip(ious, ds.all_image_paths))
    ious.sort()
    write_ious(ious, filename_csv, include_sampled)
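The unwrap-and-flatten bookkeeping is the fiddly part of the sketch above. Here is a minimal, framework-free illustration of the same pattern; FakePerReplica and the sample values are stand-ins for TensorFlow's PerReplica container (which exposes per-device results through a .values tuple) and for real per-replica prediction batches:

```python
# Stand-in for tf.distribute's PerReplica container (hypothetical class,
# for illustration only): per-device results are exposed via `.values`.
class FakePerReplica:
    def __init__(self, values):
        self.values = values

NUM_REPLICAS = 2  # pretend we ran on two GPUs

def unwrap(per_replica):
    # With more than one replica, strategy.run returns one result per
    # device; with a single replica it returns the bare result.
    if NUM_REPLICAS > 1:
        out = per_replica.values
    else:
        out = (per_replica,)
    return list(out)

# Two "steps" of predictions, each split across the two replicas.
step_results = [
    FakePerReplica(([0.1, 0.2], [0.3, 0.4])),
    FakePerReplica(([0.5], [0.6])),
]

# Unwrap each step, flatten the list of per-step lists, then flatten
# the per-replica chunks into one flat list of scores.
per_step = [unwrap(r) for r in step_results]
flat = [item for sublist in per_step for item in sublist]
scores = [x for chunk in flat for x in chunk]
print(scores)  # -> [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

In the real code, np.concatenate(...).ravel() does this last flattening over numpy arrays instead of Python lists.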
This does distribute the load across the GPUs, but unfortunately it makes poor use of them: in my particular case the corresponding single-GPU code runs in ~12 hours, while this version takes 7.7 hours, so not even a 2x speedup despite having 8x the number of GPUs.
I suspect it's mostly a data-feeding issue, but I don't know how to fix it. Hopefully someone else can provide better insight.