简体   繁体   English

TensorFlow1.15,多GPU-1-machine,batch_size怎么设置?

[英]TensorFlow1.15, multi-GPU-1-machine, how to set batch_size?

I am pretraining BERT in 1 machine with 4 GPU.我在一台机器上用 4 个 GPU 预训练 BERT。

The input function code:输入 function 代码:

    def input_fn(params):
        """The actual input function."""
        batch_size = FLAGS.train_batch_size

        name_to_features = {
            "input_ids":
                tf.FixedLenFeature([max_seq_length], tf.int64),
            "input_mask":
                tf.FixedLenFeature([max_seq_length], tf.int64),
            "segment_ids":
                tf.FixedLenFeature([max_seq_length], tf.int64),
            "masked_lm_positions":
                tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
            "masked_lm_ids":
                tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
            "masked_lm_weights":
                tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
            "next_sentence_labels":
                tf.FixedLenFeature([1], tf.int64),
        }

        # For training, we want a lot of parallel reading and shuffling.
        # For eval, we want no shuffling and parallel reading doesn't matter.
        if is_training:
            d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
            d = d.repeat()
            d = d.shuffle(buffer_size=len(input_files))

            # `cycle_length` is the number of parallel files that get read.
            cycle_length = min(num_cpu_threads, len(input_files))

            # `sloppy` mode means that the interleaving is not exact. This adds
            # even more randomness to the training pipeline.
            d = d.apply(
                tf.contrib.data.parallel_interleave(
                    tf.data.TFRecordDataset,
                    sloppy=is_training,
                    cycle_length=cycle_length))
            d = d.shuffle(buffer_size=100)
        else:
            d = tf.data.TFRecordDataset(input_files)
            # Since we evaluate for a fixed number of steps we don't want to encounter
            # out-of-range exceptions.
            d = d.repeat()

        # We must `drop_remainder` on training because the TPU requires fixed
        # size dimensions. For eval, we assume we are evaluating on the CPU or GPU
        # and we *don't* want to drop the remainder, otherwise we wont cover
        # every sample.
        d = d.apply(
            tf.contrib.data.map_and_batch(
                lambda record: _decode_record(record, name_to_features),
                batch_size=batch_size,
                num_parallel_batches=num_cpu_threads,
                drop_remainder=True))
        d = d.prefetch(10)
        return d

The mirrow strategy code:镜像策略代码:

    distribution = tf.contrib.distribute.MirroredStrategy(
        devices=["device:GPU:%d" % i for i in range(FLAGS.n_gpus)],
        # num_gpus=4,
        cross_tower_ops=tf.distribute.HierarchicalCopyAllReduce())
    run_config = RunConfig(
        train_distribute=distribution,
        # eval_distribute=dist_strategy,
        log_step_count_steps=log_every_n_steps,
        model_dir=FLAGS.output_dir,
        save_checkpoints_steps=FLAGS.save_checkpoints_steps)

    model_fn = model_fn_builder(
        bert_config=bert_config,
        init_checkpoint=FLAGS.init_checkpoint,
        learning_rate=FLAGS.learning_rate,
        num_train_steps=FLAGS.num_train_steps,
        num_warmup_steps=FLAGS.num_warmup_steps,
        use_tpu=FLAGS.use_tpu,
        use_one_hot_embeddings=FLAGS.use_tpu)

    # If TPU is not available, this will fall back to normal Estimator on CPU
    # or GPU.
    estimator = Estimator(
        model_fn=model_fn,
        params={},
        config=run_config)

The problem is that I have 4 GPU.问题是我有 4 个 GPU。 Each GPU could run 8 batchsize at most.每个 GPU 最多可以运行 8 个批大小。

I set train_batch_size = 8 not 32. Is OK but I don't know each GPU get different data in one training step.我设置train_batch_size = 8而不是 32。可以,但我不知道每个 GPU 在一个训练步骤中获得不同的数据。

If I set train_batch_size = 32 , it will out of memory (OOM).如果我设置train_batch_size = 32 ,它将超出 memory (OOM)。

Is my code right now?我的代码现在是吗? Will the data be distributed to 4 GPU and each GPU get different data?数据会分配给4个GPU,每个GPU得到不同的数据吗?

According to BERT-GPU .根据BERT-GPU

No problem with the code.代码没有问题。

The batch_size is for 1 GPU. batch_size用于 1 GPU。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM