
TensorFlow Parameter Servers on SageMaker

I am trying to understand how parameter servers (PS's) work for distributed training in TensorFlow on Amazon SageMaker.

To make things more concrete, I am able to run the example from AWS using PS's: https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-distribution-options/tf-distributed-training.ipynb

Here is the code block that initializes the TensorFlow estimator:

from sagemaker.tensorflow import TensorFlow

git_config = {'repo': 'https://github.com/aws-samples/amazon-sagemaker-script-mode', 'branch': 'master'}

ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

model_dir = "/opt/ml/model"

distributions = {'parameter_server': {
                    'enabled': True}
                }
hyperparameters = {'epochs': 60, 'batch-size' : 256}

estimator_ps = TensorFlow(
                       git_config=git_config,
                       source_dir='tf-distribution-options/code',
                       entry_point='train_ps.py', 
                       base_job_name='ps-cifar10-tf',
                       role=role,
                       framework_version='1.13',
                       py_version='py3',
                       hyperparameters=hyperparameters,
                       train_instance_count=ps_instance_count, 
                       train_instance_type=ps_instance_type,
                       model_dir=model_dir,
                       tags = [{'Key' : 'Project', 'Value' : 'cifar10'},{'Key' : 'TensorBoard', 'Value' : 'dist'}],
                       distributions=distributions)

Going through the TensorFlow documentation, it seems that a device scope can be used to assign a variable to a particular task. However, I never see this done when running training jobs on SageMaker. In the AWS example, the model is defined in:

https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-distribution-options/code/model_def.py

Here is a snippet:

def get_model(learning_rate, weight_decay, optimizer, momentum, size, mpi=False, hvd=False):

    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', input_shape=(HEIGHT, WIDTH, DEPTH)))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))

    ...

    model.add(Flatten())
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(NUM_CLASSES))
    model.add(Activation('softmax'))

    if mpi:
        size = hvd.size()

    if optimizer.lower() == 'sgd':
        ...

    if mpi:
        opt = hvd.DistributedOptimizer(opt)

    model.compile(loss='categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])

    return model

Here, there are no references to distribution strategies (except for MPI/Horovod, but that flag is set to False when using PS's). Somehow, TensorFlow or the SageMaker container decides where the variables for each layer should be stored, yet I don't see anything in the container code that touches a distribution strategy.
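For comparison, explicit placement with a device scope in TF 1.x would look roughly like the sketch below (made up for illustration; nothing like it appears in the example code):

import tensorflow as tf

# Pin a variable to a specific parameter server task and an op to a worker.
with tf.device('/job:ps/task:0'):
    weights = tf.get_variable('weights', shape=[512, 10])

with tf.device('/job:worker/task:0'):
    logits = tf.matmul(tf.zeros([1, 512]), weights)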

I am able to run this code and train the model using 1 and 2 instances. When I do so, I see a decrease of almost 50% in runtime, suggesting that distributed training is occurring.

My questions are:

  1. How does TensorFlow decide the distribution of variables on the PS's? In the example code, there is no explicit reference to devices. Somehow the distribution is done automatically.
  2. Is it possible to see which parameters have been assigned to each PS? Or to see what the communication between PS's looks like? If so, how?

> How does TensorFlow decide the distribution of variables on the PS's? In the example code, there is no explicit reference to devices. Somehow the distribution is done automatically.

The TensorFlow image provided by SageMaker contains the code that sets up TF_CONFIG and launches the parameter servers for multi-worker training. See the code [here][1]. In this setup, each node in the cluster has both a parameter server and a worker configured.
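For a two-instance job like the one above, the TF_CONFIG that the container exports on each node looks roughly like this (host names, ports, and the task entry are illustrative; see the linked training.py for the exact construction):

import json
import os

# Cluster spec listing every host as a PS, with the first host as master and
# the rest as workers; the 'task' entry differs per node and per process.
tf_config = {
    'cluster': {
        'master': ['algo-1:2222'],
        'worker': ['algo-2:2222'],
        'ps': ['algo-1:2223', 'algo-2:2223'],
    },
    'task': {'type': 'master', 'index': 0},
    'environment': 'cloud',
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)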

It's not using any DistributionStrategy, so the default strategy is used. See [here][2].
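To give a rough picture of what that default placement amounts to in TF 1.x estimator-based training: when TF_CONFIG contains a ps job, variables are typically spread over the ps tasks by a round-robin device function such as tf.train.replica_device_setter. A sketch of that behaviour (illustrative host names, not code taken from the SageMaker container):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps': ['algo-1:2223', 'algo-2:2223'],
    'worker': ['algo-1:2222', 'algo-2:2222'],
})

# Variables created under this device function are spread round-robin over
# the ps tasks; the remaining ops stay on the local worker device.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w1 = tf.get_variable('w1', shape=[512, 512])   # roughly -> /job:ps/task:0
    w2 = tf.get_variable('w2', shape=[512, 10])    # roughly -> /job:ps/task:1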

If you would like to use a different DistributionStrategy or a different TF_CONFIG, you will need to disable the parameter_server option when launching the SageMaker training job and set everything up in your training script.
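A minimal sketch of that approach (the script name and framework_version are placeholders; it assumes a framework version where tf.distribute.experimental is available, and uses the SM_HOSTS / SM_CURRENT_HOST environment variables that SageMaker sets):

# In the notebook: no 'distributions' argument, so the container does not
# launch PS processes or set TF_CONFIG for you.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='my_train.py',      # placeholder script
                       role=role,
                       framework_version='1.15',       # assumed version
                       py_version='py3',
                       train_instance_count=2,
                       train_instance_type='ml.p3.2xlarge')

# In my_train.py: build TF_CONFIG yourself, then create whatever strategy
# you want before building the model.
import json
import os
import tensorflow as tf

hosts = json.loads(os.environ['SM_HOSTS'])          # e.g. ['algo-1', 'algo-2']
current_host = os.environ['SM_CURRENT_HOST']
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': [h + ':2222' for h in hosts]},
    'task': {'type': 'worker', 'index': hosts.index(current_host)},
})
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()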

> Is it possible to see which parameters have been assigned to each PS? Or to see what the communication between PS's looks like? If so, how?

You should be able to get some information from the output logs, which can be found in CloudWatch. The link to the logs is available on the Training Job console page.
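If you want the variable placement itself to show up in those logs, one option (a suggestion, not something the AWS script already does) is to turn on TF 1.x device placement logging in the training script; each instance's CloudWatch log stream will then contain a line for every op and variable placement:

import tensorflow as tf

# Log the device every op and variable is assigned to; the output goes to
# stderr, which SageMaker forwards to the instance's CloudWatch log stream.
config = tf.ConfigProto(log_device_placement=True)
tf.keras.backend.set_session(tf.Session(config=config))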

[1]: https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/master/src/sagemaker_tensorflow_container/training.py#L37
[2]: https://www.tensorflow.org/guide/distributed_training#default_strategy

 