
How to get reproducible results in Amazon SageMaker with TensorFlow Estimator?

I am currently using the AWS SageMaker Python SDK to train an EfficientNet model ( https://github.com/qubvel/efficientnet ) on my data. Specifically, I use the TensorFlow estimator as follows. This code runs in a SageMaker notebook instance:

import sagemaker
from sagemaker.tensorflow.estimator import TensorFlow
### sagemaker version = 1.50.17, python version = 3.6

estimator = TensorFlow("train.py", py_version = "py3", framework_version = "2.1.0",
                       role = sagemaker.get_execution_role(), 
                       train_instance_type = "ml.m5.xlarge", 
                       train_instance_count = 1,
                       image_name = 'xxx.dkr.ecr.xxx.amazonaws.com/xxx',
                       hyperparameters = {list of hyperparameters here: epochs, batch size},
                       subnets = [xxx], 
                       security_group_ids = [xxx])
estimator.fit({
   'class_1': 's3_path_class_1',
   'class_2': 's3_path_class_2'
})

The code in train.py contains the usual training procedure: get the images and labels from S3, convert them to the correct array shape for EfficientNet's input, then split them into training, validation, and test sets. In order to get reproducible results, I use the following reset_random_seeds function ( If Keras results are not reproducible, what's the best practice for comparing models and choosing hyper parameters? ) before calling the EfficientNet model itself.
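One way to guarantee that the train/validation/test partition itself is identical on every run is to derive it from a seeded permutation. A minimal sketch (the helper name and split fractions are assumptions, not taken from the actual train.py):

```python
import numpy as np

def split_indices(n_samples, seed=1, val_frac=0.2, test_frac=0.2):
    """Return (train, val, test) index arrays; identical for a fixed seed."""
    rng = np.random.RandomState(seed)   # local RNG: unaffected by global state
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

train_idx, val_idx, test_idx = split_indices(1000)
```

Using a local RandomState (or np.random.default_rng) instead of the global np.random state keeps the split independent of any other random draws the script happens to make before the split.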

### code of train.py

import os
os.environ['PYTHONHASHSEED']=str(1)
import numpy as np
import tensorflow as tf
import efficientnet.tfkeras as efn
import random

### tensorflow version = 2.1.0
### tf.keras version = 2.2.4-tf
### efficientnet version = 1.1.0

def reset_random_seeds():
   os.environ['PYTHONHASHSEED']=str(1)
   tf.random.set_seed(1)
   np.random.seed(1)
   random.seed(1)

if __name__ == "__main__":

   ### code for getting training data
   ### ... (I have made sure that the training input is the same every time i re-run the code)
   ### end of code

   reset_random_seeds()
   model = efn.EfficientNetB5(include_top = False, 
      weights = 'imagenet', 
      input_shape = (80, 80, 3),
      pooling = 'avg',
      classes = 3)
   model.compile(optimizer = 'Adam', loss = 'categorical_crossentropy')
   model.fit(X_train, Y_train, batch_size = 64, epochs = 30, shuffle = True, verbose = 2)

   ### Prediction section here

However, every time I run the notebook instance, I always get a different result from the previous run. When I switch train_instance_type to "local", I always get the same result on every run of the notebook. So, is the non-reproducible result caused by the training instance type I chose? This instance (ml.m5.xlarge) has 4 vCPUs, 16 GiB of memory, and no GPU. If so, how can I get reproducible results on this training instance?
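One plausible source of run-to-run variation on a multi-core CPU instance is that parallel reductions may add floating-point numbers in a different order on each run, and floating-point addition is not associative. A minimal illustration of the underlying effect (pure Python, not SageMaker-specific):

```python
# The same three numbers, summed in two different orders:
a = (1e16 + -1e16) + 0.1   # cancel the large terms first, then add 0.1
b = (1e16 + 0.1) + -1e16   # 0.1 is absorbed by the large value first

print(a, b)   # the two orders give different results
assert a != b
```

If this is the cause, forcing single-threaded execution in train.py before building the model may make runs deterministic at the cost of speed, e.g. tf.config.threading.set_intra_op_parallelism_threads(1) and tf.config.threading.set_inter_op_parallelism_threads(1) (both available in TF 2.1). This is a hypothesis worth testing, not a confirmed fix.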

Could your inconsistent results be coming from

tf.random.set_seed()

See this post: Tensorflow: Different results with the same random seed
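The linked post's point can be reproduced outside TensorFlow as well: with a single global seed, the value an individual draw produces depends on how many draws happened before it, so any change in the order or number of random operations shifts every later result. A numpy sketch of the effect (in TF 2, tf.random.set_seed sets only the global seed, and each op's effective seed also depends on op creation order, so it behaves analogously):

```python
import numpy as np

np.random.seed(1)
first_run = (np.random.rand(), np.random.rand())   # two draws in sequence

np.random.seed(1)
extra = np.random.rand()          # an extra draw consumes RNG state...
second = np.random.rand()         # ...so this draw matches draw #2, not draw #1

assert extra == first_run[0]      # same position in the stream -> same value
assert second == first_run[1]
assert second != first_run[0]
```

So even with reset_random_seeds() called once, anything that changes how many random ops execute before model construction (data order, extra augmentation calls, thread scheduling) can change the draws the model actually sees.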
