
Training a Keras model in AWS SageMaker

I have a Keras training script on my machine, and I am trying to run it in an AWS SageMaker container. To do that, I used the code below.

from sagemaker.tensorflow import TensorFlow
est = TensorFlow(
    entry_point="caller.py",
    source_dir="./",
    role='role_arn',
    framework_version="2.3.1",
    py_version="py37",
    instance_type='ml.m5.large',
    instance_count=1,
    hyperparameters={'batch': 8, 'epochs': 10},
)

est.fit()

Here caller.py is my entry point. After running the code above, the training job fails with an error saying that keras is not installed. Here is the stack trace:

Traceback (most recent call last):
  File "executor.py", line 14, in <module>
    est.fit()
  File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 682, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/estimator.py", line 1625, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3681, in logs_for_job
    self._check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/thasin/Documents/python/venv/lib/python3.8/site-packages/sagemaker/session.py", line 3240, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job tensorflow-training-2021-06-09-07-14-01-778: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/local/bin/python3.7 caller.py --batch 4 --epochs 10

ModuleNotFoundError: No module named 'keras'

  1. Which instance has Keras pre-installed?
  2. Is there any way to install a Python package into the AWS container, or is there another workaround for this issue?

Note: I have already built my own container, uploaded it to ECR, and run my code in it successfully. Here I am asking what AWS's existing containers can do.

Keras is now part of TensorFlow, so you can simply change your code to use tf.keras instead of keras. Since version 2.3.0 of TensorFlow the two are in sync, so the change should not be difficult. Your container is this one; as you can see from its list of packages, Keras is not included. If you would rather extend a pre-built container, you can take a look here, but I don't recommend it for this use case: for future code maintainability you should go with tf.keras.
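As a rough illustration of that change, here is a minimal sketch of what the top of caller.py could look like when it relies on tf.keras only. The function names parse_args and build_model, the default values, and the tiny two-layer model are all hypothetical placeholders, since the original script is not shown; only the --batch and --epochs flags come from the command line in the traceback. parse_known_args is used on the assumption that the platform may append extra flags of its own.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that arrive as command-line flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch", type=int, default=8)
    parser.add_argument("--epochs", type=int, default=10)
    # parse_known_args tolerates any extra flags instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


def build_model(input_dim):
    """Build a small placeholder model using tf.keras only (hypothetical)."""
    # Imported here so the argument parsing above can be exercised on a
    # machine without TensorFlow installed.
    from tensorflow import keras  # replaces the failing `import keras`

    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(input_dim,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

In the real script you would then call args = parse_args() and pass args.batch and args.epochs to model.fit(...) together with your own training data. Any other line of the form `import keras` or `from keras import ...` would need the same switch to the tensorflow.keras namespace.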
