
How to use FSx for Lustre as input to Amazon SageMaker?

I am trying to set up Amazon SageMaker to read our dataset from our Amazon FSx for Lustre file system.

We are using the SageMaker API, and previously we read our dataset from S3, which worked fine:

from sagemaker.tensorflow import TensorFlow

# `role` and `hyperparameters` are defined elsewhere in our setup
estimator = TensorFlow(
    entry_point='model_script.py',
    image_uri='719982911.dkr.ecr.eu-west-1.amazonaws.com/project_name/sagemaker-training-mlflow:latest',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-2375a679"],
    security_group_ids=["sg-6cb95013", "sg-0bc6ddb1f102bbdb1"],
    debugger_hook_config=False,
)
# The 'training' channel points at an S3 prefix
estimator.fit({
    'training': f"s3://bucket_name/data/{hyperparameters['dataset']}/"
})

But now that I'm changing the input data source to the FSx for Lustre file system, I get an error saying that the file input should start with s3:// or file://. I was following these docs (fsx lustre):

from sagemaker.inputs import FileSystemInput
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='model_script.py',
#    image_uri='71998291.dkr.ecr.eu-west-1.amazonaws.com/bucket_name/sagemaker-training-mlflow:latest',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-2375a679"],
    security_group_ids=["sg-6cb95013", "sg-0bc6ddb1f102bbdb1"],
    debugger_hook_config=False,
)
fsx_data_folder = FileSystemInput(file_system_id='fs-03a0e6927e5ffc449',
                                  file_system_type='FSxLustre',
                                  directory_path='/fsx/data',
                                  file_system_access_mode='ro')
# This is the call that fails
estimator.fit(f"{fsx_data_folder}/{hyperparameters['dataset']}/")

This throws the following error:

ValueError: URI input <sagemaker.inputs.FileSystemInput object at 0x0000016A6C7F0788>/dataset_name/ must be a valid S3 or FILE URI: must start with "s3://" or "file://"

Does anyone understand what I am doing wrong? Thanks in advance!

I was (quite stupidly, it was late ;)) treating the FileSystemInput object as a string instead of an object. Formatting the object into an f-string inserts its default representation, so fit received a plain string. The error complained that this concatenation of object and string is not a valid URI starting with s3:// or file://.
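The mistake can be reproduced without SageMaker at all. The sketch below uses a hypothetical stand-in class, FakeFileSystemInput (not part of the SageMaker SDK), purely to show what Python does when an arbitrary object is interpolated into an f-string; the prefix check is only a rough imitation of what the error message describes.

```python
class FakeFileSystemInput:
    """Hypothetical stand-in for sagemaker.inputs.FileSystemInput (illustration only)."""

    def __init__(self, directory_path):
        self.directory_path = directory_path


fsx_input = FakeFileSystemInput("/fsx/data")

# Interpolating the object uses its default repr, not its directory_path
as_string = f"{fsx_input}/dataset_name/"
print(as_string)

# The resulting string has neither of the prefixes named in the error message
is_valid_uri = as_string.startswith(("s3://", "file://"))
print(is_valid_uri)
```

The printed string has the same shape as the one in the ValueError above, which is why the SDK treated it as a malformed URI rather than as a file system input.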

The correct way to do it is to build the FileSystemInput object from the entire path to the dataset. Note that fit now takes this object itself and mounts it inside the training container at data_dir = "/opt/ml/input/data/training".

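from sagemaker.inputs import FileSystemInput

# Point the input at the full dataset directory on the file system
fsx_data_obj = FileSystemInput(
    file_system_id='fs-03a0e6927e5ffc449',
    file_system_type='FSxLustre',
    directory_path=f"/fsx/data/{hyperparameters['dataset']}",
    file_system_access_mode='ro'
)
# Pass the object itself to fit(), not a string built from it
estimator.fit(fsx_data_obj)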
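On the training-script side, the data can then be read from the mounted directory. A minimal sketch, assuming the container sets the SM_CHANNEL_TRAINING environment variable (the SageMaker framework containers normally do; a custom image may not) and falling back to the path mentioned above otherwise. The helper name resolve_channel_dir is hypothetical.

```python
import os


def resolve_channel_dir(channel="training"):
    """Return the directory where the given input channel is mounted.

    Prefers the SM_CHANNEL_<NAME> environment variable (assumed to be set
    by the container) and falls back to /opt/ml/input/data/<channel>.
    """
    default_dir = f"/opt/ml/input/data/{channel}"
    return os.environ.get(f"SM_CHANNEL_{channel.upper()}", default_dir)


data_dir = resolve_channel_dir("training")

# Only list files when the directory exists, so this also runs outside SageMaker
if os.path.isdir(data_dir):
    for name in sorted(os.listdir(data_dir)):
        print(os.path.join(data_dir, name))
else:
    print(f"{data_dir} is not mounted here")
```

Resolving the directory through one helper keeps model_script.py unchanged whether the channel is backed by S3 or by FSx for Lustre.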
