
How to add additional files in Sagemaker Pipeline Processing Step

I want to have additional files that can be imported in the preprocess.py file, but I am not able to import them directly.

My directory looks like this: [screenshot]

I want to import from the helper_functions directory into preprocess.py.

I tried adding this to the setup.py file, but it did not work:

package_data={"pipelines.ha_forecast.helper_functions": ["*.py"]},

One thing that partially worked was adding this folder as an input, like this:

inputs = [
    ProcessingInput(
        source=f'{project_name}/{module_name}/helper_functions',
        destination="/opt/ml/processing/input/code/helper_functions",
    ),
]

But this put the required files in a different directory, from which I was no longer able to import them.

What is the standard way of doing this?

You have to specify the source_dir. Within your script, you can then import the modules as you normally do.

source_dir (str or PipelineVariable) – Path (absolute, relative or an S3 URI) to a directory with any other training source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a tar.gz file. Structure within this directory is preserved when training on Amazon SageMaker.
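If you do want to pass an S3 URI instead of a local path, a minimal sketch of packaging a folder and uploading it could look like this (the bucket, prefix and local path below are placeholders, not values from the question):

import tarfile
from sagemaker.s3 import S3Uploader

# Pack the whole source folder into a tar.gz archive
with tarfile.open("sourcedir.tar.gz", "w:gz") as tar:
    tar.add("path/to/your/source_dir", arcname=".")

# Upload it; the archive then lives at s3://your-bucket/your-prefix/sourcedir.tar.gz
S3Uploader.upload(
    local_path="sourcedir.tar.gz",
    desired_s3_uri="s3://your-bucket/your-prefix",
)

That S3 URI can then be passed as source_dir; a plain local directory path works just as well.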

Look at the documentation for Processing in general (you have to use FrameworkProcessor and not framework-specific processors like SKLearnProcessor).

PS: The answer is similar to that of the question "How to install additional packages in sagemaker pipeline".

Within the specified folder there must be the script (in your case preprocess.py), any other files/modules that may be needed, and optionally a requirements.txt file.

The structure of the folder will then be:

BASE_DIR/
|─ helper_functions/
|  |─ your_utils.py
|─ requirements.txt
|─ preprocess.py
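A requirements.txt placed here is picked up and installed before your script runs (see the linked question above); its contents are ordinary pip requirements, for example (the packages below are only placeholders):

# requirements.txt -- example contents; list whatever preprocess.py needs
pandas>=1.5
scikit-learn>=1.2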

Within your preprocess.py, you can then import the helpers simply with:

from helper_functions.your_utils import your_class, your_func
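For illustration, helper_functions/your_utils.py only needs to define those names; a minimal placeholder could be:

# helper_functions/your_utils.py -- placeholder module for illustration

def your_func(data):
    """Example helper function used by preprocess.py."""
    return data

class your_class:
    """Example helper class used by preprocess.py."""
    def __init__(self, name):
        self.name = name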

So your code becomes:

from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import ProcessingInput, ProcessingOutput

BASE_DIR = your_script_dir_path

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=base_job_name,
    sagemaker_session=pipeline_session,
    role=role
)

step_args = sklearn_processor.run(
    inputs=[your_inputs],
    outputs=[your_outputs],
    code="preprocess.py",
    source_dir=BASE_DIR,
    arguments=[your_arguments],
)

step_process = ProcessingStep(
    name="ProcessingName",
    step_args=step_args
)
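To actually run the step, you wire it into a pipeline as usual; a minimal sketch, where the pipeline name is a placeholder:

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="YourPipelineName",               # placeholder
    steps=[step_process],
    sagemaker_session=pipeline_session,
)
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
execution = pipeline.start()     # start an execution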

It's good practice to keep the folders for the various steps separate and not create overlaps between them.
