How to add additional files in Sagemaker Pipeline Processing Step
I want to have additional files that can be imported in my preprocess.py file, but I am not able to import them directly.
My directory looks like this:
I want to import from the helper_functions directory into preprocess.py.
I tried adding this to the setup.py file, but it did not work:
package_data={"pipelines.ha_forecast.helper_functions": ["*.py"]},
One thing that kind of worked was adding the folder as an input, like this:
inputs = [
ProcessingInput(source=f'{project_name}/{module_name}/helper_functions',
destination="/opt/ml/processing/input/code/helper_functions"),
]
But this put the required files in a different directory, from which I could no longer import them.
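(As an aside: files mounted this way land under the destination given above, which is not on Python's module search path. A hedged workaround, assuming that destination, is to extend `sys.path` at the top of preprocess.py:)

```python
import sys

# The ProcessingInput above mounts helper_functions under this parent directory
# (the destination chosen in the input definition); assuming that path, adding
# it to sys.path makes `helper_functions` importable as a regular package.
EXTRA_CODE_DIR = "/opt/ml/processing/input/code"

if EXTRA_CODE_DIR not in sys.path:
    sys.path.insert(0, EXTRA_CODE_DIR)

# from helper_functions.your_utils import your_func  # would now resolve
```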
What is the standard way of doing this?
You have to specify the source_dir. Within your script you can then import the modules as you normally do.
source_dir (str or PipelineVariable) – Path (absolute, relative or an S3 URI) to a directory with any other training source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a tar.gz file. Structure within this directory is preserved when training on Amazon SageMaker.
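As the quote notes, an S3 source_dir must point to a tar.gz archive. A minimal local sketch of packing one (the helper name and the `sourcedir.tar.gz` filename are assumptions, and the S3 upload itself is left out):

```python
import os
import tarfile
import tempfile

def make_sourcedir_archive(base_dir: str, out_path: str) -> str:
    """Pack the contents of base_dir into a tar.gz usable as an S3 source_dir."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in os.listdir(base_dir):
            # arcname=name keeps the entry point at the archive root,
            # preserving the directory structure inside base_dir.
            tar.add(os.path.join(base_dir, name), arcname=name)
    return out_path

# Demo with throwaway directories standing in for the real source dir.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    open(os.path.join(src, "preprocess.py"), "w").close()
    archive = make_sourcedir_archive(src, os.path.join(dst, "sourcedir.tar.gz"))
    with tarfile.open(archive, "r:gz") as tar:
        print(tar.getnames())  # the entry point sits at the archive root
```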
Look at the Processing documentation in general (you have to use FrameworkProcessor, not the framework-specific processors like SKLearnProcessor).
PS: The answer is similar to that of the question "How to install additional packages in sagemaker pipeline".
The specified folder must contain the script (in your case preprocess.py), any other files/modules that may be needed, and eventually also the requirements.txt file.
The structure of the folder will then be:
BASE_DIR/
|─ helper_functions/
| |─ your_utils.py
|─ requirements.txt
|─ preprocess.py
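Before launching the job, it can help to sanity-check this layout locally. A minimal sketch (`validate_source_dir` is a hypothetical helper, and the temporary directory just stands in for BASE_DIR):

```python
import os
import tempfile

def validate_source_dir(base_dir: str, entry_point: str = "preprocess.py") -> None:
    """Raise if the folder is missing the entry-point script."""
    if not os.path.isfile(os.path.join(base_dir, entry_point)):
        raise FileNotFoundError(f"{entry_point} not found in {base_dir}")

# Recreate the structure from the answer in a temp dir and check it.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "helper_functions"))
    for rel in ("helper_functions/your_utils.py", "requirements.txt", "preprocess.py"):
        open(os.path.join(base, rel), "w").close()
    validate_source_dir(base)  # passes silently: the layout is complete
```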
Within your preprocess.py, you can then call the scripts in a simple way with:
from helper_functions.your_utils import your_class, your_func
So your code becomes:
from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import ProcessingInput, ProcessingOutput
BASE_DIR = your_script_dir_path
sklearn_processor = FrameworkProcessor(
estimator_cls=SKLearn,
framework_version=framework_version,
instance_type=processing_instance_type,
instance_count=processing_instance_count,
base_job_name=base_job_name,
sagemaker_session=pipeline_session,
role=role
)
step_args = sklearn_processor.run(
inputs=[your_inputs],
outputs=[your_outputs],
code="preprocess.py",
source_dir=BASE_DIR,
arguments=[your_arguments],
)
step_process = ProcessingStep(
name="ProcessingName",
step_args=step_args
)
It's good practice to keep the folders for the various steps separate from each other and avoid overlaps.