
Sagemaker processing job with PySpark and Step Functions

This is my problem: I have to run a SageMaker processing job using custom code written in PySpark. I've used the SageMaker SDK by running these commands:

spark_processor = sagemaker.spark.processing.PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="2.4",
    role=role_arn,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1800,
)

spark_processor.run(
    submit_app="processing.py",
    arguments=['s3_input_bucket', bucket_name,
               's3_input_file_path', file_path]
)

Now I have to automate the workflow by using Step Functions. For this purpose, I've written a Lambda function to do that, but I receive the following error:

{
  "errorMessage": "Unable to import module 'lambda_function': No module named 'sagemaker'",
  "errorType": "Runtime.ImportModuleError"
}

This is my Lambda function:

import sagemaker

def lambda_handler(event, context):
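    # NOTE: role_arn is not defined anywhere in this handler; in a real
    # deployment it would have to come from the event payload or an
    # environment variable.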
    spark_processor = sagemaker.spark.processing.PySparkProcessor(
        base_job_name="spark-preprocessor",
        framework_version="2.4",
        role=role_arn,
        instance_count=2,
        instance_type="ml.m5.xlarge",
        max_runtime_in_seconds=1800,
    )

    spark_processor.run(
        submit_app="processing.py",
        arguments=['s3_input_bucket', event["bucket_name"],
                   's3_input_file_path', event["file_path"]
                   ]
    )

My question is: how can I create a step in my state machine for running PySpark code using SageMaker Processing?

Thank you

The SageMaker SDK is not installed by default in the Lambda container environment: you have to include it in the Lambda zip that you upload to S3.

There are various ways to do this; one of the easiest is to deploy your Lambda with the Serverless Application Model (SAM) CLI. In this case it might be enough to place sagemaker in a requirements.txt in the folder that contains your Lambda code, and SAM will ensure that the dependency is included in the zip.
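For illustration only, a minimal sketch of that route; the project layout and build flags here are assumptions, not part of the original answer:

# Assumed project layout:
#   template.yaml            - SAM template with an AWS::Serverless::Function
#                              resource whose CodeUri points at src/
#   src/lambda_function.py   - the handler shown above
#   src/requirements.txt     - a single line: sagemaker
sam build --use-container   # resolves requirements.txt into the build artifact
sam deploy --guided         # packages the artifact, uploads it to S3, and deploys

Building with --use-container is optional, but it installs the dependencies inside a Lambda-like Docker image, which avoids mismatches between your local OS and the Lambda runtime.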

Alternatively you can create the zip manually with pip install sagemaker -t lambda_folder, but you should execute this command on Amazon Linux, for example on an EC2 instance with the appropriate image or in a Docker container. Search for "python dependencies in aws lambda" for more info.
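A rough sketch of the Docker option (the image tag and folder name are assumptions):

# Install the dependency next to lambda_function.py, using an
# Amazon-Linux-based build image so any compiled wheels match Lambda:
docker run --rm -v "$PWD/lambda_folder":/var/task \
    public.ecr.aws/sam/build-python3.8 \
    pip install sagemaker -t /var/task

# Zip the *contents* of the folder (not the folder itself):
cd lambda_folder && zip -r ../lambda.zip .

The resulting lambda.zip is what you upload to S3 when deploying the Lambda.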
