
What is the equivalent Glue Spark configuration to use SageMaker Processing?

I'm trying to migrate custom Glue PySpark jobs to SageMaker Processing to benefit from the MLOps features provided by SageMaker Pipelines.

  1. In Glue, my job uses 10 G.1X instances (4 vCPUs, 16 GB memory each) and completes in 10 minutes.
  2. I tried to use similar SageMaker Processing instances (10 ml.m5.xlarge instances, 4 vCPUs and 16 GB memory each), but the job failed with an OOM error: "OutOfMemoryError: Please use an instance type with more memory, or ensure that your processing container does not use more memory than available." When I checked the CloudWatch instance metrics, the maximum memory usage across all 10 instances was only 37.4%, so the instances did not actually run out of memory.

Glue doesn't expose spark-submit parameters (e.g., --conf spark.executor.memory) on its dashboard, so how can I check whether my SageMaker Processing job uses the same configuration as the Glue job, and what is the best practice for keeping their Spark configurations identical?

There is a specific component that allows you to do data processing with Apache Spark in Amazon SageMaker.

It is called PySparkProcessor.

It works like any other Processing Job. You can also, of course, specify your run args.


An example of specifying a memory configuration:

from sagemaker.spark.processing import PySparkProcessor

# role, bucket, input_prefix, and output_prefix are assumed to be defined elsewhere
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Spark properties are passed via the EMR-style "spark-defaults" classification
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.driver.memory": "4g"
        },
    }
]

spark_processor.run(
    submit_app="preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix,
               's3_output_bucket', bucket,
               's3_output_key_prefix', output_prefix],
    configuration=configuration
)
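
To get closer to the Glue job's behavior, you can set the executor properties the same way. A minimal sketch follows; the 10g values are an assumption based on Glue's documented G.1X worker defaults, so verify them against your own Glue config dump (shown at the end of this answer) before relying on them:

# Sketch only: the values below assume Glue G.1X defaults; verify them
# with the config dump from your actual Glue job
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.driver.memory": "10g",    # assumed G.1X driver memory
            "spark.executor.memory": "10g",  # assumed G.1X executor memory
            "spark.executor.cores": "4",     # G.1X workers have 4 vCPUs
        },
    }
]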

You can display the PySpark Glue config with this piece of code:

configurations = spark.sparkContext.getConf().getAll()
for item in configurations:
    print(item)
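
Once you have that dump from the Glue job, one way to keep the two jobs in sync is to copy the relevant properties into the spark-defaults classification you pass to PySparkProcessor. A minimal sketch, where glue_conf stands in for the printed output and the set of keys to carry over is a hypothetical choice you should adjust to your job:

# glue_conf is the list of (key, value) pairs printed by the Glue job above
# (the values here are placeholders)
glue_conf = [
    ("spark.driver.memory", "10g"),
    ("spark.executor.memory", "10g"),
    ("spark.executor.cores", "4"),
]

# Keep only the memory/core settings and rebuild the spark-defaults block
keys_to_copy = {"spark.driver.memory", "spark.executor.memory", "spark.executor.cores"}
properties = {k: v for k, v in glue_conf if k in keys_to_copy}

configuration = [{"Classification": "spark-defaults", "Properties": properties}]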
