What is the equivalent of a Glue Spark configuration when using SageMaker Processing?

Question

I am trying to migrate a custom Glue PySpark job to SageMaker Processing so that I can benefit from the MLOps features provided by SageMaker Pipelines.

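In Glue, my job uses 10 G.1X instances (4 CPUs, 16 GB memory each) and finishes in 10 minutes.
I tried a comparable SageMaker Processing setup (10 ml.m5.xlarge instances with 4 CPUs and 16 GB memory each), but it failed with an OOM error: "OutOfMemoryError: Please use an instance type with more memory, or ensure that your processing container does not use more memory than available." When I checked the CloudWatch instance metrics, the maximum memory usage across all 10 instances was only 37.4%, so the job did not actually use up all the memory.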

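Glue does not expose spark-submit parameters (for example, --conf spark.executor.memory) on its dashboard. How can I check whether my SageMaker Processing job uses the same configuration as the Glue job, and what is the best practice for keeping their Spark configurations the same?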

Answer 1

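There is a dedicated component that lets you use Apache Spark for data processing in Amazon SageMaker.

It is called PySparkProcessor.

It works like any other processing job, and you can pass it Spark run parameters as well.

Example of specifying the memory configuration: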

from sagemaker.spark.processing import PySparkProcessor

# role, bucket, input_prefix, and output_prefix must be defined by the caller
# (role is the IAM execution role for the processing job).
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Spark properties applied to the job through the spark-defaults classification.
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.driver.memory": "4g"
        },
    }
]

# Submit the PySpark script together with the configuration above.
spark_processor.run(
    submit_app="preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix,
               's3_output_bucket', bucket,
               's3_output_key_prefix', output_prefix],
    configuration=configuration
)
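
Since the question is about executor memory rather than driver memory, the same configuration list can carry more than one property. The following is only a sketch: the helper name build_spark_defaults is hypothetical, and the memory and core values are illustrative placeholders, not tuned or verified settings for ml.m5.xlarge. The resulting list has the same shape as the configuration passed to run() above.

```python
def build_spark_defaults(properties):
    """Wrap Spark properties in a single 'spark-defaults' classification entry.

    Hypothetical helper: it only builds the list-of-dicts shape used in the
    example above and does not call any SageMaker API.
    """
    return [
        {
            "Classification": "spark-defaults",
            # Spark property values are strings, so convert numbers explicitly.
            "Properties": {key: str(value) for key, value in properties.items()},
        }
    ]


# Illustrative values only (assumptions); choose values that fit the instance type.
configuration = build_spark_defaults(
    {
        "spark.driver.memory": "4g",
        "spark.executor.memory": "10g",
        "spark.executor.cores": 4,
    }
)
print(configuration)
```

Keeping the properties in one plain dictionary makes it easier to store them next to the job definition and reuse them whenever the job is resubmitted.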

Answer 2

You can display the Spark configuration of the PySpark Glue job with this code:

# `spark` is the active SparkSession (in Glue: glueContext.spark_session)
configurations = spark.sparkContext.getConf().getAll()
for item in configurations:
    print(item)
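
One possible way to use that output is to run the same lines in both the Glue job and the SageMaker Processing job, then compare the two dumps. The sketch below assumes each dump is a list of (key, value) tuples, which is the shape getAll() returns. The function name diff_spark_confs and the sample values are made up for illustration and are not real Glue or SageMaker defaults.

```python
def diff_spark_confs(glue_conf, sagemaker_conf):
    """Return {key: (glue_value, sagemaker_value)} for settings that differ.

    Each argument is a list of (key, value) tuples. A key missing on one
    side is reported with None for that side.
    """
    glue = dict(glue_conf)
    sagemaker = dict(sagemaker_conf)
    differences = {}
    for key in sorted(set(glue) | set(sagemaker)):
        if glue.get(key) != sagemaker.get(key):
            differences[key] = (glue.get(key), sagemaker.get(key))
    return differences


# Made-up sample values (assumptions); real dumps come from each job's logs.
glue_sample = [("spark.executor.memory", "10g"), ("spark.executor.cores", "4")]
sagemaker_sample = [("spark.executor.memory", "2g"), ("spark.executor.cores", "4")]

for key, (glue_value, sm_value) in diff_spark_confs(glue_sample, sagemaker_sample).items():
    print(f"{key}: Glue={glue_value} SageMaker={sm_value}")
```

Any property that shows up in the diff is a candidate for the spark-defaults configuration of the SageMaker job.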

2 solutions

Solution 1
1 2022-10-25 04:47:39

Solution 2
0 2023-01-18 14:00:15
