
What is the equivalent Glue Spark configuration to use SageMaker processing?

I'm trying to migrate custom Glue PySpark jobs to SageMaker Processing to benefit from the MLOps features provided by SageMaker Pipelines.

  1. In Glue, my job uses 10 G.1X instances (4 vCPUs, 16 GB memory each) and completes in about 10 minutes.
  2. I tried similar SageMaker Processing instances (10 ml.m5.xlarge instances, each with 4 vCPUs and 16 GB memory), but the job failed with an OOM error: "OutOfMemoryError: Please use an instance type with more memory, or ensure that your processing container does not use more memory than available." When I checked the CloudWatch instance metrics, the maximum memory usage across all 10 instances was only 37.4%, so the memory was not actually exhausted.

Glue doesn't expose spark-submit parameters (e.g., --conf spark.executor.memory) on its dashboard, so how can I check whether my SageMaker Processing job uses the same configuration as the Glue job, and what is the best practice for keeping their Spark configurations identical?

There is a specific component that allows you to do Data Processing with Apache Spark in Amazon SageMaker.

It is called PySparkProcessor.

It works like any other Processing Job. You can also, of course, specify your run args.


An example of specifying the memory configuration:

from sagemaker.spark.processing import PySparkProcessor

# role, bucket, input_prefix and output_prefix are assumed to be defined elsewhere
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Spark properties are passed as an EMR-style configuration list
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.driver.memory": "4g"
        },
    }
]

spark_processor.run(
    submit_app="preprocess.py",
    arguments=["s3_input_bucket", bucket,
               "s3_input_key_prefix", input_prefix,
               "s3_output_bucket", bucket,
               "s3_output_key_prefix", output_prefix],
    configuration=configuration,
)

You can display the Spark configuration of the Glue job with this piece of code:

# Run this inside the Glue job to dump its effective Spark configuration
configurations = spark.sparkContext.getConf().getAll()
for item in configurations:
    print(item)
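From there, one option is to capture that output and replay the relevant properties in the SageMaker job. A rough sketch, assuming you copied the dumped key/value pairs; the helper name and the filtering rule are hypothetical, not part of any SageMaker or Glue API:

# Hypothetical helper: build a spark-defaults configuration block from
# (key, value) pairs dumped by the Glue job, keeping only driver/executor settings.
def glue_conf_to_sagemaker_configuration(glue_conf_pairs):
    keep_prefixes = ("spark.driver.", "spark.executor.")  # illustrative filter
    properties = {
        key: value
        for key, value in glue_conf_pairs
        if key.startswith(keep_prefixes)
    }
    return [{"Classification": "spark-defaults", "Properties": properties}]

# Example usage with a few pairs copied from the Glue job's output
glue_conf_pairs = [
    ("spark.driver.memory", "4g"),
    ("spark.executor.memory", "10g"),
    ("spark.executor.cores", "4"),
]
configuration = glue_conf_to_sagemaker_configuration(glue_conf_pairs)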
