What is the equivalent Glue Spark configuration to use SageMaker Processing?
I'm trying to migrate custom Glue PySpark jobs to SageMaker Processing to benefit from the MLOps features provided by SageMaker Pipelines.
Glue doesn't expose spark-submit parameters (e.g., --conf spark.executor.memory) on its dashboard, so how can I check whether my SageMaker Processing job uses the same configuration as my Glue jobs, and what is the best practice for keeping their Spark configurations the same?
There is a specific component that allows you to do data processing with Apache Spark in Amazon SageMaker. It is called PySparkProcessor. It works like any other Processing Job, and of course you can also specify your run arguments.
An example of specifying a memory configuration:
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.driver.memory": "4g"
        },
    }
]

spark_processor.run(
    submit_app="preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix,
               's3_output_bucket', bucket,
               's3_output_key_prefix', output_prefix],
    configuration=configuration
)
You can display the Spark configuration of your Glue job with this piece of code:
configurations = spark.sparkContext.getConf().getAll()
for item in configurations:
    print(item)
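To keep the two environments in sync, one approach is to turn the dumped Glue key/value pairs into the `configuration` list that `PySparkProcessor.run()` expects. A minimal sketch, assuming you have captured the output of `getAll()` from the Glue job (the helper name, the whitelist of keys, and the sample values below are illustrative, not part of either API):

```python
# Sketch: convert (key, value) pairs dumped from a Glue job's
# spark.sparkContext.getConf().getAll() into a spark-defaults
# configuration block for PySparkProcessor.run().

def glue_conf_to_sagemaker(glue_conf, keys_to_copy=None):
    """Build a spark-defaults configuration list from Glue conf pairs.

    keys_to_copy: whitelist of property names to carry over. Glue also
    sets internal properties (spark.app.id, spark.master, ...) that you
    would not want to replay in SageMaker, so copying everything
    blindly is usually not what you want.
    """
    if keys_to_copy is None:
        keys_to_copy = {
            "spark.driver.memory",
            "spark.executor.memory",
            "spark.executor.cores",
            "spark.executor.instances",
        }
    properties = {k: v for k, v in glue_conf if k in keys_to_copy}
    return [{"Classification": "spark-defaults", "Properties": properties}]

# Example with made-up values as printed by the loop above:
glue_conf = [
    ("spark.executor.memory", "10g"),
    ("spark.executor.cores", "8"),
    ("spark.app.id", "application_123"),  # internal, filtered out
]
configuration = glue_conf_to_sagemaker(glue_conf)
print(configuration)
```

The resulting list can be passed directly as the `configuration` argument of `spark_processor.run()`, so both jobs run with the same resource settings.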