Delay before a sagemaker.sklearn.processing.SKLearnProcessor.run job starts executing in SageMaker

I use SageMaker's SKLearnProcessor.run to execute my training job. Between the time the processing job starts and the time the first line of code in my processing.py file runs, there is a delay of 4-5 minutes. Once the script starts executing, the job completes quickly regardless of how large the input file is, as expected from SageMaker's processing capabilities.

My question is: can I somehow reduce the time it takes to start executing my processing.py file?

'''
import os
from sagemaker.processing import ProcessingInput, ProcessingOutput

# sklearn_job is the SKLearnProcessor created earlier; bucket and the
# *_prefix variables are defined elsewhere.
sklearn_job.run(
    code=os.path.join('s3://', bucket, code_prefix, 'preprocessing_v2.py'),
    inputs=[
        ProcessingInput(
            input_name='raw1',
            source=os.path.join('s3://', bucket, input_prefix, 'file1.csv'),
            destination='/opt/ml/processing/input1'),
        ProcessingInput(
            input_name='raw2',
            source=os.path.join('s3://', bucket, input_prefix, 'file2.csv'),
            destination='/opt/ml/processing/input2'),
    ],
    outputs=[
        ProcessingOutput(
            output_name='sample_file',
            source='/opt/ml/processing/dataset',
            destination=os.path.join('s3://', bucket, output_prefix)),
    ],
    arguments=["--train_size", "0.8", "--test_size", "0.2"],
    wait=True,
    logs=True,
)
'''

Thanks for posting. You can reduce the job duration by:

  1. Using lightweight Docker images (not an option if you use a managed image)
  2. Using small datasets to minimize download times

Beyond that, you will still face a "cold start" of a few minutes while SageMaker launches and configures the compute cluster (with the SKLearn Estimator on CPU instances, I found it is rarely more than 1-2 minutes).

This "cold start" is a symptom of a good feature of the service, which is the transient nature of compute clusters: every job execution runs on a new EC2 cluster (1 or N machines based on your config). This is good for security, scalability and fault tolerance.
