
Which one to use to process data for a SageMaker batch inferencing pipeline - SKLearnEstimator or SKLearnProcessor?

I'm building a SageMaker batch inferencing pipeline and am confused about the options for processing features (before inferencing): sagemaker.sklearn.processing.SKLearnProcessor versus sagemaker.sklearn.estimator.SKLearn. My understanding of these two options is:

There are AWS docs that use sagemaker.sklearn.estimator.SKLearn to do a batch transformation that processes the data. The pro of using this class and its .create_model() method is that I can incorporate the created model (which processes the features before inferencing) into a sagemaker.pipeline.PipelineModel deployed on an endpoint, so the whole pipeline sits behind a single endpoint that is called when an inference request comes in. This is detailed in: https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html I don't know the specific cons, and that's my first question (1).
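For concreteness, here is a minimal sketch of that option, loosely following the linked example. The role ARN, bucket paths, the `preprocessing.py` featurizer script, and the choice of Linear Learner are all placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.image_uris import retrieve
from sagemaker.inputs import TrainingInput
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# "Train" the Scikit-learn preprocessor: the entry point fits the transformer and
# saves it as a model artifact that the SKLearn serving container later loads.
sklearn_preprocessor = SKLearn(
    entry_point="preprocessing.py",          # placeholder featurizer script
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    role=role,
    sagemaker_session=session,
)
sklearn_preprocessor.fit({"train": "s3://my-bucket/raw/train.csv"})

# Train the actual model (Linear Learner used here purely as an example).
ll_estimator = Estimator(
    image_uri=retrieve("linear-learner", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
ll_estimator.set_hyperparameters(            # adjust to your data
    predictor_type="regressor", feature_dim=10, mini_batch_size=100
)
ll_estimator.fit({"train": TrainingInput("s3://my-bucket/processed/train.csv",
                                         content_type="text/csv")})

# Chain both models so a single endpoint preprocesses and then predicts.
pipeline_model = PipelineModel(
    name="preprocess-then-predict",
    role=role,
    models=[sklearn_preprocessor.create_model(), ll_estimator.create_model()],
)
pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="preprocess-then-predict",
)
```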

However, if it's only for data processing, I can also use sagemaker.sklearn.processing.SKLearnProcessor to create SageMaker Processing jobs that process the features and dump them to S3 for the model to batch inference on. The pro, to me, is that it makes more sense to use a job that is designed for processing; the con is that I apparently have to write the orchestration that pipelines the processing and inferencing myself, unlike with sagemaker.sklearn.estimator.SKLearn. https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html So my next question (2) is: is there a way to involve SKLearnProcessor in a sagemaker.pipeline.PipelineModel? If not, the follow-up question (3) is: if SKLearnProcessor is not designed for use in inferencing, what is its use case?
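A minimal sketch of this option, assuming a placeholder role and bucket, and a hypothetical `preprocessing.py` script that reads from `/opt/ml/processing/input` and writes transformed features to `/opt/ml/processing/output`:

```python
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=sagemaker.Session(),
)

# preprocessing.py (placeholder) loads the raw data from the input path,
# transforms the features, and writes the result to the output path.
sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw/",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/processed/",
    )],
)
```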

The final question (4) is: from an efficiency perspective, what are the pros and cons of each method in a SageMaker batch inferencing pipeline?

  1. SageMaker Inference Pipeline is a functionality of SageMaker hosting whereby you can create a serial inference pipeline (a chain of containers) on an endpoint and/or in a Batch Transform Job.

With regard to the link you shared, a common pattern is to use two containers, where one container hosts the Scikit-learn model that acts as the pre-processing step before passing the request on to the second container, which hosts the model on an endpoint or in a Batch Transform Job.

  2. The SKLearnProcessor is used to kick off an SKLearn Processing Job. You can use the SKLearnProcessor with a processing script to process your data. As such, SKLearnProcessor cannot be used in a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel).

  3. As stated above, SKLearnProcessor is designed to kick off a SageMaker Processing Job that makes use of the Scikit-learn container and can be used for data pre- or post-processing and model evaluation workloads. Kindly see this link for more information.

  4. Are you trying to decide whether to process your data with SKLearnProcessor (a Processing Job) or make use of a PipelineModel that contains a preprocessing step in a Batch Transform Job?

If so, the decision depends on your use case. If you were to use a Processing Job (SKLearnProcessor), then that job would need to be kicked off before the Batch Transform Job. Once the Processing Job has completed, you can kick off the Batch Transform Job with the output of the Processing Job as its input.
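A rough orchestration sketch of that ordering, reusing the hypothetical sklearn_processor and ll_estimator objects from the sketches above (bucket paths are placeholders):

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput

processed_prefix = "s3://my-bucket/processed/"

# 1) Run the Processing Job and block until it finishes.
sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination=processed_prefix)],
    wait=True,
)

# 2) Feed the Processing Job's output prefix to the Batch Transform Job.
transformer = ll_estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/predictions/",
)
transformer.transform(data=processed_prefix, content_type="text/csv", split_type="Line")
transformer.wait()
```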

On the other hand, if you were to use a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel), then you would just need to make sure the first container preprocesses the request so that it is compliant with what the model expects. This option entails the processing being done on a per-request basis within the Batch Transform Job itself.
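A corresponding sketch of this option, reusing the hypothetical pipeline_model from the question's sketch; here the Scikit-learn container featurizes each record inside the Batch Transform Job itself, so the transformer can be pointed at the raw data (paths are placeholders):

```python
transformer = pipeline_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/predictions/",   # placeholder output location
)
transformer.transform(
    data="s3://my-bucket/raw/",   # unprocessed input; container 1 featurizes, container 2 predicts
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```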
