
Preprocessing data before inference in Batch Transform SageMaker

Good afternoon,

I am trying to use a recently trained model on SageMaker to do batch inference. I have the dataset converted to JSON format. According to the course I am doing, you should have four functions in "serve.py"; then the Session() and the model are created to finally feed the data with model.transformer and .transform(...). The data in the JSON file is not preprocessed, as it is meant to mock real-life data, while the same data was preprocessed (certain columns removed, one-hot encoding, and scaling) in the training step. Therefore, when passing the data, it needs to be processed to match the data used in training. These preprocessing steps were supposed to be included in the input_fn function, but when trying to do the inference I received an error:

sagemaker_containers._errors.ClientError: ValueError: X has 37 features, but ColumnTransformer is expecting 16 features as input.

The data in fact has 16 features plus the class column.

I have tried:

  • Changing strategy="MultiRecord" to "SingleRecord", but then the error states that "X has 8 features, but ColumnTransformer is expecting 16".
  • Removing the preprocessing steps from input_fn completely, but, as expected, it fails when it finds non-vectorized data.

The entire code I am using is below. Please note that it works fine with Serverless Inference.

%%writefile serve.py

import os
import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

def model_fn(model_dir):
    """Load and return the model"""
    model_file_name = "pipeline_model.joblib"
    pipeline_model = joblib.load(os.path.join(model_dir, model_file_name))
    
    return pipeline_model
      
def input_fn(request_body, request_content_type):
    """Process the input json data and return the processed data.
    You can also add any input data pre-processing in this function
    """
    if request_content_type == "application/json":
        input_object = pd.read_json(request_body, lines=True)
        # print(input_object.shape)
        # print(input_object.head())
        cat_cols = ["job", "marital", "education", "default", "housing", "loan", "month", "poutcome"]
        cont_cols = ["age", "pdays", "previous", "emp_var_rate", "cons_price_idx", "cons_conf_idx", "euribor3m", "nr_employed"]
        input_object.drop(["y"], axis=1, inplace=True)

        # One-hot encode the categorical columns
        ohe = OneHotEncoder(drop="first")
        # Scale the continuous columns
        sc = StandardScaler()

        # Column transformer to apply transformations on both categorical and continuous columns
        ct = ColumnTransformer([
            ("One Hot Encoding", ohe, cat_cols),
            ("Scaling", sc, cont_cols)
        ])

        input_object = ct.fit_transform(input_object)
        
        return input_object
    else:
        raise ValueError("Only application/json content type supported!")  

def predict_fn(input_object, pipeline_model):
    """Make predictions on processed input data"""
    predictions = pipeline_model.predict(input_object)
    pred_probs = pipeline_model.predict_proba(input_object)
    
    prediction_object = pd.DataFrame(
        {
            "prediction": predictions.tolist(),
            "pred_prob_class0": pred_probs[:, 0].tolist(),
            "pred_prob_class1": pred_probs[:, 1].tolist()
        }
    )
    
    return prediction_object

def output_fn(prediction_object, request_content_type):
    """Post process the predictions and return as json"""
    return_object = prediction_object.to_json(orient="records", lines=True)
    
    return return_object

# Create the deployment - same as Real Time Inference Code!
from sagemaker.sklearn.model import SKLearnModel
from sagemaker import Session, get_execution_role

session = Session()
bucket = session.default_bucket()

training_job_name = "rfc-pipeline-tuner-221223-1512-009-d5b7f868" # TODO: Update with best TrainingJobName from hyperparameter tuning
model_artifact = f"s3://{bucket}/{training_job_name}/output/model.tar.gz"
endpoint_name = "bank-prediction-rfc-pipeline-batch-transform"

model = SKLearnModel(
    name=endpoint_name,
    framework_version="1.0-1",
    entry_point="serve.py",
    dependencies=["requirements.txt"],
    model_data=model_artifact,
    role=get_execution_role(),
    sagemaker_session=session
)

# Create a batch transformer from the base model
output_path = f"s3://{bucket}/sagemaker/bank-prediction/test_preds"
batch_transformer = model.transformer(instance_count=1, 
                                      instance_type="ml.m5.xlarge",
                                      strategy="MultiRecord",
                                      accept="application/json",
                                      assemble_with="Line", 
                                      output_path=output_path)

%%time
# Feed the test data
test_data_path = "s3://sagemaker-eu-west-2-262713471428/sagemaker/bank-prediction1/bigtest.json" 
batch_transformer.transform(test_data_path, data_type="S3Prefix", content_type="application/json", split_type="Line")
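
One note on this call: depending on the SDK version, .transform(...) may return before the job finishes, so here is a minimal sketch of waiting for the job and reading the results back (assuming s3fs is installed so pandas can read from S3; Batch Transform writes one output file per input, named after the input file with a ".out" suffix):

import pandas as pd

batch_transformer.wait()  # safe to call either way; blocks until the Batch Transform job completes

# Each input file is written back under output_path with a ".out" suffix
preds = pd.read_json(f"{output_path}/bigtest.json.out", lines=True)
print(preds.head())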

So the issue was that one of the columns had an extra category not seen during training, and the model was not able to vectorize it.

The data is fed without processing, and the model on SageMaker performs the vectorization and the transformations necessary based on the training process, so that the input fits the trained model. There is no need to add preprocessing steps such as standardization and OneHotEncoder in the inference step, as I was trying to do. I solved my issue by removing the row containing the extra category (which was a simple "yes"), removing the processing steps from input_fn, and feeding the data.
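
For reference, a minimal sketch of the simplified input_fn after the fix (the errors="ignore" guard is my addition here, to tolerate input files that do not carry the label column):

def input_fn(request_body, request_content_type):
    """Parse the incoming JSON lines into a DataFrame.
    No manual preprocessing: the trained pipeline applies its own transformations.
    """
    if request_content_type == "application/json":
        input_object = pd.read_json(request_body, lines=True)
        # Drop the class column if present; the pipeline expects only the 16 raw features
        input_object = input_object.drop(columns=["y"], errors="ignore")
        return input_object
    raise ValueError("Only application/json content type supported!")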

As for the error in this post: it says X has 37 features when I was feeding only 16. This is because the preprocessing I had put into input_fn one-hot encoded the raw data, creating more columns (37), and then the pipeline inside the model tried to apply its own training-time transformation again, which expects the 16 raw features.
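
A quick way to confirm the preprocessing already lives inside the trained artifact is to inspect it. Assuming it is a scikit-learn Pipeline, as the file name pipeline_model.joblib suggests (the step names below are illustrative and depend on how the training pipeline was built):

import joblib

pipeline_model = joblib.load("pipeline_model.joblib")
# e.g. {'columntransformer': ColumnTransformer(...), 'classifier': RandomForestClassifier(...)}
print(pipeline_model.named_steps)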

I was using SageMaker with SKLearn models; it might be different with other models.
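
One last hedged note, in case re-training is an option: the unseen-category failure itself can be avoided by making the training-time encoder tolerant of unknown values, instead of dropping rows at inference time. This is not the course's original pipeline, just an alternative:

from sklearn.preprocessing import OneHotEncoder

# Unknown categories seen at inference time are encoded as all-zero rows instead of raising an error
ohe = OneHotEncoder(handle_unknown="ignore")

Recent scikit-learn versions also allow combining handle_unknown="ignore" with drop="first", though an unknown value then becomes indistinguishable from the dropped first category.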
