Preprocessing data before inference in SageMaker Batch Transform

Question

Good afternoon,

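I am trying to run batch inference on SageMaker with a recently trained model. I converted the dataset to JSON format. According to the course I am following, you should have four functions in "serve.py", then create a Session() and a model, and finally feed the data with model.transformer and .transform(...). The data in the JSON file is not preprocessed, because it is meant to simulate real-life data, whereas the same data was preprocessed in the training step (some columns dropped, one-hot encoding, and a scaler). So when the data is passed in, it needs to be processed to match the data used in training. This preprocessing step was supposed to go in the input_fn function, but I get an error when trying to run inference: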

ValueError( sagemaker_containers._errors.ClientError: X has 37 features, but ColumnTransformer is expecting 16 features as input

The data actually has 16 features plus the class column.

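I have tried:

Changing strategy="MultiRecord" to SingleRecord, but then the error says "X has 8 features, but ColumnTransformer is expecting 16".
Removing the preprocessing step from input_fn entirely, but as expected it failed when it encountered non-vectorized data.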

The complete code I am using follows. Please note that it works for serverless inference.

%%writefile serve.py

import os
import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

def model_fn(model_dir):
    """Load and return the model"""
    model_file_name = "pipeline_model.joblib"
    pipeline_model = joblib.load(os.path.join(model_dir, model_file_name))
    
    return pipeline_model
      
def input_fn(request_body, request_content_type):
    """Process the input json data and return the processed data.
    You can also add any input data pre-processing in this function
    """
    if request_content_type == "application/json":
        input_object = pd.read_json(request_body, lines=True)
        # print(input_object.shape)
        # print(input_object.head())
        cat_cols = ["job", "marital", "education", "default", "housing", "loan", "month", "poutcome"]
        cont_cols = ["age", "pdays", "previous", "emp_var_rate", "cons_price_idx", "cons_conf_idx", "euribor3m", "nr_employed"]
        input_object.drop(["y"], axis=1, inplace=True)

        # One hot encode the categorical columns
        ohe = OneHotEncoder(drop="first")
        # Scale the continuous columns
        sc = StandardScaler()

        # Column transformer to apply transformations on both categorical and continuous columns
        ct = ColumnTransformer([
            ("One Hot Encoding", ohe, cat_cols),
            ("Scaling", sc, cont_cols)
        ])

        input_object = ct.fit_transform(input_object)
        
        return input_object
    else:
        raise ValueError("Only application/json content type supported!")  

def predict_fn(input_object, pipeline_model):
    """Make predictions on processed input data"""
    predictions = pipeline_model.predict(input_object)
    pred_probs = pipeline_model.predict_proba(input_object)
    
    prediction_object = pd.DataFrame(
        {
            "prediction": predictions.tolist(),
            "pred_prob_class0": pred_probs[:, 0].tolist(),
            "pred_prob_class1": pred_probs[:, 1].tolist()
        }
    )
    
    return prediction_object

def output_fn(prediction_object, request_content_type):
    """Post process the predictions and return as json"""
    return_object = prediction_object.to_json(orient="records", lines=True)
    
    return return_object

# Create the deployment - same as Real Time Inference Code!
from sagemaker.sklearn.model import SKLearnModel
from sagemaker import Session, get_execution_role

session = Session()
bucket = session.default_bucket()

training_job_name = "rfc-pipeline-tuner-221223-1512-009-d5b7f868" # TODO: Update with best TrainingJobName from hyperparameter tuning
model_artifact = f"s3://{bucket}/{training_job_name}/output/model.tar.gz"
endpoint_name = "bank-prediction-rfc-pipeline-batch-transform"

model = SKLearnModel(
    name=endpoint_name,
    framework_version="1.0-1",
    entry_point="serve.py",
    dependencies=["requirements.txt"],
    model_data=model_artifact,
    role=get_execution_role(),
    sagemaker_session = session
)

# Create a batch transformer from the base model
output_path = f"s3://{bucket}/sagemaker/bank-prediction/test_preds"
batch_transformer = model.transformer(instance_count=1, 
                                      instance_type="ml.m5.xlarge",
                                      strategy="MultiRecord",
                                      accept="application/json",
                                      assemble_with="Line", 
                                      output_path=output_path)

%%time
# Feed the test data
test_data_path = "s3://sagemaker-eu-west-2-262713471428/sagemaker/bank-prediction1/bigtest.json" 
batch_transformer.transform(test_data_path, data_type="S3Prefix", content_type="application/json", split_type="Line")

Answer 1

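So the problem was that one of the columns contained an extra category that had not been seen during training, and the model could not vectorize it.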
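One way to spot such rows before submitting a batch is to compare each categorical column against the values seen in training. The snippet below is only a sketch: the helper name `drop_unseen_categories`, the `known` mapping, and the sample values are hypothetical and are not taken from my actual dataset. If retraining is an option, scikit-learn's `OneHotEncoder(handle_unknown="ignore")` is the built-in alternative to filtering.

```python
import pandas as pd


def drop_unseen_categories(df, known_categories):
    """Keep only rows whose categorical values were all seen during training.

    known_categories maps a column name to the values seen in training
    (hypothetical helper, for illustration only).
    """
    mask = pd.Series(True, index=df.index)
    for col, seen in known_categories.items():
        mask &= df[col].isin(list(seen))
    return df[mask]


# Hypothetical training-time categories and inference batch.
known = {"default": ["no", "unknown"], "housing": ["no", "yes"]}
batch = pd.DataFrame({
    "default": ["no", "yes", "unknown"],
    "housing": ["yes", "no", "no"],
    "age": [31, 45, 52],
})

# The middle row carries a "default" value that training never saw.
clean = drop_unseen_categories(batch, known)
```

Dropping rows loses predictions for those records, so this only suits cases where the unseen category is rare.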

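The data is fed in unprocessed, and the model on SageMaker performs the vectorization and the necessary transformations according to the training process, so that the data fits the trained model. There is no need to add preprocessing steps such as standardization and OneHotEncoder at the inference step, as I was trying to do. I solved my problem by removing the rows that contained the extra category (it was simply a "yes" value), feeding in the raw data, and removing the processing steps from input_fn.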
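With the processing steps removed, input_fn only has to parse the request and drop the label column. This is a minimal sketch, assuming (as in my case) that pipeline_model.joblib is a pipeline that already contains the fitted ColumnTransformer; the sample `body` is made up for illustration.

```python
import io

import pandas as pd


def input_fn(request_body, request_content_type):
    """Parse JSON Lines into a DataFrame; preprocessing is left to the pipeline."""
    if request_content_type != "application/json":
        raise ValueError("Only application/json content type supported!")
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    input_object = pd.read_json(io.StringIO(request_body), lines=True)
    # Drop the label column if the batch file still contains it.
    return input_object.drop(columns=["y"], errors="ignore")


# Illustrative request body: two records in JSON Lines format.
body = (
    '{"age": 30, "job": "admin.", "y": "no"}\n'
    '{"age": 41, "job": "technician", "y": "yes"}'
)
df = input_fn(body, "application/json")
```

The returned DataFrame keeps the raw column names, which the ColumnTransformer inside the pipeline needs in order to select columns by name.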

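As for the error in this post, it reported 37 features when I was only feeding 16. That is because the raw data was being preprocessed twice: the processing step I had put in input_fn one-hot encoded it, which created more columns, and the ML model in SageMaker, which performs the same processing itself, then tried to do it again on data that no longer had the 16 columns it expected.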
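The changing feature counts can be reproduced locally. The sketch below uses a made-up three-row frame and a hypothetical helper `refit_width`; it refits a ColumnTransformer on whatever batch arrives, as the original input_fn did, and reports how many columns come out. With drop="first", a column with a single distinct value encodes to zero columns, which is consistent with SingleRecord yielding 8 features from my 8 categorical and 8 continuous columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def refit_width(df, cat_cols, cont_cols):
    """Fit a fresh ColumnTransformer on df and return the output column count."""
    ct = ColumnTransformer([
        ("ohe", OneHotEncoder(drop="first"), cat_cols),
        ("sc", StandardScaler(), cont_cols),
    ])
    return ct.fit_transform(df).shape[1]


# Made-up data: "job" has 3 categories, "marital" has 2.
batch = pd.DataFrame({
    "job": ["admin.", "technician", "services"],
    "marital": ["married", "single", "married"],
    "age": [30, 41, 52],
})
cat_cols, cont_cols = ["job", "marital"], ["age"]

# Whole batch: (3 - 1) + (2 - 1) one-hot columns + 1 scaled column = 4.
width_batch = refit_width(batch, cat_cols, cont_cols)
# Single record: each categorical column has one value, which is dropped,
# so only the scaled continuous column remains = 1.
width_single = refit_width(batch.iloc[[0]], cat_cols, cont_cols)
```

Because the width depends on which categories happen to appear in each batch, a transformer refitted at inference time cannot be relied on to match what the trained model expects.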

I am using SageMaker with an SKLearn model; it may be different for other models.
