
Sagemaker Optimize Batch Transform Time for Built In Algorithm

I've got an XGBoost model trained via a SageMaker hyperparameter tuning job. Now I want to generate predictions for about 182 GB of CSV files. I've been testing different combinations of instance types, instance counts, MaxPayloadInMB, and MaxConcurrentTransforms, but haven't been able to get the job to run faster than about 30 minutes. Am I missing anything that would speed this up? Here is my current boto3 call:

import boto3

client = boto3.client("sagemaker")  # create_transform_job is on the SageMaker client

response = client.create_transform_job(
  TransformJobName=transform_name,
  ModelName=model_name,
  BatchStrategy='MultiRecord',
  TransformInput={
    "DataSource": {
      "S3DataSource":{
        "S3DataType": "S3Prefix",
        "S3Uri": f"s3://{bucket}/{prefix}/csv_prediction"
      }
    },
    "ContentType": "text/csv",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  MaxPayloadInMB=1,
  MaxConcurrentTransforms=100,
  DataProcessing={
    "InputFilter": "$[1:]",  # Use all columns except first (containing ID)
    "JoinSource": "Input",
    "OutputFilter": "$[0,-1]"  # Return ID and Prediction only 
  },
  TransformOutput={
    "S3OutputPath": f"s3://{bucket}/{prefix}/batch_transform_results/{model_name}",
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  TransformResources={
    "InstanceType": "ml.c5.xlarge",
    "InstanceCount": 16
  }
)

When you use an instance type with more CPU cores, you can generally increase MaxConcurrentTransforms, which controls the number of concurrent /invocations requests in flight to the model server at any given time. The rule of thumb is to set MaxConcurrentTransforms equal to the number of cores, although it takes some empirical testing to find out whether your particular model implementation can keep up with a faster request rate without breaking. Model servers generally do match this rule of thumb, setting the number of webserver workers equal to the number of cores.
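As a concrete illustration, here's a minimal sketch of deriving MaxConcurrentTransforms from the instance's vCPU count. The vCPU numbers below are the published EC2 specs for the c5 family; verify them for whichever type you actually pick.

# Rule of thumb: one in-flight /invocations request per vCPU.
C5_VCPUS = {
    "ml.c5.xlarge": 4,
    "ml.c5.2xlarge": 8,
    "ml.c5.4xlarge": 16,
    "ml.c5.9xlarge": 36,
    "ml.c5.18xlarge": 72,
}

def concurrency_for(instance_type: str) -> int:
    """Starting point for MaxConcurrentTransforms; tune empirically from here."""
    return C5_VCPUS[instance_type]

print(concurrency_for("ml.c5.4xlarge"))  # -> 16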

There may also be room to tune BatchStrategy and MaxPayloadInMB for better throughput. For example, passing larger multi-record payloads lets the model complete the same amount of work with fewer total requests, reducing any overhead that builds up from frequent HTTP communication. Again, it depends on how large a request payload the model server can handle, which may in turn depend on how much memory is needed and available on the given instance type.
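Since the right payload size is empirical, one way to find it is a small sweep: run the same transform with a few MaxPayloadInMB values and compare durations via describe_transform_job. In this sketch, run_transform is a hypothetical wrapper around the create_transform_job call from the question that blocks until the job finishes; it is not a boto3 API.

def job_duration_seconds(job_name: str) -> float:
    # TransformStartTime/TransformEndTime are datetimes in the DescribeTransformJob response
    desc = client.describe_transform_job(TransformJobName=job_name)
    return (desc["TransformEndTime"] - desc["TransformStartTime"]).total_seconds()

for payload_mb in (1, 6, 32, 64):  # MaxPayloadInMB must stay <= 100
    job_name = f"{transform_name}-payload-{payload_mb}"
    run_transform(job_name, max_payload_mb=payload_mb)  # hypothetical wrapper, waits for completion
    print(payload_mb, "MB ->", job_duration_seconds(job_name), "s")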

Sometimes using a larger instance is not only faster but also more cost-effective: if the job finishes much sooner, the overall cost may be lower even though the instance is more expensive per hour.
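For instance, with made-up hourly prices (substitute current SageMaker pricing for your region), the arithmetic might look like this:

def job_cost(price_per_hour: float, instance_count: int, minutes: float) -> float:
    return price_per_hour * instance_count * minutes / 60

# Hypothetical: 16 x ml.c5.xlarge at $0.20/hr for 30 min vs 4 x ml.c5.4xlarge at $0.80/hr for 10 min
print(job_cost(0.20, 16, 30))  # 1.60
print(job_cost(0.80, 4, 10))   # ~0.53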

With that said, have you considered using something larger than an xlarge? That's the third-smallest compute-optimized instance type. You can go all the way up to 24xlarge with the c5 instance type, with five other sizes in between. Plus, there's a newer generation of Graviton-based instances, c6g.

However, XGBoost is a memory-bound, not compute-bound, algorithm. So a general-purpose instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5).
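For example, here's a sketch of swapping the TransformResources block from the question over to general-purpose m5 instances; the size and count are illustrative, not tested:

transform_resources = {
    "InstanceType": "ml.m5.4xlarge",  # m5 offers more memory per vCPU than c5
    "InstanceCount": 8,
}
# pass as TransformResources=transform_resources in create_transform_job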

Have you tried AWS's built-in algorithm for XGBoost, which has some optimizations for the environment? For XGBoost, the docs say that "[the built-in] implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics than the original versions."
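If you want to try it, here's a minimal sketch of resolving the built-in XGBoost container with the SageMaker Python SDK; the version string is illustrative:

import sagemaker
from sagemaker import image_uris

# Resolve the built-in XGBoost image for the current region
image_uri = image_uris.retrieve(
    framework="xgboost",
    region=sagemaker.Session().boto_region_name,
    version="1.5-1",  # pick a currently supported version
)
print(image_uri)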

Finally (and this may be the solution in combination with using the built-in algorithm), have you checked AWS's "EC2 Instance Recommendation for the XGBoost Algorithm"? Here's an excerpt (with my emphasis):

SageMaker XGBoost version 1.2 or later supports single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. SageMaker XGBoost version 1.2 or later supports P2 and P3 instances.

SageMaker XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.

To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3) and set the tree_method hyperparameter to gpu_hist in your existing XGBoost script. SageMaker XGBoost currently does not support multi-GPU training.
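Putting that excerpt into code, here's a hedged sketch of GPU training with the built-in algorithm via the SageMaker Python SDK; the role ARN, S3 paths, and hyperparameters are placeholders:

from sagemaker.estimator import Estimator

xgb = Estimator(
    image_uri=image_uri,  # built-in XGBoost image, e.g. from the sketch above
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,  # single instance: multi-GPU training is not supported
    instance_type="ml.p3.2xlarge",  # GPU instance per the recommendation
    output_path=f"s3://{bucket}/{prefix}/model",
)
xgb.set_hyperparameters(tree_method="gpu_hist", num_round=100)  # gpu_hist enables GPU training
# xgb.fit({"train": f"s3://{bucket}/{prefix}/train"})  # hypothetical training channel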
