How can I use all GPUs in SageMaker real-time inference?

Question

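I deployed a model for real-time inference on a single-GPU instance, and it works fine.

Now I want to use multiple GPUs to reduce inference time. What do I need to change in my inference.py to make that work?

Here is some of my code: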

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
    logger.info("Loading first model...")
    model = Model().to(DEVICE)
    with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
        model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
    model = model.eval()
    
    logger.info("Loading second model...")
    model_2 = Model_2()
    model_2.to(DEVICE)
    checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
    model_2.load_state_dict(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
    model_2 = model_2.eval()
    
    logger.info('Done loading models')
    return {'first_model': model, 'second_model': model_2}

def input_fn(request_body, request_content_type):
    assert request_content_type=='application/json'
    url = json.loads(request_body)['url']
    save_name = json.loads(request_body)['save_name']
    logger.info(f'Image url: {url}')
    img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    w, h = img.size
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0).to(DEVICE)
    logger.info('Image ready to predict!')
    return {'tensor':input_batch, 'w':w,'h':h,'image':img, 'save_name':save_name}

def predict_fn(input_object, model):
    data = input_object['tensor']
    logger.info('Generating prediction based on the input image')
    model_1 = model['first_model']
    model_2 = model['second_model']
    d0, d1, d2, d3, d4, d5, d6 = model_1(data)
    torch.cuda.empty_cache()
    mask = torch.argmax(d0[0], axis=0).cpu().numpy()
    mask = np.where(mask==2, 255, mask)
    mask = np.where(mask==1, 128, mask)
    img = input_object['image']
    final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
    img = np.array(img)[:,:,::-1]
    final_image = np.array(final_image)
    image_dict = to_dict(img, final_image)
    final_image = model_2_process(model_2, image_dict)
    torch.cuda.empty_cache()
    
    return {"final_ouput": final_image, 'image':input_object['image'], 'save_name': input_object['save_name']}

I was thinking of using torch multiprocessing. Any suggestions?

Answer 1

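The answers that mention Torch DDP and DP are not really appropriate here: the value of those libraries lies in multi-GPU gradient descent (in particular, averaging gradients across GPUs), which does not happen at inference time. In fact, a fully optimized inference setup would not use PyTorch or TensorFlow at all, but a prediction-only runtime such as SageMaker Neo, ONNXRuntime, or NVIDIA TensorRT, to reduce memory footprint and latency.

For inference with a single model that fits on one GPU, multi-GPU instances are generally not recommended: inference is a shared-nothing task, so you can use N single-GPU instances instead, which is simpler and performs the same. Inference on a multi-GPU host is useful in two cases: (1) you do model-parallel inference (not your case), or (2) your serving logic consists of a graph of models that call each other. In the second case, keeping the models of the DAG close together can reduce latency. That seems to be your situation.

My recommendations are as follows: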

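Try NVIDIA Triton, which supports those DAG use cases well and is supported on SageMaker: https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/
If you want to do something custom, you can try assigning the two models to different CUDA device IDs in PyTorch. Because CUDA kernels run asynchronously, this may be enough to get some parallelism and a small speedup over a single GPU, provided your models can run concurrently.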
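
A minimal sketch of the second suggestion, assuming both stages are ordinary nn.Module instances. TinyNet, pick_devices, load_models, and run_pipeline are hypothetical names used only for illustration (TinyNet stands in for Model and Model_2), and the code falls back to a single device when fewer than two GPUs are visible:

```python
import torch
import torch.nn as nn


def pick_devices():
    """Return one device per model; both share a device if fewer than 2 GPUs exist."""
    gpu_count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    if gpu_count >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    if gpu_count == 1:
        return torch.device("cuda:0"), torch.device("cuda:0")
    return torch.device("cpu"), torch.device("cpu")


class TinyNet(nn.Module):
    """Hypothetical stand-in for the real Model / Model_2 classes."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=1)

    def forward(self, x):
        return self.conv(x)


def load_models():
    """Load each model onto its own device, as model_fn would."""
    dev_a, dev_b = pick_devices()
    return {
        "first_model": TinyNet().to(dev_a).eval(),
        "second_model": TinyNet().to(dev_b).eval(),
        "dev_a": dev_a,
        "dev_b": dev_b,
    }


@torch.no_grad()
def run_pipeline(models, batch):
    """Run stage 1 on dev_a, move its output to dev_b, then run stage 2."""
    x = batch.to(models["dev_a"])
    mid = models["first_model"](x)
    mid = mid.to(models["dev_b"])
    return models["second_model"](mid)
```

Within a single request the second stage still waits for the first, so any gain would likely come from overlapping consecutive requests and from each model having one GPU's memory to itself.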

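I have seen multiprocessing used once (with MXNet) to load-balance inference requests across GPUs (in an AWS blog post), but that was for a shared-nothing, map-style distribution of inference batches. In your case it looks like the models need to be connected to each other, so Triton is probably a better fit.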
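
The shared-nothing, map-style distribution mentioned above can be sketched as a round-robin assignment of independent requests to GPU indices. This only illustrates the idea and is not the code from the blog post; round_robin_assign is a hypothetical helper, and in a real worker pool each process would load its own copy of the model on its assigned device:

```python
from itertools import cycle


def round_robin_assign(requests, gpu_count):
    """Pair each independent request with a GPU index, cycling through all GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    gpu_ids = cycle(range(gpu_count))
    return [(request, next(gpu_ids)) for request in requests]
```

This works only because the requests do not depend on each other, which is not the case when one model's output feeds the next.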

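Finally, if your goal is to reduce latency, here are some other ideas:

Fix any CPU bottleneck. Your code seems to do a lot of work on the CPU (preprocessing, numpy, ...). Are you sure the GPU is the bottleneck? If CPU utilization is above 80%, try a large single-GPU G5 instance such as G5.16xlarge. They are a very good fit for computer vision inference.
Use a better GPU. If you are on P2, P3, or G4dn, switch to G5.
Optimize the code. Depending on the bottleneck, there are two things to try:
1. If you run inference in Torch, avoid doing algebra in Numpy where you can, and operate on torch tensors on the GPU as much as possible.
2. If the GPU is the bottleneck, try replacing PyTorch with ONNXRuntime or NVIDIA TensorRT.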
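
As an illustration of the first optimization, the two np.where calls in the question's predict_fn could stay on the GPU. This is a sketch, not a drop-in replacement: logits_to_mask is a hypothetical helper, and it assumes d0[0] has shape (C, H, W) with class ids below 256:

```python
import torch


def logits_to_mask(logits: torch.Tensor) -> torch.Tensor:
    """Turn per-class scores of shape (C, H, W) into a uint8 mask on logits.device."""
    idx = torch.argmax(logits, dim=0)
    # Same remapping as the numpy version: class 2 -> 255, then class 1 -> 128.
    mask = torch.where(idx == 2, torch.full_like(idx, 255), idx)
    mask = torch.where(mask == 1, torch.full_like(mask, 128), mask)
    return mask.to(torch.uint8)
```

The result would then be copied to the host once, for example with logits_to_mask(d0[0]).cpu().numpy(), just before the PIL resize.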

Answer 2

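You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").

At a minimum, you will typically pass the following three arguments when calling the function:

module (Module) – the module to be parallelized (your model).

device_ids (list of int or torch.device) – CUDA devices.

For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None.

For multi-device modules and CPU modules, device_ids must be None. When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)

output_device (int or torch.device) – device location of the output for single-device CUDA modules.

For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)

For example: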

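from torch.nn.parallel import DistributedDataParallel
# i is the index of the CUDA device used by this process; the process group
# must already be initialized with torch.distributed.init_process_group().
model = DistributedDataParallel(model, device_ids=[i], output_device=i)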
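
For inference in a single process, nn.DataParallel is the simpler of the two wrappers, since DistributedDataParallel needs an initialized process group. A minimal sketch, where wrap_for_inference is a hypothetical helper and nn.Linear stands in for the real model:

```python
import torch
import torch.nn as nn


def wrap_for_inference(model: nn.Module) -> nn.Module:
    """Wrap model in nn.DataParallel if more than one GPU is visible; else leave it as is."""
    gpu_count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    if gpu_count > 1:
        model = nn.DataParallel(model.to("cuda:0"), device_ids=list(range(gpu_count)))
    return model.eval()


net = wrap_for_inference(nn.Linear(4, 2))  # nn.Linear is a stand-in model
with torch.no_grad():
    # The input starts on the same device as the wrapped module's parameters.
    device = next(net.parameters()).device
    out = net(torch.zeros(8, 4, device=device))
```

DataParallel splits the input along the batch dimension, so a request carrying a single image (batch size 1, as in the question's input_fn) would not be spread across GPUs.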
