How do I use all GPUs in SageMaker real-time inference?

Question

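I have deployed a model for real-time inference on a single-GPU instance, and it works fine.

Now I want to use multiple GPUs to reduce inference time. What do I need to change in my inference.py to make that work?

Here is some of my code: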

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
    logger.info("Loading first model...")
    model = Model().to(DEVICE)
    with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
        model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
    model = model.eval()
    
    logger.info("Loading second model...")
    model_2 = Model_2()
    model_2.to(DEVICE)
    checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
    model_2.load_state_dict(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
    model_2 = model_2.eval()
    
    logger.info('Done loading models')
    return {'first_model': model, 'second_model': model_2}

def input_fn(request_body, request_content_type):
    assert request_content_type=='application/json'
    url = json.loads(request_body)['url']
    save_name = json.loads(request_body)['save_name']
    logger.info(f'Image url: {url}')
    img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    w, h = img.size
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0).to(DEVICE)
    logger.info('Image ready to predict!')
    return {'tensor':input_batch, 'w':w,'h':h,'image':img, 'save_name':save_name}

def predict_fn(input_object, model):
    data = input_object['tensor']
    logger.info('Generating prediction based on the input image')
    model_1 = model['first_model']
    model_2 = model['second_model']
    d0, d1, d2, d3, d4, d5, d6 = model_1(data)
    torch.cuda.empty_cache()
    mask = torch.argmax(d0[0], axis=0).cpu().numpy()
    mask = np.where(mask==2, 255, mask)
    mask = np.where(mask==1, 128, mask)
    img = input_object['image']
    final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
    img = np.array(img)[:,:,::-1]
    final_image = np.array(final_image)
    image_dict = to_dict(img, final_image)
    final_image = model_2_process(model_2, image_dict)
    torch.cuda.empty_cache()
    
    return {"final_ouput": final_image, 'image':input_object['image'], 'save_name': input_object['save_name']}

I was thinking of maybe using torch multiprocessing. Any suggestions?

Answer 1

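The answers mentioning Torch DDP and DP are not quite appropriate, because the value of those libraries lies in multi-GPU gradient descent (in particular, averaging gradients across GPUs), which, as mentioned in 1., does not happen at inference time. In fact, a well-optimized inference setup ideally does not use PyTorch or TensorFlow at all, but a prediction-only optimized runtime such as SageMaker Neo, ONNXRuntime, or NVIDIA TensorRT, to reduce memory footprint and latency.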

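To run inference on a single model that fits on one GPU, a multi-GPU instance is generally not recommended: inference is a share-nothing task, so you can use N single-GPU instances instead, which is simpler and performs the same. Inference on a multi-GPU host is useful in two cases: (1) if you do model-parallel inference (not your case), or (2) if your serving pipeline consists of a graph of models that call each other. In the second case, keeping the models called in the DAG close together can reduce latency. That seems to be your situation.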

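My recommendations are as follows:

Try NVIDIA Triton, which supports those DAG use cases well and is supported on SageMaker: https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/
If you want to do something custom, you can try assigning the two models to different CUDA device IDs in PyTorch. Because CUDA kernels run asynchronously, if your models can run in parallel, this may be enough to get some parallelism and a bit of speedup compared with 1 GPU.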
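
A minimal sketch of the custom option above, assuming both networks are ordinary `torch.nn.Module` objects. The helpers `pick_devices` and `run_pipeline` and the toy `nn.Linear` models are hypothetical stand-ins, not the `Model` and `Model_2` from the question, and the code falls back to a shared device when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


def pick_devices():
    """Return one device per model: two GPUs if available, else a shared one."""
    n_gpus = torch.cuda.device_count()
    if n_gpus >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    if n_gpus == 1:
        return torch.device("cuda:0"), torch.device("cuda:0")
    return torch.device("cpu"), torch.device("cpu")


def run_pipeline(model_1, model_2, batch, dev_1, dev_2):
    """Run model_1 on dev_1, then pass its output to model_2 on dev_2."""
    with torch.no_grad():  # inference only, no autograd bookkeeping
        hidden = model_1(batch.to(dev_1))
        # Move the intermediate tensor to the second model's device.
        return model_2(hidden.to(dev_2))


dev_1, dev_2 = pick_devices()
# Toy stand-ins for the two real networks (assumption for illustration).
model_1 = nn.Linear(8, 4).to(dev_1).eval()
model_2 = nn.Linear(4, 2).to(dev_2).eval()
out = run_pipeline(model_1, model_2, torch.randn(1, 8), dev_1, dev_2)
```

Note that when the second model consumes the first model's output, as in the question, a single request still runs the two stages one after the other. Any overlap would come from concurrent requests or from stages that do not depend on each other.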

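I have seen multiprocessing used once (with MXNet) to load-balance inference requests across GPUs (in this AWS blog post), but it was for a share-nothing, map-style distribution of inference batches. In your case it seems you need a connection between the models, so Triton is probably more appropriate.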
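
For illustration only, here is a simplified single-process sketch of the share-nothing idea: one model replica per GPU, with requests dispatched round-robin. It is not the multiprocessing setup from the blog post. `make_replicas` and `predict` are hypothetical names, the `nn.Linear` model is a toy stand-in, and the code uses a single CPU replica when no GPU is visible.

```python
import copy
import itertools

import torch
import torch.nn as nn


def make_replicas(model):
    """Copy the model onto every visible GPU (or keep one CPU copy)."""
    n_gpus = torch.cuda.device_count()
    devices = [torch.device(f"cuda:{i}") for i in range(n_gpus)]
    if not devices:
        devices = [torch.device("cpu")]
    return [(copy.deepcopy(model).to(d).eval(), d) for d in devices]


replicas = make_replicas(nn.Linear(8, 2))  # toy model (assumption)
next_replica = itertools.cycle(replicas)   # round-robin dispatcher


def predict(batch):
    """Send the batch to the next replica and return the result on the CPU."""
    model, device = next(next_replica)
    with torch.no_grad():
        return model(batch.to(device)).cpu()


out = predict(torch.randn(4, 8))
```

This dispatcher is not thread-safe and does nothing to speed up an individual request. It only spreads independent requests across devices.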

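Finally, if your goal is to reduce latency, here are some other ideas:

Fix any CPU bottleneck. Your code seems to do a lot of CPU work (preprocessing, NumPy, ...). Are you sure the GPU is the bottleneck? If CPU utilization is above 80%, try a large single-GPU G5 instance such as G5.16xlarge. They are excellent for computer vision inference.
Use a better GPU. If you are on P2, P3, or G4dn, switch to G5.
Optimize the code. Depending on the bottleneck, there are two things to try:
1. If you run inference in Torch, avoid doing algebra in NumPy and do as much of it as possible with torch tensors on the GPU.
2. If the GPU is the bottleneck, try replacing PyTorch with ONNXRuntime or NVIDIA TensorRT.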
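
As a sketch of point 1, the two `np.where` calls in the question's `predict_fn` could be replaced by a lookup table applied on the same device as the logits. This assumes the first model predicts exactly three classes (0, 1, 2); `remap_mask` is a hypothetical helper name.

```python
import torch


def remap_mask(logits):
    """Map class ids {0, 1, 2} to gray levels {0, 128, 255} on the logits' device."""
    mask = torch.argmax(logits, dim=0)  # (H, W) tensor of class ids
    # Lookup table indexed by class id; assumes exactly three classes.
    lut = torch.tensor([0, 128, 255], dtype=torch.uint8, device=logits.device)
    return lut[mask]


# Tiny example: 3 classes, 1x2 "image".
logits = torch.tensor([[[0.9, 0.1]], [[0.0, 0.2]], [[0.1, 0.7]]])
mask = remap_mask(logits)  # class 0 at the first pixel, class 2 at the second
```

The result only needs to move to the CPU once, at the point where a PIL image or NumPy array is actually required.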

Answer 2

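You have to use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").

You have to call the function with at least the following three parameters:

module (Module): the module to be parallelized (your model).

device_ids (list of python:int or torch.device): the CUDA devices.

For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None.

For multi-device modules and CPU modules, device_ids must be None. When device_ids is None in either case, both the input data for the forward pass and the module itself must be placed on the correct device. (default: None)

output_device (int or torch.device): the device location of the output for single-device CUDA modules.

For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)

For example: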

from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model, device_ids=[i], output_device=i)
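
DistributedDataParallel also requires an initialized process group (`torch.distributed.init_process_group`) and is normally run with one process per GPU. As a lighter, hedged sketch for inference, the single-process `torch.nn.DataParallel` wrapper can be used as below. The `nn.Linear` model is a toy stand-in, and the code runs unwrapped when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2).eval()  # toy stand-in for the real model (assumption)

if torch.cuda.device_count() > 1:
    # Replicates the module on every visible GPU and splits each batch across them.
    model = nn.DataParallel(model.to("cuda:0"))
elif torch.cuda.is_available():
    model = model.to("cuda:0")

device = next(model.parameters()).device
with torch.no_grad():
    out = model(torch.randn(16, 8).to(device))  # batch of 16 inputs
```

DataParallel splits the input along the batch dimension, so a request that carries a single image (a batch of size 1, as in the question's `input_fn`) gains nothing from the extra GPUs.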
