Running PyTorch Quantized Model on CUDA GPU
I am confused about whether it is possible to run an int8 quantized model on CUDA, or whether you can only train a quantized model on CUDA with fake quantization and then deploy it on another backend such as a CPU.
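I want to run the model on CUDA with actual int8 instructions instead of FakeQuantised float32 instructions, and enjoy the efficiency gains. The PyTorch documentation is strangely nonspecific about this. If it is possible to run a quantized model on CUDA with a different framework such as TensorFlow, I would love to know.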
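To make the distinction concrete, here is a minimal, framework-free sketch of what "fake" quantization does compared with storing real int8 values. The helper names, the signed int8 range and the `scale`/`zero_point` values are assumptions chosen for illustration; this is not PyTorch's implementation.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an integer code, clamped to the (assumed) signed int8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to the float value it represents."""
    return (q - zero_point) * scale


def fake_quantize(x, scale, zero_point):
    """Quantize and immediately dequantize.

    The result is still a float, so downstream arithmetic runs in floating
    point; it only carries the rounding and clamping error of int8.
    """
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


# Illustrative values only
scale, zero_point = 0.05, 0
x = 1.234

q = quantize(x, scale, zero_point)          # integer code an int8 kernel would operate on
x_fq = fake_quantize(x, scale, zero_point)  # float that merely simulates int8 precision
print(q, x_fq)
```

A fake-quantized model therefore has the accuracy characteristics of int8 but none of the speed or memory benefits, which is why the question asks about true int8 execution.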
This is the code that prepares my quantized model (using post-training quantization). The model is a normal CNN with nn.Conv2d, nn.LeakyReLU and nn.MaxPool modules:
import copy

import torch
from torch.quantization import quantize_fx
from torch.utils.data import DataLoader

# Load the trained float32 model and work on a copy of it
model_fp = torch.load(models_dir+net_file)
model_to_quant = copy.deepcopy(model_fp)
model_to_quant.eval()
model_to_quant = quantize_fx.fuse_fx(model_to_quant)
qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}
# Insert observers for post-training calibration
model_prepped = quantize_fx.prepare_fx(model_to_quant, qconfig_dict)
model_prepped.eval()
model_prepped.to(device='cuda:0')
# ImageDataset is my own Dataset class
train_data = ImageDataset(img_dir, train_data_csv, 'cuda:0')
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, pin_memory=True)
# Calibrate on the first two batches
for i, (input, _) in enumerate(train_loader):
    if i > 1: break
    print('batch', i+1, end='\r')
    input = input.to('cuda:0')
    model_prepped(input)
This actually quantizes the model:
model_quantised = quantize_fx.convert_fx(model_prepped)
model_quantised.eval()
This is the attempt to run the quantized model on CUDA, which raises a NotImplementedError. When I run it on the CPU it works fine:
model_quantised = model_quantised.to('cuda:0')
# Run a single batch through the quantized model
for input, _ in train_loader:
    input = input.to('cuda:0')
    out = model_quantised(input)
    print(out, out.shape)
    break
This is the error:
Traceback (most recent call last):
  File "/home/adam/Desktop/thesis/Ship Detector/quantisation.py", line 54, in <module>
    out = model_quantised(input)
  File "/home/adam/.local/lib/python3.9/site-packages/torch/fx/graph_module.py", line 513, in wrapped_call
    raise e.with_traceback(None)
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'QuantizedCUDA' backend.
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build).
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions.
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].
According to [this][1] blog, it seems you can't run a quantized model on a GPU. The blog states:
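Quantization in PyTorch is currently CPU-only. Quantization is not a CPU-specific technique (e.g. NVIDIA's TensorRT can be used to implement quantization on GPU). However, inference time on GPU is already usually "good enough", and CPUs are more attractive for large-scale model server deployment (due to complex cost factors that are out of the scope of this article). Consequently, as of PyTorch 1.6, only CPU backends are available in the native API.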
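Given that statement, and the fact that the converted model works on the CPU, one workaround is to keep calibration wherever it is convenient but always run the converted model on the CPU. The sketch below assumes exactly that. `run_quantized_on_cpu` is a hypothetical helper name, not a PyTorch API, and it does not provide int8 execution on the GPU.

```python
import torch

# List the quantized kernel backends this PyTorch build provides.
# These are CPU engines; there is no CUDA entry to select here.
print(torch.backends.quantized.supported_engines)

# The qconfig in the question targets 'qnnpack', so select the matching
# engine when this build supports it.
if 'qnnpack' in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = 'qnnpack'


def run_quantized_on_cpu(model_quantised, batch):
    """Hypothetical helper: run a converted model on the CPU.

    The batch is moved to the CPU first, so it does not matter which
    device it was loaded on.
    """
    model_quantised = model_quantised.to('cpu')
    with torch.no_grad():
        return model_quantised(batch.to('cpu'))
```

With the names from the question, the call would be `out = run_quantized_on_cpu(model_quantised, input)`.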