ONNX & TensorRT
ONNX (Open Neural Network Exchange) is an open, framework-neutral format for exporting trained models. TensorRT compiles ONNX models into GPU-optimised engines that can run up to 10× faster than the original PyTorch model.
The Deployment Stack
Training framework (PyTorch, TensorFlow, …) → ONNX (portable format) → ONNX Runtime or TensorRT (optimised inference) → GPU / CPU / Edge hardware
What is ONNX?
ONNX is an open standard for representing ML models as computation graphs. Once a model is exported to ONNX, it can run on any ONNX-compatible runtime — regardless of which framework trained it. It acts as a lingua franca between training frameworks and deployment targets.
- Export from: PyTorch, TensorFlow, Keras, Scikit-Learn, XGBoost, MXNet
- Run with: ONNX Runtime, TensorRT, OpenVINO, CoreML, TFLite (via conversion)
- Deploy to: NVIDIA GPU, AMD GPU, Intel CPU, ARM (Raspberry Pi, Android), browser (WebAssembly)
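Once exported, the graph itself is easy to inspect. A minimal sketch using the onnx package (assuming a model.onnx file like the one produced in the export example below):

import onnx

# Load and validate the exported computation graph
model = onnx.load("model.onnx")
onnx.checker.check_model(model)

# The model is a protobuf describing operators and tensors
print("Opset:", model.opset_import[0].version)
for node in model.graph.node[:5]:   # first few operators
    print(node.op_type, "->", list(node.output))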
Export PyTorch Model to ONNX
import torch
import torch.nn as nn

model = MyModel()  # your trained nn.Module
model.eval()

# Create dummy input (same shape as real input)
dummy_input = torch.randn(1, 3, 224, 224)  # Batch=1, RGB image, 224×224

# Export
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,        # ONNX opset version
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={           # Allow variable batch size
        "images": {0: "batch"},
        "logits": {0: "batch"},
    },
)
print("Exported to model.onnx")
ONNX Runtime Inference
import onnxruntime as ort
import numpy as np

# Load model — auto-selects GPU if available
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Inspect inputs
for inp in session.get_inputs():
    print(f"Input: {inp.name}, shape: {inp.shape}")

# Run inference
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"images": input_data})
print(f"Output shape: {outputs[0].shape}")
TensorRT — Maximum GPU Performance
TensorRT is NVIDIA's deep learning inference optimiser. It:
- Fuses layers — combines Conv + BatchNorm + ReLU into a single GPU kernel
- Quantises weights — reduces precision to FP16 automatically, or to INT8 with a calibration dataset
- Selects optimal kernels — benchmarks multiple implementations and picks fastest
- Allocates memory optimally — minimises peak memory usage
import torch
from torch2trt import torch2trt

model = MyModel().eval().cuda()
dummy = torch.randn(1, 3, 224, 224).cuda()

# Compile with TensorRT (takes 1-5 minutes first time)
model_trt = torch2trt(
    model,
    [dummy],
    fp16_mode=True,     # Use FP16 Tensor Cores
    max_batch_size=32
)

# Save compiled engine
torch.save(model_trt.state_dict(), 'model_trt.pth')

# Inference is now 3-8× faster
output = model_trt(dummy)
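To reuse the compiled engine in a later process, torch2trt ships a TRTModule wrapper that loads the saved state dict. A minimal sketch:

import torch
from torch2trt import TRTModule

# Restore the serialised engine (no 1-5 minute recompilation)
model_trt = TRTModule()
model_trt.load_state_dict(torch.load('model_trt.pth'))

output = model_trt(torch.randn(1, 3, 224, 224).cuda())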
Optimising Transformers/LLMs with ONNX
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Automatically export and optimise a HuggingFace model
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

inputs = tokenizer("This is amazing!", return_tensors="pt")
outputs = model(**inputs)
# ~2-3× faster than vanilla PyTorch on CPU!
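The exact speed-up depends on your CPU and sequence length, so it is worth measuring yourself. A rough timing sketch against the vanilla transformers model (the bench helper is purely illustrative; your numbers will vary):

import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
inputs = tokenizer("This is amazing!", return_tensors="pt")

pt_model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()
ort_model = ORTModelForSequenceClassification.from_pretrained(ckpt, export=True)

def bench(fn, n=100):
    fn()                              # warm-up
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1000  # ms per call

with torch.no_grad():
    pt_ms = bench(lambda: pt_model(**inputs))
ort_ms = bench(lambda: ort_model(**inputs))
print(f"PyTorch: {pt_ms:.1f} ms | ONNX Runtime: {ort_ms:.1f} ms")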
Frequently Asked Questions
Why doesn't TensorRT work with dynamic shapes?
TensorRT compiles kernels for specific input shapes. Dynamic shapes require compiling multiple profiles (min/opt/max shape), which increases compilation time and memory. For simplicity, fix the batch size if possible, or specify explicit shape ranges in the builder config.
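For reference, here is roughly what an explicit shape range looks like with TensorRT's Python builder API; a sketch that assumes the "images" input from the export example, and details may differ between TensorRT versions:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
# One profile: min / optimal / max shapes for the dynamic batch axis
profile = builder.create_optimization_profile()
profile.set_shape("images", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)

with open("model.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))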
How do I verify an ONNX export is correct?
Use onnx.checker.check_model(model) to validate the graph. Then compare outputs: run the same input through PyTorch and ONNX Runtime, and assert that outputs are close (within numerical precision): np.testing.assert_allclose(pytorch_out, onnx_out, rtol=1e-3, atol=1e-5).
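Put together, a minimal verification script (assuming the MyModel and model.onnx from the export example above):

import numpy as np
import torch
import onnxruntime as ort

model = MyModel().eval()
x = torch.randn(4, 3, 224, 224)

with torch.no_grad():
    pytorch_out = model(x).numpy()

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_out = session.run(None, {"images": x.numpy()})[0]

np.testing.assert_allclose(pytorch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("Outputs match within tolerance")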
Is TensorRT only for NVIDIA GPUs?
Yes. TensorRT is NVIDIA-proprietary. For AMD GPUs, use MIGraphX or ROCm. For Intel CPUs/NPUs, use OpenVINO. For Apple Silicon, use CoreML. ONNX Runtime supports all these backends, so exporting to ONNX first gives you the most portability.
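To see which backends your installed onnxruntime build actually supports, query the available execution providers:

import onnxruntime as ort

print(ort.get_available_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] on a CUDA build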