ONNX & TensorRT
ONNX (Open Neural Network Exchange) is an open, framework-neutral format for exporting trained models. TensorRT compiles ONNX models into GPU-optimised engines that can run up to 10× faster than the original PyTorch model.
The Deployment Stack
Training framework (PyTorch, TensorFlow, …) → ONNX (portable format) → ONNX Runtime or TensorRT (optimised inference) → GPU / CPU / Edge hardware
What is ONNX?
ONNX is an open standard for representing ML models as computation graphs. Once a model is exported to ONNX, it can run on any ONNX-compatible runtime — regardless of which framework trained it. It acts as a lingua franca between training frameworks and deployment targets.
- Export from: PyTorch, TensorFlow, Keras, Scikit-Learn, XGBoost, MXNet
- Run with: ONNX Runtime, TensorRT, OpenVINO, CoreML, TFLite (via conversion)
- Deploy to: NVIDIA GPU, AMD GPU, Intel CPU, ARM (Raspberry Pi, Android), browser (WebAssembly)
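Once exported, the graph itself is easy to inspect. A minimal sketch using the onnx package (assuming a model.onnx file like the one produced in the export example below):

import onnx

# Load and validate the exported computation graph
model = onnx.load("model.onnx")
onnx.checker.check_model(model)

# The model is a protobuf describing operators and tensors
print("Opset:", model.opset_import[0].version)
for node in model.graph.node[:5]:   # first few operators
    print(node.op_type, "->", list(node.output))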
Export PyTorch Model to ONNX
import torch
import torch.nn as nn

model = MyModel()  # your trained nn.Module
model.eval()

# Create dummy input (same shape as real input)
dummy_input = torch.randn(1, 3, 224, 224)  # Batch=1, RGB image, 224×224

# Export
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,        # ONNX opset version
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={           # Allow variable batch size
        "images": {0: "batch"},
        "logits": {0: "batch"},
    },
)
print("Exported to model.onnx")
ONNX Runtime Inference
import onnxruntime as ort
import numpy as np

# Load model — auto-selects GPU if available
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Inspect inputs
for inp in session.get_inputs():
    print(f"Input: {inp.name}, shape: {inp.shape}")

# Run inference
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"images": input_data})
print(f"Output shape: {outputs[0].shape}")
TensorRT — Maximum GPU Performance
TensorRT is NVIDIA's deep learning inference optimiser. It:
- Fuses layers — combines Conv + BatchNorm + ReLU into a single GPU kernel
- Quantises weights — reduces precision to FP16 automatically, or to INT8 with a calibration dataset
- Selects optimal kernels — benchmarks multiple implementations and picks fastest
- Allocates memory optimally — minimises peak memory usage
import torch
from torch2trt import torch2trt

model = MyModel().eval().cuda()
dummy = torch.randn(1, 3, 224, 224).cuda()

# Compile with TensorRT (takes 1-5 minutes first time)
model_trt = torch2trt(
    model,
    [dummy],
    fp16_mode=True,     # Use FP16 Tensor Cores
    max_batch_size=32
)

# Save compiled engine
torch.save(model_trt.state_dict(), 'model_trt.pth')

# Inference is now 3-8× faster
output = model_trt(dummy)
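To reuse the compiled engine in a later process, torch2trt ships a TRTModule wrapper that loads the saved state dict. A minimal sketch:

import torch
from torch2trt import TRTModule

# Restore the serialised engine (no 1-5 minute recompilation)
model_trt = TRTModule()
model_trt.load_state_dict(torch.load('model_trt.pth'))

output = model_trt(torch.randn(1, 3, 224, 224).cuda())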
Optimising Transformers/LLMs with ONNX
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Automatically export and optimise a HuggingFace model
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

inputs = tokenizer("This is amazing!", return_tensors="pt")
outputs = model(**inputs)
# ~2-3× faster than vanilla PyTorch on CPU!
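The exact speed-up depends on your CPU and sequence length, so it is worth measuring yourself. A rough timing sketch against the vanilla transformers model (the bench helper is purely illustrative; your numbers will vary):

import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
inputs = tokenizer("This is amazing!", return_tensors="pt")

pt_model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()
ort_model = ORTModelForSequenceClassification.from_pretrained(ckpt, export=True)

def bench(fn, n=100):
    fn()                              # warm-up
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1000  # ms per call

with torch.no_grad():
    pt_ms = bench(lambda: pt_model(**inputs))
ort_ms = bench(lambda: ort_model(**inputs))
print(f"PyTorch: {pt_ms:.1f} ms | ONNX Runtime: {ort_ms:.1f} ms")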
Frequently Asked Questions
Why doesn't TensorRT work with dynamic shapes?
TensorRT compiles kernels for specific input shapes. Dynamic shapes require compiling multiple profiles (min/opt/max shape), which increases compilation time and memory. For simplicity, fix the batch size if possible, or specify explicit shape ranges in the builder config.
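For reference, here is roughly what an explicit shape range looks like with TensorRT's Python builder API; a sketch that assumes the "images" input from the export example, and details may differ between TensorRT versions:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
# One profile: min / optimal / max shapes for the dynamic batch axis
profile = builder.create_optimization_profile()
profile.set_shape("images", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)

with open("model.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))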
How do I verify an ONNX export is correct?
Use onnx.checker.check_model(model) to validate the graph. Then compare outputs: run the same input through PyTorch and ONNX Runtime, and assert that outputs are close (within numerical precision): np.testing.assert_allclose(pytorch_out, onnx_out, rtol=1e-3, atol=1e-5).
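Put together, a minimal verification script (assuming the MyModel and model.onnx from the export example above):

import numpy as np
import torch
import onnxruntime as ort

model = MyModel().eval()
x = torch.randn(4, 3, 224, 224)

with torch.no_grad():
    pytorch_out = model(x).numpy()

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_out = session.run(None, {"images": x.numpy()})[0]

np.testing.assert_allclose(pytorch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("Outputs match within tolerance")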
Is TensorRT only for NVIDIA GPUs?
Yes. TensorRT is NVIDIA-proprietary. For AMD GPUs, use MIGraphX or ROCm. For Intel CPUs/NPUs, use OpenVINO. For Apple Silicon, use CoreML. ONNX Runtime supports all these backends, so exporting to ONNX first gives you the most portability.
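To see which backends your installed onnxruntime build actually supports, query the available execution providers:

import onnxruntime as ort

print(ort.get_available_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] on a CUDA build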