Understanding GPU Architecture
Why GPU passthrough needs /dev/kfd, how OCI runtime wrappers work, and discrete vs integrated GPU differences
The Challenge: GPUs in Containers
Running GPU workloads in containers is fundamentally different from running them on bare metal. GPUs are hardware devices, not files or processes—you can't simply "copy" them into a container like you would a binary.
The core challenges:
- Device access: Containers have isolated device namespaces. GPUs live in /dev on the host.
- Driver libraries: GPU code needs vendor-specific libraries (CUDA for NVIDIA, ROCm for AMD) that must match the host driver version.
- Permissions: GPU devices typically require root or specific group membership to access.
- Container runtime: Standard OCI runtimes (runc) don't natively understand GPU passthrough.
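The device-access point can be made concrete: an OCI runtime only exposes devices that are explicitly listed in the container's config.json; nothing is inherited from the host. A minimal sketch of the relevant spec fragment (the file name is arbitrary; major/minor numbers are the well-known NVIDIA ones, but treat the exact values as illustrative):

```bash
# Every host device a container may touch must be declared in the OCI spec,
# both as a device node and as a device-cgroup allow rule.
cat > config-fragment.json <<'EOF'
{
  "linux": {
    "devices": [
      { "path": "/dev/nvidia0",   "type": "c", "major": 195, "minor": 0 },
      { "path": "/dev/nvidiactl", "type": "c", "major": 195, "minor": 255 }
    ],
    "resources": {
      "devices": [
        { "allow": false, "access": "rwm" },
        { "allow": true, "type": "c", "major": 195, "access": "rw" }
      ]
    }
  }
}
EOF
grep -c '"/dev/nvidia' config-fragment.json   # counts the declared device nodes
```

The default-deny rule (`"allow": false`) followed by a targeted allow is why a plain runc container sees no GPUs at all: the toolkits described below exist to write these entries for you.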
Solution Overview: OCI Runtime Wrapper + Vendor Toolkits
The charm implements GPU support through two complementary mechanisms:
| Component | Purpose | Installed When |
|---|---|---|
| Vendor Toolkit | Injects GPU devices and libraries into containers | compute-runtime=cuda or compute-runtime=rocm |
| OCI Runtime Wrapper | Intercepts container creation to inject folder mounts (datasets, models) | Always (for general folder mounting) |
Architecture Flow
NVIDIA GPU Architecture (CUDA)
Required Devices
For NVIDIA GPUs to work in containers, three categories of devices must be passed through:
| Device Path | Purpose | Required For |
|---|---|---|
| /dev/nvidia0, /dev/nvidia1, ... | GPU device files (one per GPU) | GPU compute operations |
| /dev/nvidiactl | NVIDIA control device | GPU initialization |
| /dev/nvidia-uvm | Unified Virtual Memory | CUDA memory management |
nvidia-container-toolkit
NVIDIA provides nvidia-container-toolkit to automate device injection. It modifies the OCI spec to:
- Mount GPU devices into the container's /dev
- Bind-mount CUDA libraries from the host (e.g., /usr/lib/x86_64-linux-gnu/libnvidia-*.so)
- Set environment variables (e.g., NVIDIA_VISIBLE_DEVICES)
- Configure ldconfig so the container can find the libraries

Note that the container image must still ship the CUDA toolkit itself (e.g., pytorch/pytorch:2.0.1-cuda11.8). The container toolkit ensures the driver libraries match, while the container supplies the CUDA toolkit libraries.
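As a sketch of the environment-variable bullet above: the toolkit's behaviour is driven by variables set on the container, of which these two are the most common (the values shown are illustrative choices, not the only valid ones):

```bash
# Which GPUs to expose: "all", an index list like "0,1", a GPU UUID, or "none"
export NVIDIA_VISIBLE_DEVICES=all
# Which driver pieces to mount; compute jobs don't need graphics/video libs
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
echo "$NVIDIA_VISIBLE_DEVICES"
```

A CI worker would typically set these in the task's container spec rather than host-wide, so different jobs can see different subsets of the GPUs.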
AMD GPU Architecture (ROCm)
The Critical /dev/kfd Device
AMD ROCm's architecture differs significantly from NVIDIA. The most important component is /dev/kfd (Kernel Fusion Driver):
| Device | Purpose | Impact if Missing |
|---|---|---|
| /dev/kfd | GPU compute interface (HSA) | ❌ No GPU compute: PyTorch/TensorFlow won't detect the GPU |
| /dev/dri/cardN | GPU device file | Monitoring only (rocm-smi works, but compute doesn't) |
| /dev/dri/renderDN | Render node (optional) | Only needed for graphics workloads |
A common mistake is to pass through /dev/dri/card* but forget /dev/kfd. This makes rocm-smi work (giving false confidence), but PyTorch/TensorFlow fail with "CUDA (ROCm) available: False" because they need KFD for compute operations.
Why /dev/kfd is Special
The Kernel Fusion Driver provides the HSA (Heterogeneous System Architecture) interface:
- Unified memory model: Allows CPU and GPU to share memory pointers
- Kernel scheduling: Manages GPU workload dispatch
- Signal handling: Coordinates async operations between CPU and GPU
Without /dev/kfd, the GPU is essentially "view only"—you can query it, but not run compute kernels on it.
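This failure mode lends itself to a preflight check. A small shell sketch (the devroot parameter exists only so the function can be exercised against a fake directory instead of real hardware):

```bash
# Classify a host's AMD GPU readiness by which device nodes are present.
check_rocm_compute() {
  local devroot="${1:-/dev}"              # pass a scratch dir for testing
  if [ -e "$devroot/kfd" ]; then
    echo "COMPUTE_OK"                     # HSA compute interface available
  elif ls "$devroot"/dri/card* >/dev/null 2>&1; then
    echo "MONITORING_ONLY"                # rocm-smi works, compute will not
  else
    echo "NO_GPU"
  fi
}

# Simulate a host where /dev/kfd was forgotten, then fixed
mkdir -p fake/dri && touch fake/dri/card0
check_rocm_compute fake                   # -> MONITORING_ONLY
touch fake/kfd
check_rocm_compute fake                   # -> COMPUTE_OK
```

Running something like this before dispatching GPU jobs turns the silent "view only" state into an explicit error.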
amd-container-toolkit
Similar to NVIDIA, AMD provides amd-container-toolkit (via the amdgpu-dkms package). It:
- Generates a CDI spec: Discovers AMD GPUs and writes /etc/cdi/amd.yaml
- Mounts ROCm libraries: Bind-mounts /opt/rocm libraries into the container
- Sets the HSA environment: Configures HSA_OVERRIDE_GFX_VERSION if needed
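For orientation, here is a hedged sketch of what a generated CDI spec can look like; the CDI version, device name, and node paths are illustrative, but note how /dev/kfd is listed alongside the per-GPU nodes:

```bash
# Write a sample CDI spec of the shape amd-container-toolkit produces,
# then confirm the critical kfd node is present.
cat > amd-cdi-sample.yaml <<'EOF'
cdiVersion: "0.6.0"
kind: amd.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/kfd
        - path: /dev/dri/card0
        - path: /dev/dri/renderD128
EOF
grep -q '/dev/kfd' amd-cdi-sample.yaml && echo "kfd listed"
```

A CDI-aware runtime reads this file and injects every listed deviceNode, which is why a missing kfd entry in /etc/cdi/amd.yaml reproduces the "monitoring only" symptom described above.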
Discrete vs Integrated GPUs
Discrete GPUs (Recommended)
Examples: NVIDIA RTX 3070/4090, AMD RX 7900 XT, Radeon Pro W6800
| Characteristic | Impact on Concourse |
|---|---|
| Dedicated VRAM | ✅ Predictable memory allocation, no contention with system RAM |
| PCI-e connection | ✅ Full bandwidth for data transfers |
| Fully supported by ROCm/CUDA | ✅ No workarounds needed |
| Better cooling | ✅ Can sustain higher workloads |
Integrated GPUs (Limited Support)
Examples: AMD Phoenix (Ryzen 7xxx laptops), Intel Iris Xe, NVIDIA Tegra
| Characteristic | Impact on Concourse |
|---|---|
| Shared system RAM | ⚠️ Memory bandwidth bottleneck, competes with system processes |
| Often unofficial GFX versions | ❌ ROCm rejects unsupported architectures (e.g., gfx1103 for Phoenix) |
| Lower power budget | ⚠️ Thermal throttling under sustained load |
| Requires workarounds | ❌ HSA_OVERRIDE_GFX_VERSION needed for AMD integrated GPUs |
The HSA_OVERRIDE_GFX_VERSION Workaround
For integrated AMD GPUs (like Phoenix/gfx1103), ROCm doesn't officially support the architecture. However, you can force ROCm to use kernels from a similar architecture:
```yaml
# In your pipeline's task config:
env:
  HSA_OVERRIDE_GFX_VERSION: "11.0.0"  # For gfx1103 (Phoenix)
run:
  path: python
  args:
    - train.py
```
Why this works: ROCm checks the GPU's GFX version and loads optimized kernels for that architecture. By overriding, you tell ROCm "pretend this is gfx1100 and use those kernels." It's suboptimal but functional.
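The gfx-to-override mapping can be sketched as a small lookup. The pairings below are illustrative, assembled from commonly reported community usage rather than any official support table, so verify against your own hardware:

```bash
# Map a reported gfx architecture to an HSA_OVERRIDE_GFX_VERSION value.
# Unsupported chips borrow kernels from the nearest supported architecture;
# officially supported chips need no override at all.
gfx_to_override() {
  case "$1" in
    gfx1103|gfx1102|gfx1101) echo "11.0.0" ;;  # borrow gfx1100 (RDNA3) kernels
    gfx1030)                 echo "10.3.0" ;;  # supported: shown for the pattern
    *)                       echo "" ;;        # no override known/needed
  esac
}

gfx_to_override gfx1103   # -> 11.0.0
```

The naming convention is mechanical: gfx1100 corresponds to version "11.0.0", which is why overriding a gfx1103 chip to "11.0.0" makes ROCm treat it as a gfx1100.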
OCI Runtime Wrapper: How Folder Mounting Works
The charm installs a custom runc-wrapper script that intercepts container creation to inject folder mounts from /srv:
Installation Process
- Backup original runc: mv /usr/bin/runc /opt/bin/runc.real
- Install wrapper: Copy the wrapper script to /opt/bin/runc-wrapper
- Symlink: ln -s /opt/bin/runc-wrapper /var/lib/concourse/bin/runc
- PATH priority: Worker config sets PATH=/opt/bin:... so the wrapper is found first
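The first three steps can be sketched as a script, parameterised with a root prefix so it can be dry-run against a scratch directory (the charm operates on "/" directly; the wrapper body here is a stub standing in for the real runc-wrapper, and the PATH step lives in worker config rather than this script):

```bash
install_wrapper() {
  local root
  root="$(cd "$1" && pwd)"                 # absolute paths for the symlink
  mkdir -p "$root/opt/bin" "$root/var/lib/concourse/bin"
  # 1. Backup the original runc
  mv "$root/usr/bin/runc" "$root/opt/bin/runc.real"
  # 2. Install the wrapper (stub body: just exec the real binary)
  printf '#!/bin/bash\nexec %s "$@"\n' "$root/opt/bin/runc.real" \
    > "$root/opt/bin/runc-wrapper"
  chmod +x "$root/opt/bin/runc-wrapper"
  # 3. Symlink into the path the worker resolves
  ln -sf "$root/opt/bin/runc-wrapper" "$root/var/lib/concourse/bin/runc"
}

mkdir -p scratch/usr/bin && touch scratch/usr/bin/runc   # pretend host layout
install_wrapper scratch
```

Keeping runc.real at a fixed path matters: the wrapper must exec a binary that is no longer named runc, or it would recurse into itself.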
Runtime Behavior
When containerd creates a container, it calls runc create --bundle /path/to/bundle. The wrapper:
```bash
#!/bin/bash
# 1. Parse the --bundle argument
BUNDLE=""
prev=""
for arg in "$@"; do
  [[ "$prev" == "--bundle" ]] && BUNDLE="$arg"
  prev="$arg"
done

# Only rewrite the spec when this invocation actually carries a bundle
if [[ -n "$BUNDLE" && -f "$BUNDLE/config.json" ]]; then
  # 2. Discover folders in /srv
  for folder in /srv/*; do
    FOLDER_NAME=$(basename "$folder")

    # 3. Determine read-only vs writable (_writable or _rw suffix enables writes)
    if [[ "$FOLDER_NAME" == *"_writable"* || "$FOLDER_NAME" == *"_rw"* ]]; then
      MOUNT_OPTIONS="rbind,rw"
    else
      MOUNT_OPTIONS="rbind,ro"
    fi

    # 4. Inject the mount into config.json (--arg avoids shell-quoting bugs;
    #    OCI mount options are an array of strings, so split the pair)
    jq --arg dest "/srv/$FOLDER_NAME" --arg src "$folder" --arg opts "$MOUNT_OPTIONS" \
      '.mounts += [{destination: $dest, type: "bind", source: $src,
                    options: ($opts | split(","))}]' \
      "$BUNDLE/config.json" > "$BUNDLE/config.json.new"
    mv "$BUNDLE/config.json.new" "$BUNDLE/config.json"
  done
fi

# 5. Call the real runc (or nvidia-container-runtime if GPU)
exec /opt/bin/runc.real "$@"
```
Key design choices:
- Zero-configuration: Workers automatically discover folders; no pipeline changes needed
- Read-only by default: Data safety; tasks can't corrupt datasets
- Writable suffix: _writable or _rw enables writes for model outputs
- Transparent: Tasks see the mounts as if they were baked into the image
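The suffix convention can be isolated into a tiny helper for illustration (the function name and example paths are hypothetical; the convention itself is the one described above):

```bash
# Decide mount options from a folder's name: writable only when the
# name carries a _writable or _rw suffix, read-only otherwise.
mount_options_for() {
  case "$(basename "$1")" in
    *_writable*|*_rw*) echo "rbind,rw" ;;
    *)                 echo "rbind,ro" ;;
  esac
}

mount_options_for /srv/imagenet               # -> rbind,ro
mount_options_for /srv/checkpoints_writable   # -> rbind,rw
```

Encoding the policy in the folder name keeps it visible to operators with a plain `ls /srv`, with no side-channel configuration to drift out of sync.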
GPU + Folder Mounting: The Combined Flow
When both GPU and folder mounting are enabled, the call chain is:
```
containerd
    ↓
/var/lib/concourse/bin/runc (symlink to wrapper)
    ↓
/opt/bin/runc-wrapper
    • Injects /srv folder mounts into config.json
    ↓
/opt/bin/runc.real → nvidia-container-runtime
    • Injects GPU devices + libraries
    ↓
Real runc
    • Creates container with GPU + datasets
```
Performance Considerations
Bind Mount Performance
Folder mounts use bind mounts, not copying:
- Zero copy overhead: Files aren't duplicated into container
- Same inode: Host and container see the same data
- Page cache shared: OS caches data once, benefits both host and container
For a 100 GB dataset:
- Copy-based approach: 100 GB on the host + 100 GB copied into the container = 200 GB of disk, plus minutes of copy time per job
- Bind mount approach: 100 GB of disk total, mounted instantly
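The same-inode property can be demonstrated without root by using a hard link as a stand-in for a bind mount (both mechanisms point two paths at one inode; this assumes GNU stat, where `-c %i` prints the inode number):

```bash
# One file, two names, one inode: no data is duplicated on disk
echo "dataset" > original.bin
ln original.bin mounted.bin          # stand-in for a bind mount
stat -c %i original.bin
stat -c %i mounted.bin               # same inode number as above
```

Because both paths resolve to one inode, the kernel's page cache holds the data once, which is the source of the "page cache shared" benefit listed above.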
GPU Overhead
GPU passthrough via nvidia-container-toolkit or amd-container-toolkit has near-zero overhead because:
- GPU devices are passed through at kernel level (no emulation)
- Libraries are bind-mounted (no copying)
- Compute happens directly on GPU hardware
Benchmarks show container GPU performance is typically 98-100% of bare metal.
Debugging GPU Issues
NVIDIA: GPU Not Detected
```bash
# 1. Check devices exist on host
juju ssh worker/0
ls -la /dev/nvidia*
# Should show: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm

# 2. Verify nvidia-container-toolkit is installed
which nvidia-container-runtime
# Should show: /usr/bin/nvidia-container-runtime

# 3. Check containerd config
sudo grep nvidia /etc/containerd/config.toml
# Should show: default_runtime_name = "nvidia"

# 4. Test GPU in a container manually (ctr does not pull implicitly)
sudo ctr image pull docker.io/nvidia/cuda:13.1.0-base-ubuntu24.04
sudo ctr run --rm --runtime io.containerd.runc.v2 \
  docker.io/nvidia/cuda:13.1.0-base-ubuntu24.04 test nvidia-smi
```
AMD: /dev/kfd Missing
```bash
# 1. Check /dev/kfd exists
juju ssh worker/0
ls -la /dev/kfd
# Must exist! If not, it's an amdgpu driver issue

# 2. Check permissions
ls -la /dev/kfd
# Should show: crw-rw-rw- 1 root render ... /dev/kfd

# 3. Verify the amd-container-toolkit CDI spec
cat /etc/cdi/amd.yaml
# Should list GPU devices, including kfd

# 4. Test in a container (ctr does not pull implicitly)
sudo ctr image pull docker.io/rocm/pytorch:latest
sudo ctr run --rm docker.io/rocm/pytorch:latest test \
  sh -c "ls -la /dev/kfd && python -c 'import torch; print(torch.cuda.is_available())'"
```
Related Topics
- Tutorial: GPU-Enabled Workers - Step-by-step GPU setup
- Tutorial: Dataset Mounting - Using folder mounting with ML workloads
- How-To: How to Configure GPU Workers - Quick GPU configuration
- Reference: ROCm Verification Guide - AMD GPU troubleshooting