Concourse CI Machine Charm - Documentation

Understanding GPU Architecture

Why GPU passthrough needs /dev/kfd, how OCI runtime wrappers work, and discrete vs integrated GPU differences

The Challenge: GPUs in Containers

Running GPU workloads in containers is fundamentally different from running them on bare metal. GPUs are hardware devices, not files or processes; you can't simply "copy" them into a container like you would a binary.

The core challenges:

  1. Device access: the container must receive the right device nodes (/dev/nvidia*, /dev/kfd, /dev/dri/*) from the host.
  2. Library compatibility: the user-space GPU libraries inside the container must be ABI-compatible with the host kernel driver.
  3. Runtime integration: devices and mounts have to be injected at container creation time, which means hooking into the OCI runtime.

Solution Overview: OCI Runtime Wrapper + Vendor Toolkits

The charm implements GPU support through two complementary mechanisms:

| Component | Purpose | Installed When |
|---|---|---|
| Vendor Toolkit | Injects GPU devices and libraries into containers | compute-runtime=cuda or compute-runtime=rocm |
| OCI Runtime Wrapper | Intercepts container creation to inject folder mounts (datasets, models) | Always (for general folder mounting) |

Architecture Flow

Concourse Worker (Host) - Container Creation Flow:

  1. containerd receives a "create container" request and calls runc.
  2. /var/lib/concourse/bin/runc is a symlink that points to the OCI runtime wrapper.
  3. OCI Runtime Wrapper (always runs) intercepts the call and modifies config.json, injecting the /srv folder mounts (datasets, models, outputs).
  4. Vendor runtime (only when compute-runtime=cuda/rocm): nvidia-container-runtime (CUDA) or runc + CDI injection (ROCm) injects the GPU devices (/dev/nvidia*, /dev/kfd, /dev/dri/*).
  5. The running container ends up with GPU devices (/dev/nvidia0, /dev/nvidiactl, /dev/kfd), GPU libraries (CUDA/ROCm runtime), and dataset mounts (/srv/datasets, /srv/models, /srv/outputs).
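The symlink-to-wrapper-to-real-runtime chain can be simulated without containerd. The sketch below builds the same chain out of stub scripts in a throwaway directory; every path here is a stand-in for the real /var/lib/concourse/bin and /opt/bin locations, not the charm's actual files.

```shell
#!/bin/sh
# Simulate: runc (symlink) -> runc-wrapper -> runc.real, using stubs.
set -eu
DEMO=$(mktemp -d)

# Stand-in for /opt/bin/runc.real: just reports it was called
cat > "$DEMO/runc.real" <<'EOF'
#!/bin/sh
echo "real runc called with: $*"
EOF
chmod +x "$DEMO/runc.real"

# Stand-in for /opt/bin/runc-wrapper: would edit config.json here, then exec on
cat > "$DEMO/runc-wrapper" <<EOF
#!/bin/sh
echo "wrapper intercepted: \$*"
exec "$DEMO/runc.real" "\$@"
EOF
chmod +x "$DEMO/runc-wrapper"

# Stand-in for /var/lib/concourse/bin/runc
ln -s "$DEMO/runc-wrapper" "$DEMO/runc"

OUT=$("$DEMO/runc" create --bundle /tmp/bundle)
echo "$OUT"
# Prints:
#   wrapper intercepted: create --bundle /tmp/bundle
#   real runc called with: create --bundle /tmp/bundle
rm -rf "$DEMO"
```

Because the wrapper ends with exec, the caller (containerd in production, the command substitution here) never notices the interception: arguments, stdout, and exit status pass straight through.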

NVIDIA GPU Architecture (CUDA)

Required Devices

For NVIDIA GPUs to work in containers, three categories of devices must be passed through:

| Device Path | Purpose | Required For |
|---|---|---|
| /dev/nvidia0, /dev/nvidia1, ... | GPU device files (one per GPU) | GPU compute operations |
| /dev/nvidiactl | NVIDIA control device | GPU initialization |
| /dev/nvidia-uvm | Unified Virtual Memory | CUDA memory management |

nvidia-container-toolkit

NVIDIA provides nvidia-container-toolkit to automate device injection. It modifies the OCI spec to:

  1. Mount GPU devices into the container's /dev
  2. Bind-mount CUDA libraries from host (e.g., /usr/lib/x86_64-linux-gnu/libnvidia-*.so)
  3. Set environment variables (e.g., NVIDIA_VISIBLE_DEVICES)
  4. Configure ldconfig so container can find libraries

💡 Why Version Matching Matters: CUDA libraries in the container must be ABI-compatible with the host driver. This is why PyTorch images specify CUDA versions (e.g., pytorch/pytorch:2.0.1-cuda11.8). The container toolkit ensures the driver libraries match, while the container supplies the CUDA toolkit libraries.
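That compatibility check can be scripted as a preflight on the worker. The sketch below is an assumption-laden example, not part of the charm: the minimum driver shown for CUDA 11.8 (520.61.05) is taken from NVIDIA's published compatibility table and should be verified for your CUDA release, and DRIVER_VERSION is hardcoded here where a real worker would read it from nvidia-smi.

```shell
# Sketch: is the host driver new enough for a container's CUDA build?
# On a real worker, obtain the driver version with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
DRIVER_VERSION="535.183.01"   # example value, stand-in for the nvidia-smi query
MIN_FOR_CUDA_11_8="520.61.05" # from NVIDIA's compatibility table; verify for your release

driver_supports() {
    # True if version $1 >= version $2 (sort -V does version-aware ordering)
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if driver_supports "$DRIVER_VERSION" "$MIN_FOR_CUDA_11_8"; then
    echo "driver $DRIVER_VERSION can run CUDA 11.8 containers"
else
    echo "driver $DRIVER_VERSION too old for CUDA 11.8 containers"
fi
```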

AMD GPU Architecture (ROCm)

The Critical /dev/kfd Device

AMD ROCm's architecture differs significantly from NVIDIA. The most important component is /dev/kfd (Kernel Fusion Driver):

| Device | Purpose | Impact if Missing |
|---|---|---|
| /dev/kfd | GPU compute interface (HSA) | ❌ No GPU compute - PyTorch/TensorFlow won't detect the GPU |
| /dev/dri/cardN | GPU device file | Monitoring only (rocm-smi works, but compute doesn't) |
| /dev/dri/renderD* | Render node (optional) | Only needed for graphics workloads |

⚠️ Common Mistake: Many users pass through /dev/dri/card* but forget /dev/kfd. This makes rocm-smi work (giving false confidence), but PyTorch/TensorFlow fail with "CUDA (ROCm) available: False" because they need KFD for compute operations.
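This mistake is easy to catch with a preflight check. The helper below (check_rocm_devices, a hypothetical function, not part of the charm) takes the device directory as a parameter so it can be exercised against a fake tree; on a real worker you would call it with /dev.

```shell
# Sketch: verify both the compute interface (kfd) and the dri nodes are present
# before trusting rocm-smi output. Pass /dev on a real host.
check_rocm_devices() {
    devdir="$1"
    if [ ! -e "$devdir/kfd" ]; then
        echo "FAIL: $devdir/kfd missing - GPU compute will not work"
        return 1
    fi
    if [ ! -d "$devdir/dri" ]; then
        echo "WARN: $devdir/dri missing - no card/render nodes for monitoring"
    fi
    echo "OK: $devdir/kfd present - compute interface available"
}

# Demo against a fake device tree (stand-in for /dev)
FAKE=$(mktemp -d)
mkdir -p "$FAKE/dri"
touch "$FAKE/kfd"
check_rocm_devices "$FAKE"
rm -rf "$FAKE"
```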

Why /dev/kfd is Special

The Kernel Fusion Driver provides the HSA (Heterogeneous System Architecture) interface: compute queue creation, GPU memory management, and kernel dispatch all go through it.

Without /dev/kfd, the GPU is essentially "view only"—you can query it, but not run compute kernels on it.

amd-container-toolkit

Similar to NVIDIA, AMD provides amd-container-toolkit. It:

  1. Generates CDI spec: Discovers AMD GPUs and writes /etc/cdi/amd.yaml
  2. Mounts ROCm libraries: Bind-mounts /opt/rocm libraries into container
  3. Sets HSA environment: Configures HSA_OVERRIDE_GFX_VERSION if needed
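The generated spec follows the Container Device Interface (CDI) format. The snippet below is illustrative only: it shows the shape of such a document (cdiVersion, kind, per-device and shared containerEdits), but the exact device names, versions, and library mounts in a real /etc/cdi/amd.yaml vary by host and toolkit version.

```shell
# Write an illustrative CDI spec (NOT the toolkit's verbatim output) to show
# where /dev/kfd and the per-GPU dri nodes appear in the document.
cat > /tmp/amd-cdi-example.yaml <<'EOF'
cdiVersion: "0.6.0"
kind: amd.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/dri/card1
        - path: /dev/dri/renderD128
containerEdits:
  deviceNodes:
    - path: /dev/kfd    # shared compute interface, injected for every GPU
EOF
grep -q "/dev/kfd" /tmp/amd-cdi-example.yaml && echo "kfd listed in CDI spec"
```

Note that /dev/kfd sits in the top-level containerEdits: it is a single shared interface, injected regardless of which GPU the container requests.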

Discrete vs Integrated GPUs

Discrete GPUs (Recommended)

Examples: NVIDIA RTX 3070/4090, AMD RX 7900 XT, Radeon Pro W6800

| Characteristic | Impact on Concourse |
|---|---|
| Dedicated VRAM | ✅ Predictable memory allocation, no contention with system RAM |
| PCI-e connection | ✅ Full bandwidth for data transfers |
| Fully supported by ROCm/CUDA | ✅ No workarounds needed |
| Better cooling | ✅ Can sustain higher workloads |

Integrated GPUs (Limited Support)

Examples: AMD Phoenix (Ryzen 7xxx laptops), Intel Iris Xe, NVIDIA Tegra

| Characteristic | Impact on Concourse |
|---|---|
| Shared system RAM | ⚠️ Memory bandwidth bottleneck, competes with system processes |
| Often unofficial GFX versions | ❌ ROCm rejects unsupported architectures (e.g., gfx1103 for Phoenix) |
| Lower power budget | ⚠️ Thermal throttling under sustained load |
| Requires workarounds | ⚠️ HSA_OVERRIDE_GFX_VERSION needed for AMD integrated GPUs |

The HSA_OVERRIDE_GFX_VERSION Workaround

For integrated AMD GPUs (like Phoenix/gfx1103), ROCm doesn't officially support the architecture. However, you can force ROCm to use kernels from a similar architecture:

# In your pipeline's task config:
params:
  HSA_OVERRIDE_GFX_VERSION: "11.0.0"  # For gfx1103 (Phoenix)

run:
  path: python
  args:
    - train.py

Why this works: ROCm checks the GPU's GFX version and loads optimized kernels for that architecture. By overriding, you tell ROCm "pretend this is gfx1100 and use those kernels." It's suboptimal but functional.

Trade-off: This workaround enables compute but with 20-40% performance penalty compared to native support. Acceptable for development/testing, but not recommended for production ML training.
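If you maintain several integrated-GPU workers, the mapping can be centralized in a small helper. The function below (override_for_gfx, a hypothetical helper, not part of the charm) encodes only the case discussed above, gfx1103 pretending to be gfx1100; extend it with care, since a wrong override crashes kernels rather than slowing them.

```shell
# Sketch: choose an HSA_OVERRIDE_GFX_VERSION for a given GFX architecture.
# Only the mapping discussed in this document is included.
override_for_gfx() {
    case "$1" in
        gfx1103) echo "11.0.0" ;;   # Phoenix iGPU -> borrow gfx1100 kernels
        gfx1100) echo "" ;;         # RX 7900 series: native support, no override
        *)       echo "unknown" ;;  # not vetted - do not guess an override
    esac
}

OVERRIDE=$(override_for_gfx gfx1103)
if [ -n "$OVERRIDE" ] && [ "$OVERRIDE" != "unknown" ]; then
    echo "export HSA_OVERRIDE_GFX_VERSION=$OVERRIDE"
fi
```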

OCI Runtime Wrapper: How Folder Mounting Works

The charm installs a custom runc-wrapper script that intercepts container creation to inject folder mounts from /srv:

Installation Process

  1. Backup original runc: mv /usr/bin/runc /opt/bin/runc.real
  2. Install wrapper: Copy wrapper script to /opt/bin/runc-wrapper
  3. Symlink: ln -s /opt/bin/runc-wrapper /var/lib/concourse/bin/runc
  4. PATH priority: Worker config sets PATH=/opt/bin:... to find wrapper first
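Steps 1-3 above can be sketched as a script. This is a minimal re-creation for illustration, not the charm's actual install code: it runs against a throwaway prefix (DESTDIR) instead of the real filesystem so it can be exercised safely.

```shell
# Sketch of the install steps, sandboxed under a temporary DESTDIR.
set -eu
DESTDIR=$(mktemp -d)
mkdir -p "$DESTDIR/usr/bin" "$DESTDIR/opt/bin" "$DESTDIR/var/lib/concourse/bin"

# Pretend a distro-provided runc already exists
printf '#!/bin/sh\necho real runc\n' > "$DESTDIR/usr/bin/runc"
chmod +x "$DESTDIR/usr/bin/runc"

# 1. Back up the original runc
mv "$DESTDIR/usr/bin/runc" "$DESTDIR/opt/bin/runc.real"

# 2. Install the wrapper (here: a stub that just execs the backup)
printf '#!/bin/sh\nexec "%s" "$@"\n' "$DESTDIR/opt/bin/runc.real" \
    > "$DESTDIR/opt/bin/runc-wrapper"
chmod +x "$DESTDIR/opt/bin/runc-wrapper"

# 3. Symlink so the worker's PATH resolves to the wrapper
ln -s "$DESTDIR/opt/bin/runc-wrapper" "$DESTDIR/var/lib/concourse/bin/runc"

# Calling the symlink lands on the backed-up binary via the wrapper
OUT=$("$DESTDIR/var/lib/concourse/bin/runc")
echo "$OUT"    # prints: real runc
rm -rf "$DESTDIR"
```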

Runtime Behavior

When containerd creates a container, it calls runc create --bundle /path/to/bundle. The wrapper:

#!/bin/bash
# 1. Parse the --bundle argument out of runc's command line
BUNDLE=""
args=("$@")
for ((i = 0; i < ${#args[@]}; i++)); do
    if [[ "${args[i]}" == "--bundle" ]]; then
        BUNDLE="${args[i + 1]}"
    fi
done

# Only rewrite the spec when a bundle is present (i.e. create/run commands)
if [[ -n "$BUNDLE" && -f "$BUNDLE/config.json" ]]; then
    # 2. Discover folders in /srv
    for folder in /srv/*/; do
        [[ -d "$folder" ]] || continue
        FOLDER_NAME=$(basename "$folder")

        # 3. Determine read-only vs writable from the naming convention
        if [[ "$FOLDER_NAME" == *"_writable"* ]]; then
            MOUNT_OPTION="rw"
        else
            MOUNT_OPTION="ro"
        fi

        # 4. Inject the mount into config.json (jq --arg avoids quoting bugs)
        jq --arg dest "/srv/$FOLDER_NAME" \
           --arg src "${folder%/}" \
           --arg opt "$MOUNT_OPTION" \
           '.mounts += [{destination: $dest, type: "bind",
                         source: $src, options: ["rbind", $opt]}]' \
           "$BUNDLE/config.json" > "$BUNDLE/config.json.new"

        mv "$BUNDLE/config.json.new" "$BUNDLE/config.json"
    done
fi

# 5. Call the real runc (or nvidia-container-runtime if GPU support is enabled)
exec /opt/bin/runc.real "$@"
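The jq injection step can be exercised in isolation. The sketch below builds a throwaway bundle and a fake /srv tree (both stand-ins; the real wrapper operates on containerd's bundle and the host's /srv), runs the same per-folder logic, and prints the resulting mounts.

```shell
# Exercise the config.json mount injection against a fake bundle and /srv tree.
set -eu
WORK=$(mktemp -d)
mkdir -p "$WORK/bundle" "$WORK/srv/datasets" "$WORK/srv/outputs_writable"
echo '{"mounts": []}' > "$WORK/bundle/config.json"

for folder in "$WORK/srv"/*/; do
    name=$(basename "$folder")
    case "$name" in
        *_writable*) opt="rw" ;;   # writable by naming convention
        *)           opt="ro" ;;   # read-only by default
    esac
    jq --arg dest "/srv/$name" --arg src "${folder%/}" --arg opt "$opt" \
       '.mounts += [{destination: $dest, type: "bind",
                     source: $src, options: ["rbind", $opt]}]' \
       "$WORK/bundle/config.json" > "$WORK/bundle/config.json.new"
    mv "$WORK/bundle/config.json.new" "$WORK/bundle/config.json"
done

jq -r '.mounts[] | "\(.destination) \(.options[1])"' "$WORK/bundle/config.json"
# Prints (glob order is lexical):
#   /srv/datasets ro
#   /srv/outputs_writable rw
```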

Key design choices:

  1. Convention over configuration: a folder is writable only if its name contains _writable; everything else under /srv is mounted read-only.
  2. Transparent hand-off: the wrapper ends with exec, replacing itself with the real runtime while preserving arguments, exit codes, and signals, so containerd sees standard runc behavior.

GPU + Folder Mounting: The Combined Flow

When both GPU and folder mounting are enabled, the call chain is:

containerd
  ↓
/var/lib/concourse/bin/runc (symlink to wrapper)
  ↓
/opt/bin/runc-wrapper
  • Injects /srv folder mounts into config.json
  ↓
/opt/bin/runc.real → nvidia-container-runtime
  • Injects GPU devices + libraries
  ↓
Real runc
  • Creates container with GPU + datasets

Composability: The wrapper and GPU runtime are independent. You can use folder mounting without GPUs, or GPUs without folder mounting. They compose cleanly because each modifies a different part of the OCI spec (mounts vs devices).

Performance Considerations

Bind Mount Performance

Folder mounts use bind mounts, not copying:

For a 100GB dataset:

| Approach | Total Disk Used | Delay Before Task Starts |
|---|---|---|
| Copy-based | 200GB (host copy + container copy) | ~5 minutes of copying |
| Bind mount | 100GB (single host copy) | ~0s (no data movement) |

GPU Overhead

GPU passthrough via nvidia-container-toolkit or amd-container-toolkit has near-zero overhead because:

  1. Device nodes are passed through directly: the container talks to the same kernel driver as a bare-metal process, with no virtualization layer in between.
  2. GPU libraries are bind-mounted from the host rather than copied.
  3. All injection happens once, at container creation; nothing sits in the compute path afterwards.

Benchmarks show container GPU performance is typically 98-100% of bare metal.

Debugging GPU Issues

NVIDIA: GPU Not Detected

# 1. Check devices exist on host
juju ssh worker/0
ls -la /dev/nvidia*
# Should show: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm

# 2. Verify nvidia-container-toolkit installed
which nvidia-container-runtime
# Should show: /usr/bin/nvidia-container-runtime

# 3. Check containerd config
sudo cat /etc/containerd/config.toml | grep nvidia
# Should show: default_runtime_name = "nvidia"

# 4. Test GPU in container manually (pull first; ctr does not pull on run)
sudo ctr image pull docker.io/nvidia/cuda:13.1.0-base-ubuntu24.04
sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v2 \
  docker.io/nvidia/cuda:13.1.0-base-ubuntu24.04 test nvidia-smi

AMD: /dev/kfd Missing

# 1. Check /dev/kfd exists
juju ssh worker/0
ls -la /dev/kfd
# Must exist! If not, amdgpu driver issue

# 2. Check permissions
ls -la /dev/kfd
# Should show: crw-rw-rw- 1 root render /dev/kfd

# 3. Verify amd-container-toolkit CDI spec
cat /etc/cdi/amd.yaml
# Should list GPU devices including kfd

# 4. Test in container (pull first; ctr does not pull on run)
sudo ctr image pull docker.io/rocm/pytorch:latest
sudo ctr run --rm docker.io/rocm/pytorch:latest test \
  sh -c "ls -la /dev/kfd && python -c 'import torch; print(torch.cuda.is_available())'"

Related Topics