Concourse CI Machine Charm - Documentation

Understanding GPU Architecture

Why GPU passthrough needs /dev/kfd, how OCI runtime wrappers work, and discrete vs integrated GPU differences

The Challenge: GPUs in Containers

Running GPU workloads in containers is fundamentally different from running them on bare metal. GPUs are hardware devices, not files or processes; you can't simply "copy" them into a container like you would a binary.

The core challenges:

  1. Device access: the container must receive the right device nodes (/dev/nvidia*, /dev/kfd, /dev/dri/*) from the host.
  2. Library compatibility: the user-space GPU libraries inside the container must be ABI-compatible with the host kernel driver.
  3. Runtime integration: devices and mounts have to be injected at container creation time, which means hooking into the OCI runtime.

Solution Overview: OCI Runtime Wrapper + Vendor Toolkits

The charm implements GPU support through two complementary mechanisms:

| Component | Purpose | Installed When |
|---|---|---|
| Vendor Toolkit | Injects GPU devices and libraries into containers | compute-runtime=cuda or compute-runtime=rocm |
| OCI Runtime Wrapper | Intercepts container creation to inject folder mounts (datasets, models) | Always (for general folder mounting) |

Architecture Flow

Concourse Worker (Host) - Container Creation Flow:

  1. containerd receives a "create container" request and calls runc.
  2. /var/lib/concourse/bin/runc is a symlink that points to the OCI runtime wrapper.
  3. OCI Runtime Wrapper (always runs) intercepts the call and modifies config.json, injecting the /srv folder mounts (datasets, models, outputs).
  4. Vendor runtime (only when compute-runtime=cuda/rocm): nvidia-container-runtime (CUDA) or runc + CDI injection (ROCm) injects the GPU devices (/dev/nvidia*, /dev/kfd, /dev/dri/*).
  5. The running container ends up with GPU devices (/dev/nvidia0, /dev/nvidiactl, /dev/kfd), GPU libraries (CUDA/ROCm runtime), and dataset mounts (/srv/datasets, /srv/models, /srv/outputs).
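The symlink-to-wrapper-to-real-runtime chain can be simulated without containerd. The sketch below builds the same chain out of stub scripts in a throwaway directory; every path here is a stand-in for the real /var/lib/concourse/bin and /opt/bin locations, not the charm's actual files.

```shell
#!/bin/sh
# Simulate: runc (symlink) -> runc-wrapper -> runc.real, using stubs.
set -eu
DEMO=$(mktemp -d)

# Stand-in for /opt/bin/runc.real: just reports it was called
cat > "$DEMO/runc.real" <<'EOF'
#!/bin/sh
echo "real runc called with: $*"
EOF
chmod +x "$DEMO/runc.real"

# Stand-in for /opt/bin/runc-wrapper: would edit config.json here, then exec on
cat > "$DEMO/runc-wrapper" <<EOF
#!/bin/sh
echo "wrapper intercepted: \$*"
exec "$DEMO/runc.real" "\$@"
EOF
chmod +x "$DEMO/runc-wrapper"

# Stand-in for /var/lib/concourse/bin/runc
ln -s "$DEMO/runc-wrapper" "$DEMO/runc"

OUT=$("$DEMO/runc" create --bundle /tmp/bundle)
echo "$OUT"
# Prints:
#   wrapper intercepted: create --bundle /tmp/bundle
#   real runc called with: create --bundle /tmp/bundle
rm -rf "$DEMO"
```

Because the wrapper ends with exec, the caller (containerd in production, the command substitution here) never notices the interception: arguments, stdout, and exit status pass straight through.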

NVIDIA GPU Architecture (CUDA)

Required Devices

For NVIDIA GPUs to work in containers, three categories of devices must be passed through:

| Device Path | Purpose | Required For |
|---|---|---|
| /dev/nvidia0, /dev/nvidia1, ... | GPU device files (one per GPU) | GPU compute operations |
| /dev/nvidiactl | NVIDIA control device | GPU initialization |
| /dev/nvidia-uvm | Unified Virtual Memory | CUDA memory management |

nvidia-container-toolkit

NVIDIA provides nvidia-container-toolkit to automate device injection. It modifies the OCI spec to:

  1. Mount GPU devices into the container's /dev
  2. Bind-mount CUDA libraries from host (e.g., /usr/lib/x86_64-linux-gnu/libnvidia-*.so)
  3. Set environment variables (e.g., NVIDIA_VISIBLE_DEVICES)
  4. Configure ldconfig so container can find libraries

💡 Why Version Matching Matters: CUDA libraries in the container must be ABI-compatible with the host driver. This is why PyTorch images specify CUDA versions (e.g., pytorch/pytorch:2.0.1-cuda11.8). The container toolkit ensures the driver libraries match, while the container supplies the CUDA toolkit libraries.
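That compatibility check can be scripted as a preflight on the worker. The sketch below is an assumption-laden example, not part of the charm: the minimum driver shown for CUDA 11.8 (520.61.05) is taken from NVIDIA's published compatibility table and should be verified for your CUDA release, and DRIVER_VERSION is hardcoded here where a real worker would read it from nvidia-smi.

```shell
# Sketch: is the host driver new enough for a container's CUDA build?
# On a real worker, obtain the driver version with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
DRIVER_VERSION="535.183.01"   # example value, stand-in for the nvidia-smi query
MIN_FOR_CUDA_11_8="520.61.05" # from NVIDIA's compatibility table; verify for your release

driver_supports() {
    # True if version $1 >= version $2 (sort -V does version-aware ordering)
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if driver_supports "$DRIVER_VERSION" "$MIN_FOR_CUDA_11_8"; then
    echo "driver $DRIVER_VERSION can run CUDA 11.8 containers"
else
    echo "driver $DRIVER_VERSION too old for CUDA 11.8 containers"
fi
```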

AMD GPU Architecture (ROCm)

The Critical /dev/kfd Device

AMD ROCm's architecture differs significantly from NVIDIA. The most important component is /dev/kfd (Kernel Fusion Driver):

| Device | Purpose | Impact if Missing |
|---|---|---|
| /dev/kfd | GPU compute interface (HSA) | ❌ No GPU compute - PyTorch/TensorFlow won't detect the GPU |
| /dev/dri/cardN | GPU device file | Monitoring only (rocm-smi works, but compute doesn't) |
| /dev/dri/renderD* | Render node (optional) | Only needed for graphics workloads |

⚠️ Common Mistake: Many users pass through /dev/dri/card* but forget /dev/kfd. This makes rocm-smi work (giving false confidence), but PyTorch/TensorFlow fail with "CUDA (ROCm) available: False" because they need KFD for compute operations.
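This mistake is easy to catch with a preflight check. The helper below (check_rocm_devices, a hypothetical function, not part of the charm) takes the device directory as a parameter so it can be exercised against a fake tree; on a real worker you would call it with /dev.

```shell
# Sketch: verify both the compute interface (kfd) and the dri nodes are present
# before trusting rocm-smi output. Pass /dev on a real host.
check_rocm_devices() {
    devdir="$1"
    if [ ! -e "$devdir/kfd" ]; then
        echo "FAIL: $devdir/kfd missing - GPU compute will not work"
        return 1
    fi
    if [ ! -d "$devdir/dri" ]; then
        echo "WARN: $devdir/dri missing - no card/render nodes for monitoring"
    fi
    echo "OK: $devdir/kfd present - compute interface available"
}

# Demo against a fake device tree (stand-in for /dev)
FAKE=$(mktemp -d)
mkdir -p "$FAKE/dri"
touch "$FAKE/kfd"
check_rocm_devices "$FAKE"
rm -rf "$FAKE"
```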

Why /dev/kfd is Special

The Kernel Fusion Driver provides the HSA (Heterogeneous System Architecture) interface: compute queue creation, GPU memory management, and kernel dispatch all go through it.

Without /dev/kfd, the GPU is essentially "view only"—you can query it, but not run compute kernels on it.

amd-container-toolkit

Similar to NVIDIA, AMD provides amd-container-toolkit. It:

  1. Generates CDI spec: Discovers AMD GPUs and writes /etc/cdi/amd.yaml
  2. Mounts ROCm libraries: Bind-mounts /opt/rocm libraries into container
  3. Sets HSA environment: Configures HSA_OVERRIDE_GFX_VERSION if needed
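The generated spec follows the Container Device Interface (CDI) format. The snippet below is illustrative only: it shows the shape of such a document (cdiVersion, kind, per-device and shared containerEdits), but the exact device names, versions, and library mounts in a real /etc/cdi/amd.yaml vary by host and toolkit version.

```shell
# Write an illustrative CDI spec (NOT the toolkit's verbatim output) to show
# where /dev/kfd and the per-GPU dri nodes appear in the document.
cat > /tmp/amd-cdi-example.yaml <<'EOF'
cdiVersion: "0.6.0"
kind: amd.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/dri/card1
        - path: /dev/dri/renderD128
containerEdits:
  deviceNodes:
    - path: /dev/kfd    # shared compute interface, injected for every GPU
EOF
grep -q "/dev/kfd" /tmp/amd-cdi-example.yaml && echo "kfd listed in CDI spec"
```

Note that /dev/kfd sits in the top-level containerEdits: it is a single shared interface, injected regardless of which GPU the container requests.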

Discrete vs Integrated GPUs

Discrete GPUs (Recommended)

Examples: NVIDIA RTX 3070/4090, AMD RX 7900 XT, Radeon Pro W6800

| Characteristic | Impact on Concourse |
|---|---|
| Dedicated VRAM | ✅ Predictable memory allocation, no contention with system RAM |
| PCI-e connection | ✅ Full bandwidth for data transfers |
| Fully supported by ROCm/CUDA | ✅ No workarounds needed |
| Better cooling | ✅ Can sustain higher workloads |

Integrated GPUs (Limited Support)

Examples: AMD Phoenix (Ryzen 7xxx laptops), Intel Iris Xe, NVIDIA Tegra

| Characteristic | Impact on Concourse |
|---|---|
| Shared system RAM | ⚠️ Memory bandwidth bottleneck, competes with system processes |
| Often unofficial GFX versions | ❌ ROCm rejects unsupported architectures (e.g., gfx1103 for Phoenix) |
| Lower power budget | ⚠️ Thermal throttling under sustained load |
| Requires workarounds | ⚠️ HSA_OVERRIDE_GFX_VERSION needed for AMD integrated GPUs |

The HSA_OVERRIDE_GFX_VERSION Workaround

For integrated AMD GPUs (like Phoenix/gfx1103), ROCm doesn't officially support the architecture. However, you can force ROCm to use kernels from a similar architecture:

# In your pipeline's task config:
params:
  HSA_OVERRIDE_GFX_VERSION: "11.0.0"  # For gfx1103 (Phoenix)

run:
  path: python
  args:
    - train.py

Why this works: ROCm checks the GPU's GFX version and loads optimized kernels for that architecture. By overriding, you tell ROCm "pretend this is gfx1100 and use those kernels." It's suboptimal but functional.

Trade-off: This workaround enables compute but with 20-40% performance penalty compared to native support. Acceptable for development/testing, but not recommended for production ML training.
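If you maintain several integrated-GPU workers, the mapping can be centralized in a small helper. The function below (override_for_gfx, a hypothetical helper, not part of the charm) encodes only the case discussed above, gfx1103 pretending to be gfx1100; extend it with care, since a wrong override crashes kernels rather than slowing them.

```shell
# Sketch: choose an HSA_OVERRIDE_GFX_VERSION for a given GFX architecture.
# Only the mapping discussed in this document is included.
override_for_gfx() {
    case "$1" in
        gfx1103) echo "11.0.0" ;;   # Phoenix iGPU -> borrow gfx1100 kernels
        gfx1100) echo "" ;;         # RX 7900 series: native support, no override
        *)       echo "unknown" ;;  # not vetted - do not guess an override
    esac
}

OVERRIDE=$(override_for_gfx gfx1103)
if [ -n "$OVERRIDE" ] && [ "$OVERRIDE" != "unknown" ]; then
    echo "export HSA_OVERRIDE_GFX_VERSION=$OVERRIDE"
fi
```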

OCI Runtime Wrapper: How Folder Mounting Works

The charm installs a custom runc-wrapper script that intercepts container creation to inject folder mounts from /srv:

Installation Process

  1. Backup original runc: mv /usr/bin/runc /opt/bin/runc.real
  2. Install wrapper: Copy wrapper script to /opt/bin/runc-wrapper
  3. Symlink: ln -s /opt/bin/runc-wrapper /var/lib/concourse/bin/runc
  4. PATH priority: Worker config sets PATH=/opt/bin:... to find wrapper first
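Steps 1-3 above can be sketched as a script. This is a minimal re-creation for illustration, not the charm's actual install code: it runs against a throwaway prefix (DESTDIR) instead of the real filesystem so it can be exercised safely.

```shell
# Sketch of the install steps, sandboxed under a temporary DESTDIR.
set -eu
DESTDIR=$(mktemp -d)
mkdir -p "$DESTDIR/usr/bin" "$DESTDIR/opt/bin" "$DESTDIR/var/lib/concourse/bin"

# Pretend a distro-provided runc already exists
printf '#!/bin/sh\necho real runc\n' > "$DESTDIR/usr/bin/runc"
chmod +x "$DESTDIR/usr/bin/runc"

# 1. Back up the original runc
mv "$DESTDIR/usr/bin/runc" "$DESTDIR/opt/bin/runc.real"

# 2. Install the wrapper (here: a stub that just execs the backup)
printf '#!/bin/sh\nexec "%s" "$@"\n' "$DESTDIR/opt/bin/runc.real" \
    > "$DESTDIR/opt/bin/runc-wrapper"
chmod +x "$DESTDIR/opt/bin/runc-wrapper"

# 3. Symlink so the worker's PATH resolves to the wrapper
ln -s "$DESTDIR/opt/bin/runc-wrapper" "$DESTDIR/var/lib/concourse/bin/runc"

# Calling the symlink lands on the backed-up binary via the wrapper
OUT=$("$DESTDIR/var/lib/concourse/bin/runc")
echo "$OUT"    # prints: real runc
rm -rf "$DESTDIR"
```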

Runtime Behavior

When containerd creates a container, it calls runc create --bundle /path/to/bundle. The wrapper:

#!/bin/bash
# 1. Parse the --bundle argument out of runc's command line
BUNDLE=""
args=("$@")
for ((i = 0; i < ${#args[@]}; i++)); do
    if [[ "${args[i]}" == "--bundle" ]]; then
        BUNDLE="${args[i + 1]}"
    fi
done

# Only rewrite the spec when a bundle is present (i.e. create/run commands)
if [[ -n "$BUNDLE" && -f "$BUNDLE/config.json" ]]; then
    # 2. Discover folders in /srv
    for folder in /srv/*/; do
        [[ -d "$folder" ]] || continue
        FOLDER_NAME=$(basename "$folder")

        # 3. Determine read-only vs writable from the naming convention
        if [[ "$FOLDER_NAME" == *"_writable"* ]]; then
            MOUNT_OPTION="rw"
        else
            MOUNT_OPTION="ro"
        fi

        # 4. Inject the mount into config.json (jq --arg avoids quoting bugs)
        jq --arg dest "/srv/$FOLDER_NAME" \
           --arg src "${folder%/}" \
           --arg opt "$MOUNT_OPTION" \
           '.mounts += [{destination: $dest, type: "bind",
                         source: $src, options: ["rbind", $opt]}]' \
           "$BUNDLE/config.json" > "$BUNDLE/config.json.new"

        mv "$BUNDLE/config.json.new" "$BUNDLE/config.json"
    done
fi

# 5. Call the real runc (or nvidia-container-runtime if GPU support is enabled)
exec /opt/bin/runc.real "$@"
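The jq injection step can be exercised in isolation. The sketch below builds a throwaway bundle and a fake /srv tree (both stand-ins; the real wrapper operates on containerd's bundle and the host's /srv), runs the same per-folder logic, and prints the resulting mounts.

```shell
# Exercise the config.json mount injection against a fake bundle and /srv tree.
set -eu
WORK=$(mktemp -d)
mkdir -p "$WORK/bundle" "$WORK/srv/datasets" "$WORK/srv/outputs_writable"
echo '{"mounts": []}' > "$WORK/bundle/config.json"

for folder in "$WORK/srv"/*/; do
    name=$(basename "$folder")
    case "$name" in
        *_writable*) opt="rw" ;;   # writable by naming convention
        *)           opt="ro" ;;   # read-only by default
    esac
    jq --arg dest "/srv/$name" --arg src "${folder%/}" --arg opt "$opt" \
       '.mounts += [{destination: $dest, type: "bind",
                     source: $src, options: ["rbind", $opt]}]' \
       "$WORK/bundle/config.json" > "$WORK/bundle/config.json.new"
    mv "$WORK/bundle/config.json.new" "$WORK/bundle/config.json"
done

jq -r '.mounts[] | "\(.destination) \(.options[1])"' "$WORK/bundle/config.json"
# Prints (glob order is lexical):
#   /srv/datasets ro
#   /srv/outputs_writable rw
```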

Key design choices:

  1. Convention over configuration: a folder is writable only if its name contains _writable; everything else under /srv is mounted read-only.
  2. Transparent hand-off: the wrapper ends with exec, replacing itself with the real runtime while preserving arguments, exit codes, and signals, so containerd sees standard runc behavior.

GPU + Folder Mounting: The Combined Flow

When both GPU and folder mounting are enabled, the call chain is:

containerd
  ↓
/var/lib/concourse/bin/runc (symlink to wrapper)
  ↓
/opt/bin/runc-wrapper
  • Injects /srv folder mounts into config.json
  ↓
/opt/bin/runc.real → nvidia-container-runtime
  • Injects GPU devices + libraries
  ↓
Real runc
  • Creates container with GPU + datasets

Composability: The wrapper and GPU runtime are independent. You can use folder mounting without GPUs, or GPUs without folder mounting. They compose cleanly because each modifies a different part of the OCI spec (mounts vs devices).

Performance Considerations

Bind Mount Performance

Folder mounts use bind mounts, not copying:

For a 100GB dataset:

| Approach | Total Disk Used | Delay Before Task Starts |
|---|---|---|
| Copy-based | 200GB (host copy + container copy) | ~5 minutes of copying |
| Bind mount | 100GB (single host copy) | ~0s (no data movement) |

GPU Overhead

GPU passthrough via nvidia-container-toolkit or amd-container-toolkit has near-zero overhead because:

  1. Device nodes are passed through directly: the container talks to the same kernel driver as a bare-metal process, with no virtualization layer in between.
  2. GPU libraries are bind-mounted from the host rather than copied.
  3. All injection happens once, at container creation; nothing sits in the compute path afterwards.

Benchmarks show container GPU performance is typically 98-100% of bare metal.

Debugging GPU Issues

NVIDIA: GPU Not Detected

# 1. Check devices exist on host
juju ssh worker/0
ls -la /dev/nvidia*
# Should show: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm

# 2. Verify nvidia-container-toolkit installed
which nvidia-container-runtime
# Should show: /usr/bin/nvidia-container-runtime

# 3. Check containerd config
sudo cat /etc/containerd/config.toml | grep nvidia
# Should show: default_runtime_name = "nvidia"

# 4. Test GPU in container manually (pull first; ctr does not pull on run)
sudo ctr image pull docker.io/nvidia/cuda:13.1.0-base-ubuntu24.04
sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v2 \
  docker.io/nvidia/cuda:13.1.0-base-ubuntu24.04 test nvidia-smi

AMD: /dev/kfd Missing

# 1. Check /dev/kfd exists
juju ssh worker/0
ls -la /dev/kfd
# Must exist! If not, amdgpu driver issue

# 2. Check permissions
ls -la /dev/kfd
# Should show: crw-rw-rw- 1 root render /dev/kfd

# 3. Verify amd-container-toolkit CDI spec
cat /etc/cdi/amd.yaml
# Should list GPU devices including kfd

# 4. Test in container (pull first; ctr does not pull on run)
sudo ctr image pull docker.io/rocm/pytorch:latest
sudo ctr run --rm docker.io/rocm/pytorch:latest test \
  sh -c "ls -la /dev/kfd && python -c 'import torch; print(torch.cuda.is_available())'"

Related Topics