Concourse CI Machine Charm

Documentation

Configure GPU Workers

Enable NVIDIA CUDA or AMD ROCm GPU support on workers

Enable NVIDIA GPU

# Deploy with CUDA enabled
juju deploy concourse-ci-machine --channel edge worker \
  --config mode=worker \
  --config compute-runtime=cuda

# Or enable on existing worker
juju config worker compute-runtime=cuda

Add GPU to LXC Container

# Find container name (look for "juju-" prefix)
lxc list

# Add GPU device (replace with your actual container name)
lxc config device add juju-abc123-0 gpu0 gpu

# Verify
lxc exec juju-abc123-0 -- nvidia-smi
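The lookup-and-attach steps above can also be scripted so the container name never has to be copied by hand; this is a sketch, assuming exactly one "juju-" container on the host and that `lxc` is on the PATH:

```shell
# Resolve the juju worker container name automatically
# (csv output, name column only; assumes a single juju- container).
container="$(lxc list --format csv -c n | grep '^juju-' | head -n 1)"

# Attach the first host GPU and confirm the driver sees it.
lxc config device add "$container" gpu0 gpu
lxc exec "$container" -- nvidia-smi
```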

Enable AMD GPU (ROCm)

# Deploy with ROCm enabled
juju deploy concourse-ci-machine --channel edge worker \
  --config mode=worker \
  --config compute-runtime=rocm

# Or enable on existing worker
juju config worker compute-runtime=rocm

Add AMD GPU to LXC Container

# Query available GPUs (important for multi-GPU systems)
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, driver_version, vendor_id, product_id}'

# Add specific AMD GPU by ID (recommended)
lxc config device add <container-name> gpu1 gpu id=1

# Add /dev/kfd for compute workloads (REQUIRED)
lxc config device add <container-name> kfd unix-char \
  source=/dev/kfd \
  path=/dev/kfd

# Verify
lxc exec <container-name> -- rocm-smi

⚠️ Multi-GPU systems: when GPUs from multiple vendors are present, always use id=N to target the specific AMD GPU. Without an ID, all GPUs are passed through, which can cause conflicts.
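On mixed-vendor hosts, the ID to pass as id=N can be extracted from the same resources query by filtering on the vendor; a sketch, assuming jq is installed and using AMD's PCI vendor ID (1002):

```shell
# Print the DRM id of each AMD card only (PCI vendor ID 1002),
# suitable for use as the id=N value above.
lxc query /1.0/resources \
  | jq -r '.gpu.cards[] | select(.vendor_id == "1002") | .drm.id'
```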

Disable GPU

juju config worker compute-runtime=none

Configure GPU Device Selection

Control which GPUs are exposed to tasks:

# Expose all GPUs (default)
juju config worker gpu-device-ids=all

# Expose specific GPUs
juju config worker gpu-device-ids="0,1"
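To confirm what the worker will actually expose after a change, the value can be read back with the same command; a sketch:

```shell
# With no value argument, juju config prints the current setting.
juju config worker gpu-device-ids
```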

Verify GPU Configuration

# Check worker status
juju status worker
# Should show: "Worker ready (GPU: 1x NVIDIA)" or "Worker ready (GPU: 1x AMD)"

# Check Concourse CI worker tags
juju ssh web/0
fly -t local workers
# Should show tags: cuda (or rocm), gpu-count=1

Test GPU in Pipeline

NVIDIA Test

jobs:
  - name: test-nvidia
    plan:
      - task: gpu-test
        tags: [cuda]
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: nvidia/cuda
              tag: 13.1.0-base-ubuntu24.04
          run:
            path: nvidia-smi

AMD Test

jobs:
  - name: test-amd
    plan:
      - task: gpu-test
        tags: [rocm]
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: rocm/dev-ubuntu-24.04
              tag: latest
          run:
            path: rocm-smi
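Either test pipeline above can be saved to a file and run with fly from the web node; a sketch, assuming a fly target named local and a pipeline file gpu-test.yml (both hypothetical names):

```shell
# Upload the pipeline, unpause it, then trigger the job and watch output.
fly -t local set-pipeline -p gpu-test -c gpu-test.yml
fly -t local unpause-pipeline -p gpu-test
fly -t local trigger-job -j gpu-test/test-amd --watch
```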

Common Issues

NVIDIA: "GPU enabled but no GPU detected"

# Check host has GPU
nvidia-smi

# Check LXC device
lxc config device show <container-name>

# Check inside container
lxc exec <container-name> -- nvidia-smi

AMD: "CUDA (ROCm) available: False" in PyTorch

# 1. Verify /dev/kfd exists
lxc exec <container-name> -- ls -la /dev/kfd

# 2. If missing, add it
lxc config device add <container-name> kfd unix-char \
  source=/dev/kfd path=/dev/kfd

# 3. For integrated GPUs (e.g. Phoenix1/gfx1103), override the HSA GFX version
# in your pipeline task:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
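A quick way to check whether PyTorch now sees the GPU is to run its detection calls inside the container; a sketch, assuming a ROCm build of PyTorch is installed there:

```shell
# torch.version.hip is set only on ROCm builds of PyTorch;
# torch.cuda.is_available() returns True once /dev/kfd and the
# GPU device are both visible inside the container.
lxc exec <container-name> -- python3 -c \
  "import torch; print(torch.version.hip, torch.cuda.is_available())"
```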

GPU Not Showing in Task

Check that the task's tags match the worker's runtime tag (cuda or rocm), that gpu-device-ids includes the device (or is set to all), and that juju status worker reports the GPU as detected.

Related Documentation