Concourse CI Machine Charm

Adding GPU Workers for ML

Learn how to deploy GPU-enabled Concourse CI workers for machine learning and AI workloads

📚 What You'll Build: By the end of this tutorial, you'll have a Concourse CI worker with GPU support that can run PyTorch, TensorFlow, and other ML training tasks.

Why GPU Workers?

GPU-enabled workers let you run:

- PyTorch and TensorFlow training jobs
- CUDA or ROCm compute tasks inside CI pipelines
- GPU-accelerated tests and inference workloads

The charm handles all the GPU software setup automatically; you just need to configure the hardware passthrough.

Before You Start

This tutorial requires:

- A host machine with an NVIDIA or AMD GPU and working drivers
- A Juju controller on an LXD cloud (the worker runs in an LXC container)
- The fly CLI installed locally

⚠️ GPU Requirement: You must have a physical GPU and working drivers. Virtual machines typically don't have GPU passthrough configured.

Step 1: Verify GPU on Host

First, let's make sure your GPU is visible on the host machine.

1 Check for NVIDIA GPU

nvidia-smi

Or for AMD GPU:

rocm-smi

You should see your GPU listed. If not, install drivers first!
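If the command isn't found, one way to install NVIDIA drivers on an Ubuntu host is the ubuntu-drivers tool. This is a sketch: it assumes the ubuntu-drivers-common package is present, and other distros will differ.

```shell
# Sketch: install the recommended NVIDIA driver on an Ubuntu host.
# Assumption: the ubuntu-drivers tool (ubuntu-drivers-common) is installed.
sudo ubuntu-drivers list       # show driver packages matching your hardware
sudo ubuntu-drivers install    # install the recommended driver
sudo reboot                    # reboot so the kernel module loads
```

After the reboot, re-run nvidia-smi to confirm the driver loaded.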

💡 Example Output (NVIDIA): You should see your GPU model, driver version, and CUDA version. Something like "NVIDIA RTX A500" or "RTX 3070".

Step 2: Deploy PostgreSQL and Web Server

If you don't have a web server running yet, let's deploy one.

2 Create model and deploy prerequisites

juju add-model concourse-gpu

# Deploy database
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy web server
juju deploy concourse-ci-machine web \
  --config mode=web \
  --channel edge \
  --base ubuntu@24.04

# Connect them
juju integrate web:postgresql postgresql:database

Step 3: Deploy GPU-Enabled Worker

Now for the exciting part - deploying a GPU worker!

3 Deploy worker with GPU enabled (NVIDIA)

juju deploy concourse-ci-machine gpu-worker \
  --config mode=worker \
  --config compute-runtime=cuda \
  --channel edge \
  --base ubuntu@24.04

For AMD GPUs, use:

juju deploy concourse-ci-machine gpu-worker \
  --config mode=worker \
  --config compute-runtime=rocm \
  --channel edge \
  --base ubuntu@24.04

4 Connect worker to web server

juju integrate web:tsa gpu-worker:flight

This creates the TSA connection between web and worker.

Step 4: Add GPU to LXC Container

This is the crucial step! We need to pass the GPU hardware into the container.

5 Find the container name

juju status gpu-worker

Note the machine number (e.g., "4"). The container name is juju-<model-id>-<machine-num>.

6 List containers to find exact name

# List all containers - look for "juju-" prefix
lxc list

# Example output shows: juju-abc123-4 (your actual container)

Copy the container name from the output (e.g., juju-abc123-4).
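To avoid copy-paste mistakes, you can capture the container name in a shell variable. This is a sketch that assumes the gpu-worker's container is the only (or first) juju- container on this host.

```shell
# Grab the first juju-managed container name from `lxc list`.
# Assumption: the gpu-worker's container is the first juju-* entry.
CONTAINER=$(lxc list --format csv -c n | grep '^juju-' | head -n1)
echo "Worker container: $CONTAINER"
```

You can then use $CONTAINER in place of the literal container name in the lxc config device commands.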

7 Add GPU device to container

# Replace juju-xxxxx-4 with your actual container name
lxc config device add juju-xxxxx-4 gpu0 gpu

⚠️ Multi-GPU Systems: If you have both NVIDIA and AMD GPUs, specify which one:

# List available GPUs
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, driver_version, vendor_id, product_id}'

# Add specific GPU (e.g., GPU ID 1)
lxc config device add juju-xxxxx-4 gpu1 gpu id=1

For AMD ROCm GPUs, also add /dev/kfd:

lxc config device add juju-xxxxx-4 kfd unix-char \
  source=/dev/kfd \
  path=/dev/kfd

Step 5: Wait for Charm to Configure GPU

The charm will automatically detect the GPU and configure everything!

8 Watch the worker configure itself

juju status gpu-worker --watch 5s

Wait for the status to show:

Worker ready (v7.14.2) (GPU: 1x NVIDIA)

Or for AMD:

Worker ready (v7.14.2) (GPU: 1x AMD)

🎉 GPU Detected! The charm has automatically installed nvidia-container-toolkit (or amd-container-toolkit), configured the runtime, and tagged the worker with GPU capabilities.
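Before wiring up pipelines, you can sanity-check the passthrough by running the vendor's SMI tool directly inside the container. This is a sketch; juju-xxxxx-4 is a placeholder for your actual container name.

```shell
# Confirm the GPU is visible from inside the worker's container.
lxc exec juju-xxxxx-4 -- nvidia-smi    # NVIDIA
# lxc exec juju-xxxxx-4 -- rocm-smi    # AMD
```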

Step 6: Test GPU Access

Let's verify the GPU is actually accessible in tasks.

9 Create a test pipeline

cat > gpu-test.yml <<'EOF'
jobs:
- name: gpu-check
  plan:
  - task: nvidia-smi
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: nvidia/cuda
          tag: 12.1.0-base-ubuntu22.04
      run:
        path: nvidia-smi
EOF

For AMD, use this instead:

cat > gpu-test.yml <<'EOF'
jobs:
- name: gpu-check
  plan:
  - task: rocm-smi
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/dev-ubuntu-24.04
          tag: latest
      run:
        path: rocm-smi
EOF

Step 7: Run Your First GPU Task

10 Get web server IP and login

# Get IP
WEB_IP=$(juju status web/0 --format=json | jq -r '.applications.web.units["web/0"]["public-address"]')

# Get password
ADMIN_PASS=$(juju run web/leader get-admin-password --format=json | jq -r '."unit-web-0".results.password')

# Login
fly -t gpu login -c http://$WEB_IP:8080 -u admin -p "$ADMIN_PASS"

11 Set and run the pipeline

fly -t gpu set-pipeline -p gpu-test -c gpu-test.yml
fly -t gpu unpause-pipeline -p gpu-test
fly -t gpu trigger-job -j gpu-test/gpu-check -w

🚀 Success! If you see GPU information in the output (device name, driver version, memory), your GPU worker is fully operational!

Step 8: Run a Real ML Task

Let's run something more exciting - a PyTorch GPU test!

12 Create PyTorch test pipeline

cat > pytorch-test.yml <<'EOF'
jobs:
- name: pytorch-gpu-test
  plan:
  - task: test-pytorch
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python
        args:
          - -c
          - |
            import torch
            print(f"PyTorch version: {torch.__version__}")
            print(f"CUDA available: {torch.cuda.is_available()}")
            print(f"CUDA version: {torch.version.cuda}")
            if torch.cuda.is_available():
                print(f"GPU device: {torch.cuda.get_device_name(0)}")
                print(f"GPU count: {torch.cuda.device_count()}")
                # Test actual GPU computation
                x = torch.rand(5, 3).cuda()
                y = x * 2
                print(f"GPU tensor computation successful!")
                print(f"Result: {y}")
            else:
                print("ERROR: CUDA not available!")
                exit(1)
EOF

13 Run the PyTorch test

fly -t gpu set-pipeline -p pytorch-test -c pytorch-test.yml
fly -t gpu unpause-pipeline -p pytorch-test  
fly -t gpu trigger-job -j pytorch-test/pytorch-gpu-test -w

You should see output showing the PyTorch version, "CUDA available: True", the GPU device name and count, and the result of a small tensor computation on the GPU.

What You've Accomplished

Congratulations! You now have:

- A Concourse web server backed by PostgreSQL
- A GPU-enabled worker connected to the web server over the TSA relation
- A verified GPU pipeline running PyTorch on real hardware

💡 What Happened Behind the Scenes: The charm detected the GPU passed into the container, installed the matching container toolkit (nvidia-container-toolkit or amd-container-toolkit), configured the container runtime, and tagged the worker (cuda or rocm) so pipelines can target it.

Next Steps

Now that you have GPU workers, explore these advanced topics:

  1. Mount datasets: Follow the Dataset Mounting tutorial to automatically inject training data
  2. Scale GPU workers: Add more GPU workers for parallel training
  3. Mixed worker fleet: Deploy both CPU and GPU workers, use tags to route tasks
  4. Understanding the implementation: Read how GPU support works

Troubleshooting

Worker shows "GPU enabled but no GPU detected"

Solution: The LXC GPU device hasn't been added yet. Repeat Step 4 (Add GPU to LXC Container) to pass the GPU device into the worker's container.
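To check whether the device is already attached before re-adding it, you can list the container's devices. This is a sketch; replace the placeholder container name with yours.

```shell
# An attached GPU shows up with "type: gpu" in the device config.
lxc config device show juju-xxxxx-4 | grep -B1 'type: gpu' || echo "no gpu device attached"

# If nothing is attached, add it:
lxc config device add juju-xxxxx-4 gpu0 gpu
```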

PyTorch says "CUDA not available"

Check these:

- The task's tags match the worker's runtime tag (cuda)
- The LXC GPU device was actually added to the worker's container
- The image's CUDA version is compatible with the host driver

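A quick first check is the worker list that Concourse itself sees; the task's tags must match a tag the worker advertises. This uses the fly target named gpu from earlier.

```shell
# List registered workers; the gpu-worker row should include "cuda"
# (or "rocm") in its tags column.
fly -t gpu workers
```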
AMD GPU: "HSA_STATUS_ERROR_OUT_OF_RESOURCES"

For integrated AMD GPUs (APUs): Add this to your pipeline task:

env:
  HSA_OVERRIDE_GFX_VERSION: "11.0.0"  # For gfx1103/Phoenix1

Task can't find GPU devices

Verify:

- The gpu device appears in the container's config (lxc config device show)
- The worker status in juju status reports the GPU (e.g., "GPU: 1x NVIDIA")
- Your task uses the correct tag so it schedules on the GPU worker

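From inside the container you can check that the device nodes exist. This is a sketch; the node paths vary by vendor and driver version, and juju-xxxxx-4 is a placeholder.

```shell
# NVIDIA passthrough typically creates /dev/nvidia* nodes; AMD ROCm needs
# /dev/kfd in addition to /dev/dri.
lxc exec juju-xxxxx-4 -- sh -c 'ls -l /dev/nvidia* /dev/dri /dev/kfd 2>/dev/null || true'
```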
🎓 Congratulations! You've successfully deployed GPU-enabled Concourse CI workers. You're ready to run ML training pipelines at scale!