# Adding GPU Workers for ML

Learn how to deploy GPU-enabled Concourse CI workers for machine learning and AI workloads.

## Why GPU Workers?
GPU-enabled workers let you run:
- ML model training (PyTorch, TensorFlow, JAX)
- GPU-accelerated builds (CUDA applications, ROCm workloads)
- Parallel computation tasks (scientific computing, rendering)
- AI inference pipelines (model serving, batch prediction)
The charm handles all GPU setup automatically - you just need to configure the hardware passthrough!
## Before You Start

This tutorial requires:
- A GPU in your machine (NVIDIA or AMD)
- GPU drivers installed on the host (`nvidia-smi` or `rocm-smi` working)
- Completed the first deployment tutorial
- LXD environment (for localhost/LXD cloud)
- ~30 minutes of time
## Step 1: Verify GPU on Host

First, let's make sure your GPU is visible on the host machine.

### 1. Check for NVIDIA GPU

```shell
nvidia-smi
```

Or for an AMD GPU:

```shell
rocm-smi
```

You should see your GPU listed. If not, install drivers first!
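The two checks can be combined into a small pre-flight script. This is only an illustrative sketch: it reports which GPU CLI tools, if any, are on your `PATH`.

```shell
# Pre-flight sketch: report which GPU CLI tools are available on the host.
for tool in nvidia-smi rocm-smi; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```

If both lines say `missing`, install the vendor drivers before continuing.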
## Step 2: Deploy PostgreSQL and Web Server

If you don't have a web server running yet, let's deploy one.

### 2. Create a model and deploy the prerequisites

```shell
juju add-model concourse-gpu

# Deploy database
juju deploy postgresql --channel 16/stable --base ubuntu@24.04

# Deploy web server
juju deploy concourse-ci-machine web \
  --config mode=web \
  --channel edge \
  --base ubuntu@24.04

# Connect them
juju integrate web:postgresql postgresql:database
```
## Step 3: Deploy GPU-Enabled Worker

Now for the exciting part - deploying a GPU worker!

### 3. Deploy a worker with GPU enabled (NVIDIA)

```shell
juju deploy concourse-ci-machine gpu-worker \
  --config mode=worker \
  --config compute-runtime=cuda \
  --channel edge \
  --base ubuntu@24.04
```

For AMD GPUs, use:

```shell
juju deploy concourse-ci-machine gpu-worker \
  --config mode=worker \
  --config compute-runtime=rocm \
  --channel edge \
  --base ubuntu@24.04
```

### 4. Connect the worker to the web server

```shell
juju integrate web:tsa gpu-worker:flight
```

This creates the TSA connection between the web node and the worker.
## Step 4: Add GPU to LXC Container

This is the crucial step! We need to pass the GPU hardware into the container.

### 5. Find the container name

```shell
juju status gpu-worker
```

Note the machine number (e.g., "4"). The container name is `juju-<model-id>-<machine-num>`.

### 6. List containers to find the exact name

```shell
# List all containers - look for the "juju-" prefix
lxc list
# Example output shows: juju-abc123-4 (your actual container)
```

Copy the container name from the output (e.g., `juju-abc123-4`).
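Steps 5 and 6 can also be scripted. The snippet below is a sketch that assumes a single gpu-worker machine and that `jq` is installed; adjust the pattern if your model has more machines.

```shell
# Sketch: find the LXD container backing the gpu-worker machine.
# Assumes exactly one gpu-worker machine and that jq is installed.
MACHINE=$(juju status gpu-worker --format=json | jq -r '.machines | keys[0]')
CONTAINER=$(lxc list --format csv --columns n | grep "^juju-.*-${MACHINE}$")
echo "Container: $CONTAINER"
```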
### 7. Add the GPU device to the container

```shell
# List available GPUs on the host
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, driver_version, vendor_id, product_id}'

# Replace juju-xxxxx-4 with your actual container name.
# This passes all host GPUs into the container:
lxc config device add juju-xxxxx-4 gpu0 gpu

# Or add a specific GPU only (e.g., GPU ID 1):
lxc config device add juju-xxxxx-4 gpu1 gpu id=1
```

For AMD ROCm GPUs, also add /dev/kfd:

```shell
lxc config device add juju-xxxxx-4 kfd unix-char \
  source=/dev/kfd \
  path=/dev/kfd
```
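An alternative to per-container device entries is a reusable LXD profile, so the devices survive container re-creation. The profile below is only an illustrative sketch (the name `gpu-passthrough` is invented); apply it with `lxc profile create gpu-passthrough`, `lxc profile edit gpu-passthrough`, then `lxc profile add <container> gpu-passthrough`.

```yaml
# Illustrative LXD profile for GPU passthrough (name is invented)
name: gpu-passthrough
description: Pass host GPU(s) into Juju worker containers
devices:
  gpu0:
    type: gpu
  # kfd is only needed for AMD ROCm
  kfd:
    type: unix-char
    source: /dev/kfd
    path: /dev/kfd
```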
## Step 5: Wait for Charm to Configure GPU

The charm will automatically detect the GPU and configure everything!

### 8. Watch the worker configure itself

```shell
juju status gpu-worker --watch 5s
```

Wait for the status to show:

```
Worker ready (v7.14.2) (GPU: 1x NVIDIA)
```

Or for AMD:

```
Worker ready (v7.14.2) (GPU: 1x AMD)
```
## Step 6: Test GPU Access

Let's verify the GPU is actually accessible in tasks.

### 9. Create a test pipeline

```shell
cat > gpu-test.yml <<'EOF'
jobs:
- name: gpu-check
  plan:
  - task: nvidia-smi
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: nvidia/cuda
          tag: 12.1.0-base-ubuntu22.04
      run:
        path: nvidia-smi
EOF
```
For AMD, use this instead:

```shell
cat > gpu-test.yml <<'EOF'
jobs:
- name: gpu-check
  plan:
  - task: rocm-smi
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/dev-ubuntu-24.04
          tag: latest
      run:
        path: rocm-smi
EOF
```
## Step 7: Run Your First GPU Task

### 10. Get the web server IP and log in

```shell
# Get IP
WEB_IP=$(juju status web/0 --format=json | jq -r '.applications.web.units["web/0"]["public-address"]')

# Get password
ADMIN_PASS=$(juju run web/leader get-admin-password --format=json | jq -r '."unit-web-0".results.password')

# Log in
fly -t gpu login -c http://$WEB_IP:8080 -u admin -p "$ADMIN_PASS"
```

### 11. Set and run the pipeline

```shell
fly -t gpu set-pipeline -p gpu-test -c gpu-test.yml
fly -t gpu unpause-pipeline -p gpu-test
fly -t gpu trigger-job -j gpu-test/gpu-check -w
```
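A pipeline isn't required just to poke at the GPU: `fly execute` can run a standalone task config directly. A sketch, assuming you save the config below as `gpu-task.yml` (the filename is illustrative) and run it with `fly -t gpu execute --tag cuda -c gpu-task.yml`:

```yaml
# gpu-task.yml - standalone task config for `fly execute` (illustrative)
platform: linux
image_resource:
  type: registry-image
  source:
    repository: nvidia/cuda
    tag: 12.1.0-base-ubuntu22.04
run:
  path: nvidia-smi
```

This is handy for quick iteration, since there's no pipeline to set and unpause each time.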
## Step 8: Run a Real ML Task

Let's run something more exciting - a PyTorch GPU test!

### 12. Create the PyTorch test pipeline

```shell
cat > pytorch-test.yml <<'EOF'
jobs:
- name: pytorch-gpu-test
  plan:
  - task: test-pytorch
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python
        args:
        - -c
        - |
          import torch
          print(f"PyTorch version: {torch.__version__}")
          print(f"CUDA available: {torch.cuda.is_available()}")
          print(f"CUDA version: {torch.version.cuda}")
          if torch.cuda.is_available():
              print(f"GPU device: {torch.cuda.get_device_name(0)}")
              print(f"GPU count: {torch.cuda.device_count()}")
              # Test actual GPU computation
              x = torch.rand(5, 3).cuda()
              y = x * 2
              print("GPU tensor computation successful!")
              print(f"Result: {y}")
          else:
              print("ERROR: CUDA not available!")
              exit(1)
EOF
```
### 13. Run the PyTorch test

```shell
fly -t gpu set-pipeline -p pytorch-test -c pytorch-test.yml
fly -t gpu unpause-pipeline -p pytorch-test
fly -t gpu trigger-job -j pytorch-test/pytorch-gpu-test -w
```

You should see output showing:

- ✅ CUDA is available
- ✅ Your GPU device name
- ✅ Successful tensor computation on the GPU
## What You've Accomplished

Congratulations! You now have:

- ✅ A GPU-enabled Concourse CI worker
- ✅ Automatic GPU device injection into tasks
- ✅ A worker tagged with its GPU capabilities (`cuda` or `rocm`)
- ✅ Verified GPU access from PyTorch
- ✅ Everything ready to run ML training pipelines!

Behind the scenes, the charm has:

- Installed `nvidia-container-toolkit` or `amd-container-toolkit`
- Created a custom OCI runtime wrapper
- Configured containerd to inject GPU devices
- Tagged the worker so pipelines can target it
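To go from the smoke test to something closer to real training, the same task shape scales up. The pipeline below is only an illustrative sketch (tiny synthetic data, invented job names): it fits a small linear model on the GPU.

```yaml
# train.yml - illustrative training pipeline (synthetic data, names invented)
jobs:
- name: train-model
  plan:
  - task: train
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python
        args:
        - -c
        - |
          import torch
          # Tiny linear-regression fit on synthetic data, just to exercise the GPU
          device = "cuda" if torch.cuda.is_available() else "cpu"
          X = torch.randn(1024, 8, device=device)
          w_true = torch.randn(8, 1, device=device)
          y = X @ w_true
          model = torch.nn.Linear(8, 1).to(device)
          opt = torch.optim.SGD(model.parameters(), lr=0.1)
          for step in range(200):
              opt.zero_grad()
              loss = torch.nn.functional.mse_loss(model(X), y)
              loss.backward()
              opt.step()
          print(f"final loss on {device}: {loss.item():.6f}")
```

Real pipelines would pull code and data via resources instead of inlining a script, but the `tags`/`image_resource`/`run` structure stays the same.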
## Next Steps

Now that you have GPU workers, explore these advanced topics:

- Mount datasets: Follow the Dataset Mounting tutorial to automatically inject training data
- Scale GPU workers: Add more GPU workers for parallel training
- Mixed worker fleet: Deploy both CPU and GPU workers, and use tags to route tasks
- Understanding the implementation: Read how GPU support works
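For the mixed-fleet idea, routing is purely tag-based: untagged steps run on untagged (CPU) workers, while tagged steps run only on workers advertising a matching tag. A sketch (job, task, and image names are illustrative):

```yaml
# Illustrative mixed-fleet pipeline: tags route each task to the right worker
jobs:
- name: build-and-train
  plan:
  - task: unit-tests        # no tags: runs on an untagged (CPU) worker
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: ubuntu, tag: "24.04"}
      run: {path: echo, args: ["running CPU-only tests"]}
  - task: train             # tagged: runs only on cuda-tagged GPU workers
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: pytorch/pytorch, tag: latest}
      run: {path: python, args: ["-c", "import torch; print(torch.cuda.is_available())"]}
```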
## Troubleshooting

### Worker shows "GPU enabled but no GPU detected"

Solution: The LXC GPU device isn't added yet. Run step 7 again to add the GPU device to the container.
### PyTorch says "CUDA not available"

Check these:

- Is the GPU visible in the container? `lxc exec <container> -- nvidia-smi`
- Is `nvidia-container-runtime` installed? `juju ssh gpu-worker/0 'which nvidia-container-runtime'`
- Check the worker logs: `juju debug-log --include gpu-worker/0`
### AMD GPU: "HSA_STATUS_ERROR_OUT_OF_RESOURCES"

For integrated AMD GPUs (APUs), add this to your pipeline task:

```yaml
env:
  HSA_OVERRIDE_GFX_VERSION: "11.0.0"  # For gfx1103 / Phoenix1
```
### Task can't find GPU devices

Verify:

- The task uses `tags: [cuda]` or `tags: [rocm]`
- The worker is registered: `fly -t gpu workers` should show the GPU tags
- The container has the GPU device: `lxc config device show <container>`