AMD GPU support, verification commands, and troubleshooting
This reference provides complete specifications for verifying AMD GPU support in Concourse CI workers. ROCm (Radeon Open Compute) is AMD's platform for GPU compute, supporting machine learning frameworks like PyTorch and TensorFlow.
Discrete AMD GPUs are supported natively by ROCm, requiring only the amdgpu kernel module and /dev/kfd device access. Integrated AMD GPUs (APUs) require additional workarounds (see below).
| GPU Type | Support Level | Workaround Required | Production Ready |
|---|---|---|---|
| Discrete GPUs (RX 6000/7000, Radeon Pro, Instinct MI) | Full Native Support | No | ✅ Yes |
| Integrated GPUs (APUs) (Phoenix1/gfx1103, Renoir/gfx90c, Cezanne/gfx90c) | Experimental with Workaround | Yes (`HSA_OVERRIDE_GFX_VERSION`) | ❌ No (Dev/Test only) |
Run these commands on the host machine to verify AMD GPU hardware and drivers.
```shell
# List AMD GPUs
lspci | grep -i amd

# Expected output (example):
# 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT]
# 06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (integrated)

# Verify the amdgpu kernel module is loaded
lsmod | grep amdgpu

# Expected output:
# amdgpu               12345678  0
# drm_ttm_helper          16384  1 amdgpu
# ...

# List DRM devices
ls -la /dev/dri/

# Expected output:
# crw-rw----+ 1 root video  226,   0 Feb  4 09:00 card0
# crw-rw----+ 1 root video  226,   1 Feb  4 09:00 card1
# crw-rw----+ 1 root render 226, 128 Feb  4 09:00 renderD128
# crw-rw----+ 1 root render 226, 129 Feb  4 09:00 renderD129

# Verify the KFD device exists
ls -la /dev/kfd

# Expected output:
# crw-rw-rw- 1 root root 236, 0 Feb  4 09:00 /dev/kfd
```
`/dev/kfd` (Kernel Fusion Driver) is required for ROCm compute workloads. PyTorch and TensorFlow will not detect the GPU without this device. `rocm-smi` works without it (monitoring only), but compute operations fail.
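The device requirements above can be checked programmatically before launching a workload. A minimal Python sketch (the `missing_devices` helper is illustrative, not part of ROCm):

```python
import os

# Device nodes that ROCm compute workloads need; /dev/kfd is the usual culprit.
REQUIRED_DEVICES = ["/dev/kfd", "/dev/dri"]

def missing_devices(paths=REQUIRED_DEVICES):
    """Return the subset of device paths that do not exist on this machine."""
    return [p for p in paths if not os.path.exists(p)]

if __name__ == "__main__":
    missing = missing_devices()
    if missing:
        print("GPU compute unavailable, missing:", ", ".join(missing))
    else:
        print("All ROCm device nodes present")
```

Running this inside the task container before the real workload gives a clearer failure message than a bare `torch.cuda.is_available() == False`.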
```shell
# Query all GPU cards with detailed information
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, driver_version, vendor_id, product_id}'

# Expected output (example):
# {
#   "id": 0,
#   "driver": "nvidia",
#   "driver_version": "580.95",
#   "vendor_id": "10de",
#   "product_id": "2484"
# }
# {
#   "id": 1,
#   "driver": "amdgpu",
#   "driver_version": "5.15.0-97-generic",
#   "vendor_id": "1002",
#   "product_id": "744c"
# }
```
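On a mixed NVIDIA/AMD host, the same JSON can be filtered in Python to pick out the AMD card's DRM id (sample data abridged from the output above; the field layout follows the `lxc query /1.0/resources` shape used in this document):

```python
import json

# Sample `lxc query /1.0/resources` output, abridged to the relevant fields
resources = json.loads("""
{"gpu": {"cards": [
  {"drm": {"id": 0}, "driver": "nvidia", "vendor_id": "10de", "product_id": "2484"},
  {"drm": {"id": 1}, "driver": "amdgpu", "vendor_id": "1002", "product_id": "744c"}
]}}
""")

def amd_card_ids(res):
    """Return DRM ids of all cards bound to the amdgpu driver."""
    return [c["drm"]["id"] for c in res["gpu"]["cards"] if c["driver"] == "amdgpu"]

# These ids are what you pass to `lxc config device add ... gpu id=<id>`
print(amd_card_ids(resources))
```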
Run these commands inside the Concourse worker container to verify GPU passthrough.
```shell
# SSH into the worker unit
juju ssh worker/0

# List DRM devices
ls -la /dev/dri/

# Expected output (same devices as the host):
# crw-rw----+ 1 root video  226,   0 Feb  4 09:00 card0
# crw-rw----+ 1 root render 226, 128 Feb  4 09:00 renderD128

# Verify the KFD device (inside the worker container)
ls -la /dev/kfd

# Expected output:
# crw-rw-rw- 1 root root 236, 0 Feb 4 09:00 /dev/kfd
```
`/dev/kfd` is often missing in containers even when `/dev/dri/*` devices are present. This causes PyTorch to report `CUDA (ROCm) available: False`. Solution: `lxc config device add <container> kfd unix-char source=/dev/kfd path=/dev/kfd`
```shell
# Inside the worker container
which rocm-smi

# Expected output:
# /opt/rocm/bin/rocm-smi

# Inside the worker container
rocm-smi

# Expected output (example):
# ======================= ROCm System Management Interface =======================
# ================================= Concise Info =================================
# GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
# 0    35.0c           20.0W   800Mhz  1000Mhz  0%   auto  203.0W    0%     0%
# ================================================================================
```
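For scripted health checks, a data row of the concise table can be parsed with plain string splitting. A sketch against the sample row above (column positions assumed from that sample; real `rocm-smi` formatting varies between ROCm versions):

```python
# One data row from rocm-smi's "Concise Info" table (sample from above)
row = "0    35.0c           20.0W   800Mhz  1000Mhz  0%   auto  203.0W    0%     0%"

fields = row.split()
gpu_id, temp, power = fields[0], fields[1], fields[2]
vram_pct, gpu_pct = fields[-2], fields[-1]  # last two columns: VRAM% and GPU%

print(f"GPU {gpu_id}: temp={temp} power={power} vram={vram_pct} util={gpu_pct}")
```

For anything beyond a quick check, prefer `rocm-smi --json` style machine-readable output if your ROCm version provides it, rather than parsing the human-oriented table.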
Verify GPU access from within a Concourse CI task container.
```yaml
jobs:
- name: verify-rocm-discrete
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Check ROCm availability
          rocm-smi
          # Check devices
          ls -la /dev/dri/ /dev/kfd
          # PyTorch GPU test
          python3 -c "
          import torch
          print('PyTorch version:', torch.__version__)
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          print('GPU count:', torch.cuda.device_count())
          if torch.cuda.is_available():
              print('GPU name:', torch.cuda.get_device_name(0))
              x = torch.rand(5, 3).cuda()
              y = x * 2
              print('GPU computation succeeded!')
              print('Result:', y)
          "
```
```yaml
jobs:
- name: verify-rocm-integrated
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Set override for gfx1103 (Phoenix1 APU)
          export HSA_OVERRIDE_GFX_VERSION=11.0.0
          # Check GPU architecture
          rocm-smi --showproductname
          # PyTorch GPU test
          python3 -c "
          import torch
          print('PyTorch version:', torch.__version__)
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          if torch.cuda.is_available():
              print('GPU name:', torch.cuda.get_device_name(0))
              x = torch.rand(5, 3).cuda()
              y = x * 2
              print('GPU computation succeeded!')
              print('Result:', y)
          else:
              print('ERROR: GPU not detected. Check /dev/kfd and HSA_OVERRIDE_GFX_VERSION.')
          "
```
Integrated AMD GPUs (APUs) use GFX architectures not officially supported by ROCm. ROCm checks the GPU's GFX version and rejects unsupported versions. The HSA_OVERRIDE_GFX_VERSION environment variable tells ROCm to use compute kernels from a supported architecture instead.
| GPU Architecture | GFX Version | Override Value | Examples |
|---|---|---|---|
| Phoenix1 (RDNA 3) | gfx1103 | 11.0.0 | Ryzen 7 7840HS (780M iGPU) |
| Renoir (Zen 2) | gfx90c | 9.0.0 | Ryzen 4000 series (Vega iGPU) |
| Cezanne (Zen 3) | gfx90c | 9.0.0 | Ryzen 5000 series (Vega iGPU) |
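The mapping in the table can be encoded directly when generating task configs. A sketch (the `override_for` helper is illustrative, not a ROCm API):

```python
# GFX architecture -> HSA_OVERRIDE_GFX_VERSION value, taken from the table above
GFX_OVERRIDES = {
    "gfx1103": "11.0.0",  # Phoenix1 (RDNA 3)
    "gfx90c": "9.0.0",    # Renoir / Cezanne (Vega iGPU)
}

def override_for(gfx_version):
    """Return the override value for a known APU architecture, or None."""
    return GFX_OVERRIDES.get(gfx_version)

print(override_for("gfx1103"))  # -> 11.0.0
```

A `None` result means the architecture is not in the table; either it is natively supported (no override needed) or it needs its own override value from the ROCm compatibility docs.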
Add `export HSA_OVERRIDE_GFX_VERSION=<value>` at the beginning of your task script:
```yaml
run:
  path: sh
  args:
  - -c
  - |
    # Set the override BEFORE importing PyTorch/TensorFlow
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    # Your GPU workload
    python3 train.py --use-gpu
```
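The override can also be set from inside Python, as long as it happens before `torch` is imported (ROCm reads the variable when the runtime initializes). A minimal sketch:

```python
import os

# Must be set BEFORE importing torch; ROCm reads it at runtime initialization.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"  # gfx1103 / Phoenix1 APU

# import torch  # import only after the override is in place
print(os.environ["HSA_OVERRIDE_GFX_VERSION"])
```

Setting the variable after `import torch` has no effect, which is why the `export` form at the top of the task script is the safer default.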
```shell
# Test the integrated GPU with Docker before deploying the pipeline
docker run --rm -it --device=/dev/kfd --device=/dev/dri \
  rocm/pytorch:latest sh -c "
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    python3 -c 'import torch; print(torch.cuda.is_available()); x = torch.rand(5,3).cuda(); print(x * 2)'
  "
```
Symptom: Charm status shows "GPU enabled" but worker doesn't detect GPU.
Causes & Solutions:
| Cause | Verification Command | Solution |
|---|---|---|
| No AMD GPU hardware | `lspci \| grep -i amd` | Verify the GPU is installed and recognized by the host |
| amdgpu driver not loaded | `lsmod \| grep amdgpu` | `modprobe amdgpu` |
| Missing /dev/dri/ devices | `ls -la /dev/dri/` | Check driver installation; reboot if necessary |
Symptom: `torch.cuda.is_available()` returns `False` in task containers.
Causes & Solutions (in order of likelihood):
| Cause | Verification Command | Solution |
|---|---|---|
| Missing /dev/kfd (most common) | `juju ssh worker/0 -- ls -la /dev/kfd` | `lxc config device add <container> kfd unix-char source=/dev/kfd path=/dev/kfd` |
| Integrated GPU without override | `rocm-smi --showproductname` | Add `export HSA_OVERRIDE_GFX_VERSION=11.0.0` to the task script (adjust the version for your GPU) |
| Wrong LXC GPU passthrough | `lxc config device show <container>` | Use `lxc config device add <container> gpu1 gpu id=1` to target the specific AMD GPU (not a generic `gpu` device) |
| Unsupported GPU | `cat /sys/class/drm/card*/device/uevent \| grep PCI_ID` | Check GPU compatibility: ROCm GPU Support |
Symptom: rocm-smi shows GPU info, but PyTorch/TensorFlow can't use GPU.
Root Cause: /dev/kfd is missing or inaccessible.
Explanation:
- `rocm-smi` only needs `/dev/dri/*` devices for monitoring (temperature, clock speeds, etc.)
- `/dev/kfd` is required for GPU compute operations (kernel submission, memory management)

Solution:
```shell
# Add the /dev/kfd device to the LXC container
lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd

# Restart the worker service
juju ssh worker/0 -- sudo systemctl restart concourse-worker
```
Symptom: PyTorch raises HSA_STATUS_ERROR_OUT_OF_RESOURCES exception.
Causes & Solutions:
- Integrated GPU without an override: set `HSA_OVERRIDE_GFX_VERSION` (see above).

Symptom: Worker detects the NVIDIA GPU when the AMD GPU is desired (or vice versa).
Cause: A generic `lxc config device add ... gpu` device passes all GPUs to the container.
Solution: Use specific GPU ID:
```shell
# Query GPU IDs
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, vendor_id, product_id}'

# Output example:
# {"id": 0, "driver": "nvidia", "vendor_id": "10de", "product_id": "2484"}
# {"id": 1, "driver": "amdgpu", "vendor_id": "1002", "product_id": "744c"}

# Add the specific AMD GPU (id=1)
lxc config device add <container> gpu1 gpu id=1
```
| Workload | RX 7900 XT (Discrete) | Ryzen 7 7840HS (Integrated) | Performance Ratio |
|---|---|---|---|
| PyTorch MNIST Training | 15 seconds | 45 seconds | 3x slower |
| TensorFlow Image Classification | 120 seconds | 380 seconds | 3.2x slower |
| Matrix Multiplication (4096x4096) | 8 ms | 25 ms | 3.1x slower |
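The ratio column follows directly from the timings in the table; for example:

```python
# Timings from the table above (discrete vs. integrated)
print(round(45 / 15, 1))    # PyTorch MNIST: 3.0x
print(round(380 / 120, 1))  # TensorFlow classification: 3.2x
print(round(25 / 8, 1))     # 4096x4096 matmul (ms): 3.1x
```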