ROCm Verification Reference

AMD GPU support, verification commands, and troubleshooting

Overview

This reference describes how to verify AMD GPU support in Concourse CI workers. ROCm (Radeon Open Compute) is AMD's platform for GPU compute, supporting machine-learning frameworks such as PyTorch and TensorFlow.

Note: ROCm support requires AMD GPU hardware, amdgpu kernel module, and /dev/kfd device access. Integrated AMD GPUs (APUs) require additional workarounds (see below).

Supported AMD GPU Types

| GPU Type | Support Level | Workaround Required | Production Ready |
| --- | --- | --- | --- |
| Discrete GPUs (RX 6000/7000, Radeon Pro, Instinct MI) | Full native support | No | ✅ Yes |
| Integrated GPUs / APUs (Phoenix1/gfx1103, Renoir/gfx90c, Cezanne/gfx90c) | Experimental with workaround | Yes (HSA_OVERRIDE_GFX_VERSION) | ❌ No (dev/test only) |

Discrete GPU Examples

- Radeon RX 6000/7000 series (e.g. RX 7900 XT)
- Radeon Pro workstation GPUs
- Instinct MI accelerators

Integrated GPU Examples

- Phoenix1 (gfx1103), e.g. Ryzen 7 7840HS with Radeon 780M iGPU
- Renoir (gfx90c), Ryzen 4000 series with Vega iGPU
- Cezanne (gfx90c), Ryzen 5000 series with Vega iGPU

Warning: Integrated GPUs share system memory and use suboptimal ROCm kernels. Performance is significantly lower than discrete GPUs. Not recommended for production ML training workloads.

Verification Commands

Host-Level Verification

Run these commands on the host machine to verify AMD GPU hardware and drivers.

1. Check GPU Hardware

# List AMD GPUs
lspci | grep -i amd

# Expected output (example):
# 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT]
# 06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (integrated)

2. Check AMD GPU Driver

# Verify amdgpu kernel module loaded
lsmod | grep amdgpu

# Expected output:
# amdgpu              12345678  0
# drm_ttm_helper        16384  1 amdgpu
# ...

3. Check DRM Devices

# List DRM devices
ls -la /dev/dri/

# Expected output:
# crw-rw----+ 1 root video 226,   0 Feb  4 09:00 card0
# crw-rw----+ 1 root video 226,   1 Feb  4 09:00 card1
# crw-rw----+ 1 root render 226, 128 Feb  4 09:00 renderD128
# crw-rw----+ 1 root render 226, 129 Feb  4 09:00 renderD129

4. Check /dev/kfd (Critical for Compute)

# Verify KFD device exists
ls -la /dev/kfd

# Expected output:
# crw-rw-rw- 1 root root 236, 0 Feb  4 09:00 /dev/kfd

Critical: /dev/kfd (Kernel Fusion Driver) is required for ROCm compute workloads. PyTorch and TensorFlow will not detect the GPU without this device. rocm-smi works without it (monitoring only), but compute operations fail.

5. Query GPU Information (LXC)

# Query all GPU cards with detailed information
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, driver_version, vendor_id, product_id}'

# Expected output (example):
# {
#   "id": 0,
#   "driver": "nvidia",
#   "driver_version": "580.95",
#   "vendor_id": "10de",
#   "product_id": "2484"
# }
# {
#   "id": 1,
#   "driver": "amdgpu",
#   "driver_version": "5.15.0-97-generic",
#   "vendor_id": "1002",
#   "product_id": "744c"
# }
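The host-level checks above can be combined into a single pre-flight script. The sketch below is hypothetical glue, not part of any charm: the check_host function and its root parameter are invented here (root exists only so the checks can be pointed at a test filesystem), and it inspects the filesystem rather than replacing lspci.

```python
import glob
import os


def check_host(root="/"):
    """Pre-flight check for steps 2-4 above (sketch).

    Returns a dict of check name -> bool. On a real host, call with the
    default root="/".
    """
    results = {}
    # Step 2: amdgpu kernel module loaded (parse /proc/modules, as lsmod does)
    try:
        with open(os.path.join(root, "proc/modules")) as f:
            results["amdgpu_module"] = any(
                line.split()[0] == "amdgpu" for line in f if line.strip()
            )
    except OSError:
        results["amdgpu_module"] = False
    # Step 3: DRM render nodes present
    results["drm_devices"] = bool(glob.glob(os.path.join(root, "dev/dri/renderD*")))
    # Step 4: /dev/kfd present (required for compute, not just monitoring)
    results["kfd_device"] = os.path.exists(os.path.join(root, "dev/kfd"))
    return results


if __name__ == "__main__":
    for name, ok in check_host().items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
```

Running this on the worker host before deploying a pipeline catches the most common failure (missing /dev/kfd) early.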

Container-Level Verification

Run these commands inside the Concourse worker container to verify GPU passthrough.

1. Check DRM Devices in Container

# SSH into worker unit
juju ssh worker/0

# List DRM devices
ls -la /dev/dri/

# Expected output (same devices as host):
# crw-rw----+ 1 root video 226,   0 Feb  4 09:00 card0
# crw-rw----+ 1 root render 226, 128 Feb  4 09:00 renderD128

2. Check /dev/kfd in Container

# Inside worker container
ls -la /dev/kfd

# Expected output:
# crw-rw-rw- 1 root root 236, 0 Feb  4 09:00 /dev/kfd

Common Issue: /dev/kfd is often missing in containers even when /dev/dri/* devices are present. This causes PyTorch to report "CUDA (ROCm) available: False". Solution: lxc config device add <container> kfd unix-char source=/dev/kfd path=/dev/kfd

3. Check ROCm Installation

# Inside worker container
which rocm-smi

# Expected output:
# /opt/rocm/bin/rocm-smi

4. Run rocm-smi

# Inside worker container
rocm-smi

# Expected output (example):
# ======================= ROCm System Management Interface =======================
# ================================= Concise Info =================================
# GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
# 0    35.0c           20.0W   800Mhz  1000Mhz  0%   auto  203.0W    0%   0%
# ================================================================================
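If rocm-smi output is collected in pipeline logs, the concise info rows can be parsed for automated health checks. A minimal sketch: parse_concise_line is a hypothetical helper, and the column layout is assumed from the example output above (it can differ between rocm-smi versions).

```python
def parse_concise_line(line):
    """Parse one GPU row of rocm-smi concise output into a dict (sketch).

    Assumes the column order shown above:
    GPU  Temp  AvgPwr  SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%
    """
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"unexpected rocm-smi row: {line!r}")
    return {
        "gpu": int(fields[0]),
        "temp_c": float(fields[1].rstrip("c")),
        "power_w": float(fields[2].rstrip("W")),
        "sclk_mhz": int(fields[3].rstrip("Mhz")),
        "mclk_mhz": int(fields[4].rstrip("Mhz")),
        "vram_pct": int(fields[8].rstrip("%")),
        "gpu_pct": int(fields[9].rstrip("%")),
    }
```

For example, the sample row above yields temp_c=35.0 and sclk_mhz=800.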

PyTorch Verification in Concourse Task

Verify GPU access from within a Concourse CI task container.

Discrete GPU Test

jobs:
- name: verify-rocm-discrete
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Check ROCm availability
          rocm-smi
          
          # Check devices
          ls -la /dev/dri/ /dev/kfd
          
          # PyTorch GPU test
          python3 -c "
          import torch
          print('PyTorch version:', torch.__version__)
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          print('GPU count:', torch.cuda.device_count())
          if torch.cuda.is_available():
              print('GPU name:', torch.cuda.get_device_name(0))
              x = torch.rand(5, 3).cuda()
              y = x * 2
              print('GPU computation succeeded!')
              print('Result:', y)
          "

Integrated GPU Test (with HSA_OVERRIDE_GFX_VERSION)

jobs:
- name: verify-rocm-integrated
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Set override for gfx1103 (Phoenix1 APU)
          export HSA_OVERRIDE_GFX_VERSION=11.0.0
          
          # Check GPU architecture
          rocm-smi --showproductname
          
          # PyTorch GPU test
          python3 -c "
          import torch
          print('PyTorch version:', torch.__version__)
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          if torch.cuda.is_available():
              print('GPU name:', torch.cuda.get_device_name(0))
              x = torch.rand(5, 3).cuda()
              y = x * 2
              print('GPU computation succeeded!')
              print('Result:', y)
          else:
              print('ERROR: GPU not detected. Check /dev/kfd and HSA_OVERRIDE_GFX_VERSION.')
          "

HSA_OVERRIDE_GFX_VERSION Workaround

Why It's Needed

Integrated AMD GPUs (APUs) use GFX architectures not officially supported by ROCm. ROCm checks the GPU's GFX version and rejects unsupported versions. The HSA_OVERRIDE_GFX_VERSION environment variable tells ROCm to use compute kernels from a supported architecture instead.

Override Values Table

| GPU Architecture | GFX Version | Override Value | Examples |
| --- | --- | --- | --- |
| Phoenix1 (RDNA 3) | gfx1103 | 11.0.0 | Ryzen 7 7840HS (780M iGPU) |
| Renoir (Zen 2) | gfx90c | 9.0.0 | Ryzen 4000 series (Vega iGPU) |
| Cezanne (Zen 3) | gfx90c | 9.0.0 | Ryzen 5000 series (Vega iGPU) |
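The table above can be encoded as a small lookup so scripts pick the override from the reported GFX version instead of hard-coding it. A sketch: GFX_OVERRIDES and override_for are hypothetical names, and the mapping covers only the APUs documented here.

```python
# Maps the GFX version reported by the GPU to the HSA_OVERRIDE_GFX_VERSION
# value from the table above. Only the APUs documented here are covered.
GFX_OVERRIDES = {
    "gfx1103": "11.0.0",  # Phoenix1 (RDNA 3), e.g. Ryzen 7 7840HS
    "gfx90c": "9.0.0",    # Renoir / Cezanne (Vega iGPU)
}


def override_for(gfx_version):
    """Return the override value, or None for GPUs that need no override."""
    return GFX_OVERRIDES.get(gfx_version)
```

A task wrapper could then set the environment variable only when override_for returns a value.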

How to Use in Concourse Tasks

Add export HSA_OVERRIDE_GFX_VERSION=<value> at the beginning of your task script:

run:
  path: sh
  args:
  - -c
  - |
    # Set override BEFORE importing PyTorch/TensorFlow
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    
    # Your GPU workload
    python3 train.py --use-gpu

Limitations

- Performance is significantly lower than on discrete GPUs: APUs share system memory and run kernels built for a different architecture.
- The workaround is experimental; use it for development and testing only, not production ML training.
- The override value must correspond to an architecture close to your GPU's actual GFX version; an incorrect value can cause crashes or wrong results.

Testing Override on Host

# Test integrated GPU with Docker before deploying pipeline
docker run --rm -it --device=/dev/kfd --device=/dev/dri \
  rocm/pytorch:latest sh -c "
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    python3 -c 'import torch; print(torch.cuda.is_available()); x = torch.rand(5,3).cuda(); print(x * 2)'
  "

Common Issues and Solutions

Issue: "GPU enabled but no GPU detected"

Symptom: Charm status shows "GPU enabled" but worker doesn't detect GPU.

Causes & Solutions:

- No AMD GPU hardware: verify with lspci | grep -i amd; make sure the GPU is installed and recognized by the host.
- amdgpu driver not loaded: verify with lsmod | grep amdgpu; load it with modprobe amdgpu.
- Missing /dev/dri/ devices: verify with ls -la /dev/dri/; check the driver installation and reboot if necessary.

Issue: "CUDA (ROCm) available: False" in PyTorch

Symptom: torch.cuda.is_available() returns False in task containers.

Causes & Solutions (in order of likelihood):

1. Missing /dev/kfd (most common): verify with juju ssh worker/0 -- ls -la /dev/kfd; fix with lxc config device add <container> kfd unix-char source=/dev/kfd path=/dev/kfd.
2. Integrated GPU without override: verify with rocm-smi --showproductname; add export HSA_OVERRIDE_GFX_VERSION=11.0.0 to the task script (adjust the value for your GPU).
3. Wrong LXC GPU passthrough: verify with lxc config device show <container>; use lxc config device add <container> gpu1 gpu id=1 to target the specific AMD GPU instead of a generic gpu device.
4. Unsupported GPU: verify with cat /sys/class/drm/card*/device/uevent | grep PCI_ID; check your GPU against the ROCm GPU support list.
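The likelihood ordering above can be expressed as a small decision helper for a diagnostic script. A sketch under stated assumptions: likely_cause and its boolean arguments are hypothetical, and the caller is expected to gather the facts with the verification commands listed above.

```python
def likely_cause(kfd_present, is_integrated, override_set, amd_gpu_passed):
    """Return the most likely cause of torch.cuda.is_available() == False,
    checked in the order of likelihood given above (sketch)."""
    if not kfd_present:
        return "missing /dev/kfd: add it with lxc config device add ... kfd unix-char"
    if is_integrated and not override_set:
        return "integrated GPU without HSA_OVERRIDE_GFX_VERSION set"
    if not amd_gpu_passed:
        return "wrong LXC GPU passthrough: target the AMD GPU by id"
    return "GPU may be unsupported: check the ROCm support matrix"
```

Checking the causes in this fixed order mirrors how you would debug by hand: rule out the most common failure first.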

Issue: "rocm-smi works but PyTorch doesn't detect GPU"

Symptom: rocm-smi shows GPU info, but PyTorch/TensorFlow can't use GPU.

Root Cause: /dev/kfd is missing or inaccessible.

Explanation: rocm-smi reads monitoring data through sysfs and the DRM devices (/dev/dri/*), so it works without /dev/kfd. Compute workloads go through the ROCm HSA runtime, which enumerates GPUs via the Kernel Fusion Driver; when /dev/kfd is missing, the runtime finds zero compute agents and PyTorch/TensorFlow report no GPU.

Solution:

# Add /dev/kfd device to LXC container
lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd

# Restart worker service
juju ssh worker/0 -- sudo systemctl restart concourse-worker

Issue: "HSA_STATUS_ERROR_OUT_OF_RESOURCES"

Symptom: PyTorch raises HSA_STATUS_ERROR_OUT_OF_RESOURCES exception.

Causes & Solutions:

- Insufficient GPU memory: integrated GPUs share system RAM, so large models or batch sizes can exhaust the available allocation; reduce the batch size or model size.
- Device permissions: the task user cannot fully access /dev/kfd or /dev/dri/renderD*; verify device permissions inside the container.

Issue: Multi-GPU System Detects Wrong GPU

Symptom: Worker detects NVIDIA GPU when AMD GPU is desired (or vice versa).

Cause: Generic lxc config device add ... gpu passes all GPUs to container.

Solution: Use specific GPU ID:

# Query GPU IDs
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, vendor_id, product_id}'

# Output example:
# {"id": 0, "driver": "nvidia", "vendor_id": "10de", "product_id": "2484"}
# {"id": 1, "driver": "amdgpu", "vendor_id": "1002", "product_id": "744c"}

# Add specific AMD GPU (id=1)
lxc config device add <container> gpu1 gpu id=1
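Selecting the AMD card id from the lxc query output can be scripted instead of read by eye. A sketch assuming the JSON shape shown above; amd_gpu_id is a hypothetical helper, and the sample payload is the example output from this section.

```python
import json


def amd_gpu_id(resources_json):
    """Return the DRM id of the first card using the amdgpu driver,
    given the output of `lxc query /1.0/resources` (sketch)."""
    data = json.loads(resources_json)
    for card in data["gpu"]["cards"]:
        if card.get("driver") == "amdgpu":
            return card["drm"]["id"]
    return None


# Example with the output shown above:
sample = """
{"gpu": {"cards": [
  {"drm": {"id": 0}, "driver": "nvidia", "vendor_id": "10de", "product_id": "2484"},
  {"drm": {"id": 1}, "driver": "amdgpu", "vendor_id": "1002", "product_id": "744c"}
]}}
"""
print(amd_gpu_id(sample))  # -> 1, so: lxc config device add <container> gpu1 gpu id=1
```

Returning None when no amdgpu card is found lets a wrapper fail loudly instead of passing the wrong GPU through.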

Performance Expectations

Discrete GPU Performance

Discrete GPUs have dedicated VRAM and natively supported ROCm kernels, so they deliver full performance for ML training and inference workloads.

Integrated GPU Performance

Integrated GPUs share system memory and run with override kernels, so expect roughly 3x lower throughput than a discrete GPU (see the benchmark below); they are suitable for development and testing only.

Benchmark Comparison (Example)

| Workload | RX 7900 XT (Discrete) | Ryzen 7 7840HS (Integrated) | Performance Ratio |
| --- | --- | --- | --- |
| PyTorch MNIST Training | 15 seconds | 45 seconds | 3x slower |
| TensorFlow Image Classification | 120 seconds | 380 seconds | 3.2x slower |
| Matrix Multiplication (4096x4096) | 8 ms | 25 ms | 3.1x slower |

Further Reading