ROCm Verification Reference

AMD GPU support, verification commands, and troubleshooting

Overview

This reference describes how to verify AMD GPU support in Concourse CI workers. ROCm (Radeon Open Compute) is AMD's platform for GPU compute, supporting machine-learning frameworks such as PyTorch and TensorFlow.

Note: ROCm support requires AMD GPU hardware, amdgpu kernel module, and /dev/kfd device access. Integrated AMD GPUs (APUs) require additional workarounds (see below).

Supported AMD GPU Types

| GPU Type | Support Level | Workaround Required | Production Ready |
| --- | --- | --- | --- |
| Discrete GPUs (RX 6000/7000, Radeon Pro, Instinct MI) | Full native support | No | ✅ Yes |
| Integrated GPUs / APUs (Phoenix1/gfx1103, Renoir/gfx90c, Cezanne/gfx90c) | Experimental with workaround | Yes (HSA_OVERRIDE_GFX_VERSION) | ❌ No (dev/test only) |

Discrete GPU Examples

- Radeon RX 6000/7000 series (e.g. RX 7900 XT)
- Radeon Pro workstation GPUs
- Instinct MI accelerators

Integrated GPU Examples

- Phoenix1 (gfx1103), e.g. Ryzen 7 7840HS with Radeon 780M iGPU
- Renoir (gfx90c), Ryzen 4000 series with Vega iGPU
- Cezanne (gfx90c), Ryzen 5000 series with Vega iGPU

Warning: Integrated GPUs share system memory and use suboptimal ROCm kernels. Performance is significantly lower than discrete GPUs. Not recommended for production ML training workloads.

Verification Commands

Host-Level Verification

Run these commands on the host machine to verify AMD GPU hardware and drivers.

1. Check GPU Hardware

# List AMD GPUs
lspci | grep -i amd

# Expected output (example):
# 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT]
# 06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (integrated)

2. Check AMD GPU Driver

# Verify amdgpu kernel module loaded
lsmod | grep amdgpu

# Expected output:
# amdgpu              12345678  0
# drm_ttm_helper        16384  1 amdgpu
# ...

3. Check DRM Devices

# List DRM devices
ls -la /dev/dri/

# Expected output:
# crw-rw----+ 1 root video 226,   0 Feb  4 09:00 card0
# crw-rw----+ 1 root video 226,   1 Feb  4 09:00 card1
# crw-rw----+ 1 root render 226, 128 Feb  4 09:00 renderD128
# crw-rw----+ 1 root render 226, 129 Feb  4 09:00 renderD129

4. Check /dev/kfd (Critical for Compute)

# Verify KFD device exists
ls -la /dev/kfd

# Expected output:
# crw-rw-rw- 1 root root 236, 0 Feb  4 09:00 /dev/kfd

Critical: /dev/kfd (Kernel Fusion Driver) is required for ROCm compute workloads. PyTorch and TensorFlow will not detect the GPU without this device. rocm-smi works without it (monitoring only), but compute operations fail.

5. Query GPU Information (LXC)

# Query all GPU cards with detailed information
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, driver_version, vendor_id, product_id}'

# Expected output (example):
# {
#   "id": 0,
#   "driver": "nvidia",
#   "driver_version": "580.95",
#   "vendor_id": "10de",
#   "product_id": "2484"
# }
# {
#   "id": 1,
#   "driver": "amdgpu",
#   "driver_version": "5.15.0-97-generic",
#   "vendor_id": "1002",
#   "product_id": "744c"
# }
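The host-level checks above can be combined into a single pre-flight script. The sketch below is hypothetical glue, not part of any charm: the check_host function and its root parameter are invented here (root exists only so the checks can be pointed at a test filesystem), and it inspects the filesystem rather than replacing lspci.

```python
import glob
import os


def check_host(root="/"):
    """Pre-flight check for steps 2-4 above (sketch).

    Returns a dict of check name -> bool. On a real host, call with the
    default root="/".
    """
    results = {}
    # Step 2: amdgpu kernel module loaded (parse /proc/modules, as lsmod does)
    try:
        with open(os.path.join(root, "proc/modules")) as f:
            results["amdgpu_module"] = any(
                line.split()[0] == "amdgpu" for line in f if line.strip()
            )
    except OSError:
        results["amdgpu_module"] = False
    # Step 3: DRM render nodes present
    results["drm_devices"] = bool(glob.glob(os.path.join(root, "dev/dri/renderD*")))
    # Step 4: /dev/kfd present (required for compute, not just monitoring)
    results["kfd_device"] = os.path.exists(os.path.join(root, "dev/kfd"))
    return results


if __name__ == "__main__":
    for name, ok in check_host().items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
```

Running this on the worker host before deploying a pipeline catches the most common failure (missing /dev/kfd) early.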

Container-Level Verification

Run these commands inside the Concourse worker container to verify GPU passthrough.

1. Check DRM Devices in Container

# SSH into worker unit
juju ssh worker/0

# List DRM devices
ls -la /dev/dri/

# Expected output (same devices as host):
# crw-rw----+ 1 root video 226,   0 Feb  4 09:00 card0
# crw-rw----+ 1 root render 226, 128 Feb  4 09:00 renderD128

2. Check /dev/kfd in Container

# Inside worker container
ls -la /dev/kfd

# Expected output:
# crw-rw-rw- 1 root root 236, 0 Feb  4 09:00 /dev/kfd

Common Issue: /dev/kfd is often missing in containers even when /dev/dri/* devices are present. This causes PyTorch to report "CUDA (ROCm) available: False". Solution: lxc config device add <container> kfd unix-char source=/dev/kfd path=/dev/kfd

3. Check ROCm Installation

# Inside worker container
which rocm-smi

# Expected output:
# /opt/rocm/bin/rocm-smi

4. Run rocm-smi

# Inside worker container
rocm-smi

# Expected output (example):
# ======================= ROCm System Management Interface =======================
# ================================= Concise Info =================================
# GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
# 0    35.0c           20.0W   800Mhz  1000Mhz  0%   auto  203.0W    0%   0%
# ================================================================================
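If rocm-smi output is collected in pipeline logs, the concise info rows can be parsed for automated health checks. A minimal sketch: parse_concise_line is a hypothetical helper, and the column layout is assumed from the example output above (it can differ between rocm-smi versions).

```python
def parse_concise_line(line):
    """Parse one GPU row of rocm-smi concise output into a dict (sketch).

    Assumes the column order shown above:
    GPU  Temp  AvgPwr  SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%
    """
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"unexpected rocm-smi row: {line!r}")
    return {
        "gpu": int(fields[0]),
        "temp_c": float(fields[1].rstrip("c")),
        "power_w": float(fields[2].rstrip("W")),
        "sclk_mhz": int(fields[3].rstrip("Mhz")),
        "mclk_mhz": int(fields[4].rstrip("Mhz")),
        "vram_pct": int(fields[8].rstrip("%")),
        "gpu_pct": int(fields[9].rstrip("%")),
    }
```

For example, the sample row above yields temp_c=35.0 and sclk_mhz=800.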

PyTorch Verification in Concourse Task

Verify GPU access from within a Concourse CI task container.

Discrete GPU Test

jobs:
- name: verify-rocm-discrete
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Check ROCm availability
          rocm-smi
          
          # Check devices
          ls -la /dev/dri/ /dev/kfd
          
          # PyTorch GPU test
          python3 -c "
          import torch
          print('PyTorch version:', torch.__version__)
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          print('GPU count:', torch.cuda.device_count())
          if torch.cuda.is_available():
              print('GPU name:', torch.cuda.get_device_name(0))
              x = torch.rand(5, 3).cuda()
              y = x * 2
              print('GPU computation succeeded!')
              print('Result:', y)
          "

Integrated GPU Test (with HSA_OVERRIDE_GFX_VERSION)

jobs:
- name: verify-rocm-integrated
  plan:
  - task: test-gpu
    tags: [rocm]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: rocm/pytorch
          tag: latest
      run:
        path: sh
        args:
        - -c
        - |
          # Set override for gfx1103 (Phoenix1 APU)
          export HSA_OVERRIDE_GFX_VERSION=11.0.0
          
          # Check GPU architecture
          rocm-smi --showproductname
          
          # PyTorch GPU test
          python3 -c "
          import torch
          print('PyTorch version:', torch.__version__)
          print('CUDA (ROCm) available:', torch.cuda.is_available())
          if torch.cuda.is_available():
              print('GPU name:', torch.cuda.get_device_name(0))
              x = torch.rand(5, 3).cuda()
              y = x * 2
              print('GPU computation succeeded!')
              print('Result:', y)
          else:
              print('ERROR: GPU not detected. Check /dev/kfd and HSA_OVERRIDE_GFX_VERSION.')
          "

HSA_OVERRIDE_GFX_VERSION Workaround

Why It's Needed

Integrated AMD GPUs (APUs) use GFX architectures not officially supported by ROCm. ROCm checks the GPU's GFX version and rejects unsupported versions. The HSA_OVERRIDE_GFX_VERSION environment variable tells ROCm to use compute kernels from a supported architecture instead.

Override Values Table

| GPU Architecture | GFX Version | Override Value | Examples |
| --- | --- | --- | --- |
| Phoenix1 (RDNA 3) | gfx1103 | 11.0.0 | Ryzen 7 7840HS (780M iGPU) |
| Renoir (Zen 2) | gfx90c | 9.0.0 | Ryzen 4000 series (Vega iGPU) |
| Cezanne (Zen 3) | gfx90c | 9.0.0 | Ryzen 5000 series (Vega iGPU) |
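The table above can be encoded as a small lookup so scripts pick the override from the reported GFX version instead of hard-coding it. A sketch: GFX_OVERRIDES and override_for are hypothetical names, and the mapping covers only the APUs documented here.

```python
# Maps the GFX version reported by the GPU to the HSA_OVERRIDE_GFX_VERSION
# value from the table above. Only the APUs documented here are covered.
GFX_OVERRIDES = {
    "gfx1103": "11.0.0",  # Phoenix1 (RDNA 3), e.g. Ryzen 7 7840HS
    "gfx90c": "9.0.0",    # Renoir / Cezanne (Vega iGPU)
}


def override_for(gfx_version):
    """Return the override value, or None for GPUs that need no override."""
    return GFX_OVERRIDES.get(gfx_version)
```

A task wrapper could then set the environment variable only when override_for returns a value.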

How to Use in Concourse Tasks

Add export HSA_OVERRIDE_GFX_VERSION=<value> at the beginning of your task script:

run:
  path: sh
  args:
  - -c
  - |
    # Set override BEFORE importing PyTorch/TensorFlow
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    
    # Your GPU workload
    python3 train.py --use-gpu

Limitations

- Performance is significantly lower than on discrete GPUs: APUs share system memory and run kernels built for a different architecture.
- The workaround is experimental; use it for development and testing only, not production ML training.
- The override value must correspond to an architecture close to your GPU's actual GFX version; an incorrect value can cause crashes or wrong results.

Testing Override on Host

# Test integrated GPU with Docker before deploying pipeline
docker run --rm -it --device=/dev/kfd --device=/dev/dri \
  rocm/pytorch:latest sh -c "
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    python3 -c 'import torch; print(torch.cuda.is_available()); x = torch.rand(5,3).cuda(); print(x * 2)'
  "

Common Issues and Solutions

Issue: "GPU enabled but no GPU detected"

Symptom: Charm status shows "GPU enabled" but worker doesn't detect GPU.

Causes & Solutions:

- No AMD GPU hardware: verify with lspci | grep -i amd; make sure the GPU is installed and recognized by the host.
- amdgpu driver not loaded: verify with lsmod | grep amdgpu; load it with modprobe amdgpu.
- Missing /dev/dri/ devices: verify with ls -la /dev/dri/; check the driver installation and reboot if necessary.

Issue: "CUDA (ROCm) available: False" in PyTorch

Symptom: torch.cuda.is_available() returns False in task containers.

Causes & Solutions (in order of likelihood):

1. Missing /dev/kfd (most common): verify with juju ssh worker/0 -- ls -la /dev/kfd; fix with lxc config device add <container> kfd unix-char source=/dev/kfd path=/dev/kfd.
2. Integrated GPU without override: verify with rocm-smi --showproductname; add export HSA_OVERRIDE_GFX_VERSION=11.0.0 to the task script (adjust the value for your GPU).
3. Wrong LXC GPU passthrough: verify with lxc config device show <container>; use lxc config device add <container> gpu1 gpu id=1 to target the specific AMD GPU instead of a generic gpu device.
4. Unsupported GPU: verify with cat /sys/class/drm/card*/device/uevent | grep PCI_ID; check your GPU against the ROCm GPU support list.
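The likelihood ordering above can be expressed as a small decision helper for a diagnostic script. A sketch under stated assumptions: likely_cause and its boolean arguments are hypothetical, and the caller is expected to gather the facts with the verification commands listed above.

```python
def likely_cause(kfd_present, is_integrated, override_set, amd_gpu_passed):
    """Return the most likely cause of torch.cuda.is_available() == False,
    checked in the order of likelihood given above (sketch)."""
    if not kfd_present:
        return "missing /dev/kfd: add it with lxc config device add ... kfd unix-char"
    if is_integrated and not override_set:
        return "integrated GPU without HSA_OVERRIDE_GFX_VERSION set"
    if not amd_gpu_passed:
        return "wrong LXC GPU passthrough: target the AMD GPU by id"
    return "GPU may be unsupported: check the ROCm support matrix"
```

Checking the causes in this fixed order mirrors how you would debug by hand: rule out the most common failure first.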

Issue: "rocm-smi works but PyTorch doesn't detect GPU"

Symptom: rocm-smi shows GPU info, but PyTorch/TensorFlow can't use GPU.

Root Cause: /dev/kfd is missing or inaccessible.

Explanation: rocm-smi reads monitoring data through sysfs and the DRM devices (/dev/dri/*), so it works without /dev/kfd. Compute workloads go through the ROCm HSA runtime, which enumerates GPUs via the Kernel Fusion Driver; when /dev/kfd is missing, the runtime finds zero compute agents and PyTorch/TensorFlow report no GPU.

Solution:

# Add /dev/kfd device to LXC container
lxc config device add <container-name> kfd unix-char source=/dev/kfd path=/dev/kfd

# Restart worker service
juju ssh worker/0 -- sudo systemctl restart concourse-worker

Issue: "HSA_STATUS_ERROR_OUT_OF_RESOURCES"

Symptom: PyTorch raises HSA_STATUS_ERROR_OUT_OF_RESOURCES exception.

Causes & Solutions:

- Insufficient GPU memory: integrated GPUs share system RAM, so large models or batch sizes can exhaust the available allocation; reduce the batch size or model size.
- Device permissions: the task user cannot fully access /dev/kfd or /dev/dri/renderD*; verify device permissions inside the container.

Issue: Multi-GPU System Detects Wrong GPU

Symptom: Worker detects NVIDIA GPU when AMD GPU is desired (or vice versa).

Cause: Generic lxc config device add ... gpu passes all GPUs to container.

Solution: Use specific GPU ID:

# Query GPU IDs
lxc query /1.0/resources | jq '.gpu.cards[] | {id: .drm.id, driver, vendor_id, product_id}'

# Output example:
# {"id": 0, "driver": "nvidia", "vendor_id": "10de", "product_id": "2484"}
# {"id": 1, "driver": "amdgpu", "vendor_id": "1002", "product_id": "744c"}

# Add specific AMD GPU (id=1)
lxc config device add <container> gpu1 gpu id=1
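Selecting the AMD card id from the lxc query output can be scripted instead of read by eye. A sketch assuming the JSON shape shown above; amd_gpu_id is a hypothetical helper, and the sample payload is the example output from this section.

```python
import json


def amd_gpu_id(resources_json):
    """Return the DRM id of the first card using the amdgpu driver,
    given the output of `lxc query /1.0/resources` (sketch)."""
    data = json.loads(resources_json)
    for card in data["gpu"]["cards"]:
        if card.get("driver") == "amdgpu":
            return card["drm"]["id"]
    return None


# Example with the output shown above:
sample = """
{"gpu": {"cards": [
  {"drm": {"id": 0}, "driver": "nvidia", "vendor_id": "10de", "product_id": "2484"},
  {"drm": {"id": 1}, "driver": "amdgpu", "vendor_id": "1002", "product_id": "744c"}
]}}
"""
print(amd_gpu_id(sample))  # -> 1, so: lxc config device add <container> gpu1 gpu id=1
```

Returning None when no amdgpu card is found lets a wrapper fail loudly instead of passing the wrong GPU through.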

Performance Expectations

Discrete GPU Performance

Discrete GPUs have dedicated VRAM and natively supported ROCm kernels, so they deliver full performance for ML training and inference workloads.

Integrated GPU Performance

Integrated GPUs share system memory and run with override kernels, so expect roughly 3x lower throughput than a discrete GPU (see the benchmark below); they are suitable for development and testing only.

Benchmark Comparison (Example)

| Workload | RX 7900 XT (Discrete) | Ryzen 7 7840HS (Integrated) | Performance Ratio |
| --- | --- | --- | --- |
| PyTorch MNIST Training | 15 seconds | 45 seconds | 3x slower |
| TensorFlow Image Classification | 120 seconds | 380 seconds | 3.2x slower |
| Matrix Multiplication (4096x4096) | 8 ms | 25 ms | 3.1x slower |

Further Reading