Concourse CI Machine Charm

Troubleshooting

Fix common issues with the Concourse CI Machine Charm.

Charm Shows "Blocked" Status

Cause: On web units, this usually means the PostgreSQL relation is missing.

# Fix: Create PostgreSQL relation
juju integrate concourse-ci:postgresql postgresql:database
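
To confirm the relation exists before or after running the fix, the output of juju status --relations can be scanned for a line naming both endpoints. The sketch below is a minimal check under assumptions: has_pg_relation is a hypothetical helper name, and it assumes each relation is printed on a single line that contains both endpoint strings used in the command above.

```shell
# Hypothetical helper (not part of the charm): reads `juju status --relations`
# output on stdin and succeeds if one line mentions both endpoints.
# Assumption: each relation appears on a single line of the output.
has_pg_relation() {
  awk '/concourse-ci:postgresql/ && /postgresql:database/ { found = 1 }
       END { exit found ? 0 : 1 }'
}

# Usage (not run here):
#   juju status --relations | has_pg_relation && echo "relation present"
```

If the check fails, the unit is likely blocked for the reason described above; if it succeeds, look at the charm logs instead.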

Web Server Won't Start

Check Logs

# View charm logs
juju debug-log --include concourse-ci/0 --replay --no-tail | tail -50

# SSH to unit and check systemd
juju ssh concourse-ci/0
sudo journalctl -u concourse-server -f

Common Causes

Workers Not Connecting

# Check worker status
juju ssh concourse-ci/1  # Worker unit
sudo systemctl status concourse-worker
sudo journalctl -u concourse-worker -f

Common Causes

GPU Not Detected

NVIDIA GPU

# Check GPU on host
nvidia-smi

# Check that the GPU device has been added to the LXC container
lxc config device show <container-name>

# Check inside container
lxc exec <container-name> -- nvidia-smi

# If missing, add GPU device
lxc config device add <container-name> gpu0 gpu

AMD GPU

# Verify GPU device AND /dev/kfd exist
lxc exec <container-name> -- ls -la /dev/dri/
lxc exec <container-name> -- ls -la /dev/kfd

# Add missing /dev/kfd (REQUIRED for ROCm compute)
lxc config device add <container-name> kfd unix-char \
  source=/dev/kfd path=/dev/kfd

# For integrated GPUs, set the override in the pipeline task
export HSA_OVERRIDE_GFX_VERSION=11.0.0

Shared Storage Issues

"Waiting for shared storage mount"

# Run setup script to mount storage
./scripts/setup-shared-storage.sh <app-name> /path/to/shared

# Verify mount exists
juju ssh <unit> -- mount | grep concourse

Permission Denied in Shared Storage

# Ensure LXC device uses shift=true for UID/GID mapping
lxc config device show <container-name>

# Should show: shift: "true"
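
The shift check can be scripted rather than read by eye. This is a rough sketch, assuming the device listing is YAML with one indented key per line; has_shift_enabled is a hypothetical helper name, and it does not tell you which device carries the setting.

```shell
# Hypothetical helper: reads `lxc config device show <container-name>` output
# on stdin and succeeds if any device has shift set to true.
# Assumption: the value appears as an indented `shift: "true"` line.
has_shift_enabled() {
  grep -Eq '^[[:space:]]+shift:[[:space:]]*"?true"?[[:space:]]*$'
}

# Usage (not run here):
#   lxc config device show <container-name> | has_shift_enabled \
#     || echo "no device has shift enabled"
```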

Tasks Failing to Start

Missing Image Resource

This charm uses the containerd runtime. Every task must include an image_resource:

# ❌ Wrong - no image_resource
config:
  platform: linux
  run:
    path: echo
    args: ["hello"]

# ✅ Correct - with image_resource
config:
  platform: linux
  image_resource:
    type: registry-image
    source:
      repository: busybox
  run:
    path: echo
    args: ["hello"]
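
A quick way to catch the missing key before a pipeline is set is to scan task files for it. The sketch below is a plain-text check, not a YAML parser, so unusual formatting can fool it; has_image_resource is a hypothetical helper name and the file path in the usage line is a placeholder.

```shell
# Hypothetical helper: succeeds if the given task file contains an
# image_resource key at the start of a line (ignoring indentation).
# Commented-out keys such as "# image_resource:" do not count.
has_image_resource() {
  grep -Eq '^[[:space:]]*image_resource:' "$1"
}

# Usage (not run here):
#   has_image_resource path/to/task.yml || echo "task has no image_resource"
```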

Container Pull Failures

# Check containerd status
juju ssh <worker-unit>
sudo systemctl status containerd

# Check DNS configuration
cat /etc/resolv.conf

# Enable containerd DNS proxy if needed
juju config <app> containerd-dns-proxy-enable=true
juju config <app> containerd-dns-server="8.8.8.8,1.1.1.1"
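
When comparing the worker's resolvers with the values passed to containerd-dns-server, it can help to print only the nameserver entries. A minimal sketch, assuming the standard resolv.conf layout; list_nameservers is a hypothetical helper name.

```shell
# Hypothetical helper: prints one nameserver address per line from a
# resolv.conf-style file (defaults to /etc/resolv.conf).
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage on a worker unit (not run here):
#   list_nameservers
```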

Folder Mounts Not Working

Folders Not Visible in Tasks

# 1. Path MUST be under /srv
# ✅ Correct
path=/srv/datasets

# ❌ Wrong - not under /srv
path=/mnt/datasets

# 2. Verify LXC mount exists
lxc config device show <container-name>
lxc exec <container-name> -- ls -la /srv/
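
The /srv rule is easy to check in a script before a folder is mounted. The sketch below is a string check only, under the assumption that an absolute path is given; is_under_srv is a hypothetical helper name, and it does not resolve symlinks or ".." components.

```shell
# Hypothetical helper: succeeds if the path is a non-empty location below /srv.
# Pure string match; "/srv" and "/srv/" alone are rejected.
is_under_srv() {
  case "$1" in
    /srv/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not run here):
#   is_under_srv /srv/datasets && echo "ok"
```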

Cannot Write to Writable Folder

# Folder name MUST end with _writable or _rw
# ✅ Correct
path=/srv/outputs_writable

# ❌ Wrong - missing suffix
path=/srv/outputs

# LXC device must not be readonly
lxc config device show <container-name>
# Should NOT show: readonly: "true" for writable folders
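
The naming rule for writable folders can be checked the same way. A minimal sketch, assuming the suffix applies to the last path component; is_writable_folder is a hypothetical helper name, and a single trailing slash is ignored.

```shell
# Hypothetical helper: succeeds if the folder name ends with _writable or _rw.
# One trailing slash, if present, is stripped before matching.
is_writable_folder() {
  case "${1%/}" in
    *_writable|*_rw) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not run here):
#   is_writable_folder /srv/outputs || echo "folder will be mounted read-only"
```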

Upgrade Failed

# Roll back to the previous version
juju config concourse-ci version=<previous-version>

# Check error logs (--no-tail makes the command exit instead of streaming)
juju debug-log --include concourse-ci --replay --no-tail | grep -i error

# Verify database connectivity
juju ssh concourse-ci/0
sudo journalctl -u concourse-server | grep -i database

View Configuration

# Check all config values
juju config concourse-ci

# View runtime config file
juju ssh concourse-ci/0
sudo cat /var/lib/concourse/config.env

Check Service Status

# Web server
juju ssh concourse-ci/0
sudo systemctl status concourse-server
sudo journalctl -u concourse-server -n 100

# Worker
juju ssh concourse-ci/1
sudo systemctl status concourse-worker
sudo journalctl -u concourse-worker -n 100

# Containerd (on workers)
sudo systemctl status containerd

Database Connection Issues

# Verify PostgreSQL relation
juju status --relations

# Check database credentials (stored in Juju secrets)
juju ssh concourse-ci/0
sudo grep DATABASE /var/lib/concourse/config.env

# Test connection manually
sudo -u concourse psql <connection-string>

Get Help

If issues persist:

Related Documentation