This guide explains how to mount datasets into Concourse GPU worker tasks using LXC disk devices. The charm automatically makes datasets available to task containers via OCI runtime injection.
We provide a helper script to make mounting datasets easy:
# 1. Mount your dataset
./scripts/mount-datasets.sh gpu-worker /path/to/your/datasets
# 2. Done! Access /srv/datasets in your pipelines
Note: This guide focuses on GPU-specific dataset mounting. For general folder mounting (including writable folders, multiple paths, and non-GPU workers), see the General Folder Mounting Guide.
The GPU worker charm includes an OCI runtime wrapper (runc-gpu-wrapper) that automatically discovers and injects all folders under /srv into every task container. The /srv/datasets folder is treated like any other folder in the automatic discovery system—mounted as read-only by default.
Host Machine
└── /path/to/datasets/
        │
        │  LXC disk device mount
        ▼
LXC Container (GPU worker)
└── /srv/datasets/
        │
        │  OCI wrapper discovery & injection
        │  (automatic for ALL /srv folders)
        ▼
Task Container
└── /srv/datasets/  (read-only)
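For illustration only, the injection step can be sketched as a runc-compatible wrapper that rewrites the task's OCI config before handing off to the real runtime. The sketch below assumes a runc-style command line (--bundle <dir>) and jq available on the worker; the actual runc-gpu-wrapper shipped by the charm may be implemented differently.

#!/usr/bin/env bash
# Minimal sketch of /srv injection (illustrative only; not the charm's actual wrapper)
set -euo pipefail
REAL_RUNC=/usr/bin/runc

# runc is invoked with "--bundle <dir>"; find the bundle directory in the arguments
BUNDLE=""
prev=""
for arg in "$@"; do
  if [ "$prev" = "--bundle" ]; then BUNDLE="$arg"; fi
  prev="$arg"
done

if [ -n "$BUNDLE" ] && [ -f "$BUNDLE/config.json" ]; then
  for dir in /srv/*/; do
    [ -d "$dir" ] || continue
    name=${dir%/}
    # Folders ending in _writable or _rw are injected read-write; everything else read-only
    opts='["rbind","ro"]'
    case "$name" in *_writable|*_rw) opts='["rbind","rw"]' ;; esac
    # Append a bind mount for this folder to the task's OCI spec
    jq --arg p "$name" --argjson o "$opts" \
       '.mounts += [{"destination": $p, "type": "bind", "source": $p, "options": $o}]' \
       "$BUNDLE/config.json" > "$BUNDLE/config.json.tmp" \
      && mv "$BUNDLE/config.json.tmp" "$BUNDLE/config.json"
  done
fi

# Hand off to the real OCI runtime with the original arguments
exec "$REAL_RUNC" "$@"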
As of charm revision 38+, the mounting system has been enhanced to support:

- Automatic discovery and injection of any folder under /srv (not just /srv/datasets)
- Read-write mounts for folders named with a _writable or _rw suffix (see the example below)

For non-dataset use cases (writable folders, multiple paths, non-GPU workers), see general-mounting.md.
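For example, assuming the suffix convention above, a writable scratch folder could be exposed by giving its mount path an _rw suffix. The device name scratch and the paths here are placeholders:

# Writable scratch space: the _rw suffix signals the wrapper to inject it read-write
lxc config device add <container-name> scratch disk \
source=/path/to/scratch \
path=/srv/scratch_rw \
readonly=false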
# Deploy GPU worker with the charm
juju deploy concourse-ci-machine gpu-worker \
--config mode=worker \
--config enable-gpu=true
# Find the container name for your GPU worker
juju status gpu-worker
# The machine ID will be something like "4"
# The LXC container name will be: juju-<model-id>-<machine-id>
# Example: juju-e16396-4
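If you prefer to resolve the container name on the command line, one way to do it (assuming a single gpu-worker unit running on a machine-level LXD container, and jq installed) is:

# Look up the unit's machine ID, then match it against the LXC container names
MACHINE_ID=$(juju status gpu-worker --format=json | jq -r '.applications["gpu-worker"].units[].machine' | head -n1)
lxc list --format csv -c n | grep -- "-${MACHINE_ID}$"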
# Add a disk device to the LXC container
lxc config device add <container-name> datasets disk \
source=/path/to/your/datasets \
path=/srv/datasets \
readonly=true
# Example:
lxc config device add juju-e16396-4 datasets disk \
source=/home/user/ml-datasets \
path=/srv/datasets \
readonly=true
# Check the device is configured
lxc config device show <container-name>
# Verify mount inside container
lxc exec <container-name> -- ls -lah /srv/datasets/
# Expected output: Your dataset files should be visible
The charm’s OCI wrapper automatically detects /srv/datasets and injects it into every GPU task container. No pipeline changes needed!
Tasks tagged with [gpu] automatically have /srv/datasets mounted:
jobs:
  - name: train-model
    plan:
      - task: training
        tags: [gpu]  # GPU workers automatically get /srv/datasets
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: pytorch/pytorch
              tag: latest
          run:
            path: python
            args:
              - -c
              - |
                # Dataset is automatically available
                import os
                print(f"Datasets: {os.listdir('/srv/datasets')}")

                # Read data
                with open('/srv/datasets/training-data.csv') as f:
                    data = f.read()
Create a simple test file:
# On host machine
echo "Dataset test successful!" > /path/to/datasets/test.txt
# Add to LXC container
lxc config device add <container-name> datasets disk \
source=/path/to/datasets \
path=/srv/datasets \
readonly=true
Run verification pipeline:
jobs:
  - name: verify-dataset
    plan:
      - task: check-mount
        tags: [gpu]
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: ubuntu
              tag: latest
          run:
            path: bash
            args:
              - -c
              - |
                echo "=== Dataset Verification ==="

                # Check mount exists
                if [ -d "/srv/datasets" ]; then
                  echo "✅ /srv/datasets exists"
                else
                  echo "❌ /srv/datasets not found"
                  exit 1
                fi

                # Check read access
                ls -lah /srv/datasets/

                # Check test file
                if cat /srv/datasets/test.txt; then
                  echo "✅ Dataset read successful"
                else
                  echo "❌ Cannot read dataset"
                  exit 1
                fi

                # Verify read-only
                if touch /srv/datasets/write-test 2>&1 | grep -q "Read-only"; then
                  echo "✅ Confirmed read-only mount"
                else
                  echo "⚠️ Warning: Mount may not be read-only"
                fi
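One way to run this check (assuming a fly target named local and the pipeline saved as verify-dataset.yaml; adjust names to your setup) is to set it as a pipeline and trigger the job:

fly -t local set-pipeline -p verify-dataset -c verify-dataset.yaml
fly -t local unpause-pipeline -p verify-dataset
fly -t local trigger-job -j verify-dataset/verify-dataset --watch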
You can mount multiple dataset directories to different paths:
# Training datasets
lxc config device add <container-name> training-data disk \
source=/path/to/training-data \
path=/srv/datasets/training \
readonly=true
# Validation datasets
lxc config device add <container-name> validation-data disk \
source=/path/to/validation-data \
path=/srv/datasets/validation \
readonly=true
# Model checkpoints
# Note: readonly=false makes the folder writable inside the LXC container; per the
# discovery rules above, it may also need a _writable or _rw suffix to be writable
# inside task containers (see general-mounting.md).
lxc config device add <container-name> checkpoints disk \
source=/path/to/checkpoints \
path=/srv/models \
readonly=false
# Check that the LXC disk device is configured
lxc config device show <container-name>

# Check the mount inside the LXC container
lxc exec <container-name> -- ls -la /srv/datasets/

# Inspect the OCI wrapper script; its output should include the dataset mount injection logic
lxc exec <container-name> -- cat /usr/local/bin/runc-gpu-wrapper

# Check the charm logs
juju debug-log --include gpu-worker/0 --lines 100
If you see permission errors:
# Make dataset directory readable
chmod -R a+rX /path/to/datasets/
# Verify ownership
ls -la /path/to/datasets/
If you change the dataset content but tasks see old data:
# Restart the worker to refresh mounts
juju run gpu-worker/0 restart
# Or restart the container
lxc restart <container-name>
# 1. Organize datasets on host
mkdir -p /data/ml-datasets/{training,validation,test}
cp your-data.csv /data/ml-datasets/training/
# 2. Deploy GPU worker
juju deploy concourse-ci-machine gpu-worker \
--config mode=worker \
--config enable-gpu=true
# 3. Wait for deployment
juju status --watch 1s
# 4. Mount datasets
# Note: LXC container names follow juju-<model-id>-<machine-id>, not the application name
MACHINE_ID=$(juju status gpu-worker --format=json | jq -r '.applications["gpu-worker"].units[].machine' | head -n1)
CONTAINER=$(lxc list --format csv -c n | grep -- "-${MACHINE_ID}$")
lxc config device add $CONTAINER datasets disk \
source=/data/ml-datasets \
path=/srv/datasets \
readonly=true
# 5. Mount model output directory (writable; create it first)
mkdir -p /data/ml-models
lxc config device add $CONTAINER models disk \
source=/data/ml-models \
path=/srv/models \
readonly=false
# 6. Verify setup
fly -t local execute -c verify-datasets.yaml --tag gpu
The automatic dataset mounting integrates seamlessly with your CI/CD workflows:
jobs:
  - name: model-training-pipeline
    plan:
      # 1. Get code from repository
      - get: ml-code
        trigger: true

      # 2. Train model with GPU and datasets
      - task: train
        tags: [gpu]
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: pytorch/pytorch
              tag: latest
          inputs:
            - name: ml-code
          outputs:
            - name: trained-model
          run:
            path: bash
            args:
              - -c
              - |
                # Code is in ml-code/
                # Datasets automatically available in /srv/datasets/
                # The trained model is written to the declared trained-model/ output
                cd ml-code
                python train.py \
                  --data /srv/datasets/training \
                  --output ../trained-model/model.pth

      # 3. Validate model
      - task: validate
        tags: [gpu]
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: pytorch/pytorch
              tag: latest
          inputs:
            - name: trained-model
          run:
            path: python
            args:
              - -c
              - |
                # Validation data automatically available in /srv/datasets/
                import torch
                model = torch.load('trained-model/model.pth')
                # Run your project's validation routine here, e.g.
                # validate(model, '/srv/datasets/validation')
Use environment variables to select different datasets per pipeline run:
jobs:
  - name: train-with-dataset
    plan:
      - task: training
        tags: [gpu]
        params:
          DATASET_NAME: "imagenet-subset-v2"
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: pytorch/pytorch
              tag: latest
          run:
            path: bash
            args:
              - -c
              - |
                DATASET_PATH="/srv/datasets/${DATASET_NAME}"
                if [ ! -d "$DATASET_PATH" ]; then
                  echo "Error: Dataset $DATASET_NAME not found"
                  exit 1
                fi
                echo "Training with dataset: $DATASET_NAME"
                python train.py --data "$DATASET_PATH"
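The DATASET_NAME above is hardcoded. If you want to choose the dataset when configuring the pipeline, one option (standard Concourse var syntax, not something the charm requires; the target, pipeline, and file names below are placeholders) is to template the param and supply it at set-pipeline time:

params:
  DATASET_NAME: ((dataset_name))

fly -t local set-pipeline -p train-with-dataset -c pipeline.yaml --var dataset_name=imagenet-subset-v2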
The dataset mounting system provides:
✅ Automatic Injection: No pipeline modifications required
✅ Secure by Default: Read-only mounts prevent accidental data corruption
✅ Flexible Configuration: Mount multiple datasets to different paths
✅ High Performance: Direct access to host storage
✅ Simple Setup: Just configure the LXC device and it works
For questions or issues, check the troubleshooting section or open an issue on GitHub.