Mounting Datasets for ML Tasks
Learn how to automatically inject datasets into your Concourse CI GPU tasks using the OCI runtime wrapper
What you'll learn:
- How the automatic dataset injection system works
- Setting up LXC disk mounts for datasets
- Accessing datasets in your ML pipelines
- Testing and verifying dataset availability
Prerequisites:
- Completed the GPU Workers tutorial
- A GPU-enabled worker deployed and running
- Dataset files stored on your host machine
How It Works
The charm includes an intelligent OCI runtime wrapper that automatically discovers and injects all folders under /srv into every task container. This means:
- ✅ Zero configuration in pipelines - no need to modify your YAML files
- ✅ Automatic discovery - any folder mounted to /srv/* becomes available
- ✅ Read-only by default - prevents accidental data corruption
- ✅ Works with GPU and non-GPU workers
Host Machine                  LXC Container                 Task Container
└── /data/datasets/           └── /srv/datasets/            └── /srv/datasets/
    ├── training/                 ├── training/                 ├── training/
    ├── validation/               ├── validation/               ├── validation/
    └── test/                     └── test/                     └── test/

OCI Wrapper (runc-gpu-wrapper):
• Discovers /srv/*
• Injects bind mounts
• Sets read-only  ────────────────────────────────────────> Auto-mounted!
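Conceptually, the wrapper rewrites the task's OCI config.json to add bind mounts before delegating to the real runtime. The sketch below is a hypothetical illustration of that step only (the actual wrapper ships with the charm and may differ); `inject_srv_mounts` is a made-up name, and the `_writable`/`_rw` convention mirrors the behavior described later in this tutorial.

```python
import json
import os

def inject_srv_mounts(config_path, srv_root="/srv"):
    """Add a bind mount for every directory under srv_root to an OCI spec.

    Hypothetical sketch of what an OCI runtime wrapper could do; not the
    charm's actual implementation.
    """
    with open(config_path) as f:
        spec = json.load(f)
    mounts = spec.setdefault("mounts", [])
    for name in sorted(os.listdir(srv_root)):
        src = os.path.join(srv_root, name)
        if not os.path.isdir(src):
            continue
        # Directories named *_writable or *_rw stay read-write; all others
        # get the "ro" option so tasks cannot modify the data.
        writable = name.endswith(("_writable", "_rw"))
        options = ["rbind"] + ([] if writable else ["ro"])
        mounts.append({
            "destination": src,  # same path inside the task container
            "type": "bind",
            "source": src,
            "options": options,
        })
    with open(config_path, "w") as f:
        json.dump(spec, f, indent=2)
```

After this rewrite, the wrapper would exec the real runc with the modified bundle, which is why no pipeline YAML changes are needed.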
Step-by-Step Setup
1 Prepare your dataset directory on the host
# Create a directory structure for your datasets
mkdir -p /data/ml-datasets/imagenet
mkdir -p /data/ml-datasets/validation
# Copy your dataset files
cp -r /path/to/your/training-data/* /data/ml-datasets/imagenet/
# Verify permissions (datasets should be readable)
chmod -R a+rX /data/ml-datasets/
Note: a directory mounted to /srv/datasets inside the container will be available at that exact path in your tasks.
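If you prefer to check readability programmatically rather than eyeballing `ls` output, a small script can flag entries that are not world-readable. This is a minimal sketch (the helper name `unreadable_entries` is made up for illustration); it checks the same `a+rX` semantics the chmod above sets.

```python
import os
import stat

def unreadable_entries(root):
    """Return paths under root that lack world-read (and, for dirs, world-execute)."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            p = os.path.join(dirpath, d)
            mode = os.stat(p).st_mode
            # Directories need both r and x for "other" so tasks can
            # list and enter them.
            if not (mode & stat.S_IROTH and mode & stat.S_IXOTH):
                bad.append(p)
        for fname in filenames:
            p = os.path.join(dirpath, fname)
            # Files only need world-read.
            if not os.stat(p).st_mode & stat.S_IROTH:
                bad.append(p)
    return bad

# Usage on the host:
# print(unreadable_entries('/data/ml-datasets'))
```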
2 Find your GPU worker's LXC container name
# List all LXC containers
lxc list
# Look for containers with "juju" prefix
# Example output:
# +----------------+---------+
# | NAME | STATE |
# +----------------+---------+
# | juju-abc123-4 | RUNNING | <-- This is your GPU worker
# +----------------+---------+
# Alternative: Get it directly from Juju
juju status gpu-worker --format=json | jq -r '.machines | to_entries[] | .value."instance-id"'
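The same lookup can be done in Python if jq is not installed. This sketch assumes the LXD provider reports the container name in each machine's `instance-id` field of the `juju status --format=json` output; `parse_worker_containers` is a hypothetical helper name.

```python
import json

def parse_worker_containers(status_json):
    """Extract LXC container names (Juju instance IDs) from juju status JSON."""
    status = json.loads(status_json)
    return [
        machine.get("instance-id", "")
        for machine in status.get("machines", {}).values()
        if machine.get("instance-id", "").startswith("juju-")
    ]

# Usage on the host:
# import subprocess
# raw = subprocess.check_output(["juju", "status", "gpu-worker", "--format=json"])
# print(parse_worker_containers(raw))
```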
3 Mount the dataset directory into the LXC container
# Replace <container-name> with your actual container name
# Mount your dataset to /srv/datasets (read-only)
lxc config device add <container-name> datasets disk \
source=/data/ml-datasets \
path=/srv/datasets \
readonly=true
# Example:
lxc config device add juju-abc123-4 datasets disk \
source=/data/ml-datasets \
path=/srv/datasets \
readonly=true
Important: the container path must be under /srv/ for automatic discovery. Use /srv/datasets, /srv/models, /srv/data, etc.
4 Verify the mount inside the container
# Check the LXC device configuration
lxc config device show <container-name>
# Expected output:
# datasets:
# path: /srv/datasets
# readonly: "true"
# source: /data/ml-datasets
# type: disk
# Verify the files are visible inside the container
lxc exec <container-name> -- ls -lah /srv/datasets/
# You should see your dataset files listed
Once mounted, the OCI wrapper automatically injects /srv/datasets into every task container. No pipeline changes needed!
Using Datasets in Pipelines
Now that your datasets are mounted, they're automatically available in any task that runs on your GPU worker. Here's a complete example:
Example 1: PyTorch Training Pipeline
jobs:
- name: train-model
  plan:
  - task: training
    tags: [cuda]  # Target GPU workers
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python3
        args:
        - -c
        - |
          import os
          import torch

          # Dataset is automatically available!
          print("Available datasets:")
          print(os.listdir('/srv/datasets'))

          # Check GPU availability
          print(f"\nGPU Available: {torch.cuda.is_available()}")
          print(f"GPU Device: {torch.cuda.get_device_name(0)}")

          # Your training code here
          # data_dir = '/srv/datasets/imagenet'
          # model = train_model(data_dir)
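Training code usually needs to turn the mounted directory into (sample, label) pairs. As a framework-agnostic sketch, assuming an ImageNet-style layout where each class is a subfolder of the dataset root (`index_dataset` is a made-up helper name, not part of the charm or PyTorch):

```python
import os

def index_dataset(root):
    """Build a sorted list of (file_path, class_name) pairs from class subfolders."""
    samples = []
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        for fname in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, fname), class_name))
    return samples

# In a task this would typically feed a DataLoader:
# samples = index_dataset('/srv/datasets/imagenet')
```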
Example 2: Verification Task
Create a simple task to verify dataset access before running expensive training jobs:
jobs:
- name: verify-datasets
  plan:
  - task: check-datasets
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: ubuntu
          tag: latest
      run:
        path: bash
        args:
        - -c
        - |
          echo "=== Dataset Verification ==="

          # Check mount exists
          if [ -d "/srv/datasets" ]; then
            echo "✅ /srv/datasets exists"
          else
            echo "❌ /srv/datasets not found"
            exit 1
          fi

          # Check read access
          echo -e "\nDataset contents:"
          ls -lah /srv/datasets/

          # Count files
          file_count=$(find /srv/datasets -type f | wc -l)
          echo -e "\nTotal files: $file_count"

          # Verify read-only (the write attempt should fail)
          if touch /srv/datasets/write-test 2>&1 | grep -q "Read-only"; then
            echo "✅ Confirmed read-only mount"
          else
            echo "⚠️ Warning: mount may not be read-only"
          fi

          echo -e "\n✅ Dataset verification passed!"
Example 3: Multi-Dataset Pipeline
If you have multiple dataset directories mounted, access them by their paths:
# On host: Mount multiple datasets
lxc config device add juju-abc123-4 training-data disk \
source=/data/training \
path=/srv/datasets/training \
readonly=true
lxc config device add juju-abc123-4 validation-data disk \
source=/data/validation \
path=/srv/datasets/validation \
readonly=true
jobs:
- name: train-and-validate
  plan:
  - task: ml-workflow
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python3
        args:
        - -c
        - |
          import os

          # All datasets automatically available
          train_data = '/srv/datasets/training'
          val_data = '/srv/datasets/validation'
          print(f"Training samples: {len(os.listdir(train_data))}")
          print(f"Validation samples: {len(os.listdir(val_data))}")

          # Your ML workflow here
Advanced: Writable Output Directories
By default, all mounts are read-only for data safety. If you need to write model outputs, checkpoints, or results, use the _writable or _rw suffix:
# Mount a writable directory for model outputs
lxc config device add juju-abc123-4 models disk \
source=/data/model-outputs \
path=/srv/models_writable \
readonly=false
# The _writable suffix tells the wrapper to mount it read-write
jobs:
- name: train-and-save
  plan:
  - task: training
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python3
        args:
        - -c
        - |
          import torch

          # Read from the read-only dataset
          train_data = '/srv/datasets/training'

          # Write to the writable directory
          output_dir = '/srv/models_writable'

          # Train and save (train_model is your own training function)
          model = train_model(train_data)
          torch.save(model.state_dict(), f'{output_dir}/model.pth')
          print(f"Model saved to {output_dir}/model.pth")
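If several jobs can write checkpoints to the same _writable mount, it is worth saving atomically so an interrupted or concurrent task never leaves a half-written file behind. A minimal sketch under that assumption (`save_atomically` is an illustrative name, not part of the charm or PyTorch):

```python
import os
import tempfile

def save_atomically(data: bytes, dest_path: str):
    """Write data to dest_path via temp file + rename, so readers never see a partial file."""
    dest_dir = os.path.dirname(dest_path)
    # Create the temp file in the destination directory so the final
    # rename stays on the same filesystem (required for atomicity).
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit the disk
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise

# With torch you would serialize to a buffer first:
# import io, torch
# buf = io.BytesIO()
# torch.save(model.state_dict(), buf)
# save_atomically(buf.getvalue(), '/srv/models_writable/model.pth')
```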
Troubleshooting
Dataset not visible in tasks
Check 1: Verify LXC mount
lxc config device show <container-name>
lxc exec <container-name> -- ls -la /srv/datasets/
Check 2: Ensure path is under /srv
The OCI wrapper only discovers folders under /srv/. Paths like /mnt/datasets or /data will NOT work.
Check 3: Verify OCI wrapper is installed
juju ssh gpu-worker/0 -- cat /usr/local/bin/runc-gpu-wrapper | grep -A5 "discover.*srv"
Permission denied errors
# Make dataset directory readable by all users
chmod -R a+rX /data/ml-datasets/
# Check ownership (files should be readable)
ls -la /data/ml-datasets/
Mount not updating after changes
If you change dataset contents but tasks see old data:
# Restart the worker to refresh mounts
juju ssh gpu-worker/0 -- sudo systemctl restart concourse-worker
# Or restart the entire container
lxc restart <container-name>
What You've Accomplished
- ✅ How the automatic dataset injection system works
- ✅ Setting up LXC disk mounts for your datasets
- ✅ Accessing datasets in your ML pipelines without configuration
- ✅ Using read-only and writable mounts appropriately
- ✅ Verifying and troubleshooting dataset availability
Next Steps
Now that you can mount datasets, explore more advanced topics:
- General Folder Mounting - Mount any type of folder (caches, configs, artifacts) to any worker
- Set up monitoring - Track GPU utilization and training metrics
- Configuration Reference - Learn about all available options