Mounting Datasets for ML Tasks
Learn how to automatically inject datasets into your Concourse CI GPU tasks using the OCI runtime wrapper
What you'll learn:
- How the automatic dataset injection system works
- Setting up LXC disk mounts for datasets
- Accessing datasets in your ML pipelines
- Testing and verifying dataset availability
Prerequisites:
- Completed the GPU Workers tutorial
- A GPU-enabled worker deployed and running
- Dataset files stored on your host machine
How It Works
The charm includes an intelligent OCI runtime wrapper that automatically discovers and injects all folders under /srv into every task container. This means:
- ✅ Zero configuration in pipelines - no need to modify your YAML files
- ✅ Automatic discovery - any folder mounted to /srv/* becomes available
- ✅ Read-only by default - prevents accidental data corruption
- ✅ Works with GPU and non-GPU workers
Host Machine                  LXC Container                 Task Container
└── /data/datasets/           └── /srv/datasets/            └── /srv/datasets/
    ├── training/                 ├── training/                 ├── training/
    ├── validation/               ├── validation/               ├── validation/
    └── test/                     └── test/                     └── test/

OCI Wrapper (runc-gpu-wrapper):
• Discovers /srv/*
• Injects bind mounts
• Sets read-only  ────────────────────────────────────────> Auto-mounted!
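Conceptually, the wrapper rewrites the task's OCI config.json to add bind mounts before delegating to the real runtime. The sketch below is a hypothetical illustration of that step only (the actual wrapper ships with the charm and may differ); `inject_srv_mounts` is a made-up name, and the `_writable`/`_rw` convention mirrors the behavior described later in this tutorial.

```python
import json
import os

def inject_srv_mounts(config_path, srv_root="/srv"):
    """Add a bind mount for every directory under srv_root to an OCI spec.

    Hypothetical sketch of what an OCI runtime wrapper could do; not the
    charm's actual implementation.
    """
    with open(config_path) as f:
        spec = json.load(f)
    mounts = spec.setdefault("mounts", [])
    for name in sorted(os.listdir(srv_root)):
        src = os.path.join(srv_root, name)
        if not os.path.isdir(src):
            continue
        # Directories named *_writable or *_rw stay read-write; all others
        # get the "ro" option so tasks cannot modify the data.
        writable = name.endswith(("_writable", "_rw"))
        options = ["rbind"] + ([] if writable else ["ro"])
        mounts.append({
            "destination": src,  # same path inside the task container
            "type": "bind",
            "source": src,
            "options": options,
        })
    with open(config_path, "w") as f:
        json.dump(spec, f, indent=2)
```

After this rewrite, the wrapper would exec the real runc with the modified bundle, which is why no pipeline YAML changes are needed.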
Step-by-Step Setup
1 Prepare your dataset directory on the host
# Create a directory structure for your datasets
mkdir -p /data/ml-datasets/imagenet
mkdir -p /data/ml-datasets/validation
# Copy your dataset files
cp -r /path/to/your/training-data/* /data/ml-datasets/imagenet/
# Verify permissions (datasets should be readable)
chmod -R a+rX /data/ml-datasets/
Note: a directory mounted to /srv/datasets inside the container will be available at that exact path in your tasks.
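If you prefer to check readability programmatically rather than eyeballing `ls` output, a small script can flag entries that are not world-readable. This is a minimal sketch (the helper name `unreadable_entries` is made up for illustration); it checks the same `a+rX` semantics the chmod above sets.

```python
import os
import stat

def unreadable_entries(root):
    """Return paths under root that lack world-read (and, for dirs, world-execute)."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            p = os.path.join(dirpath, d)
            mode = os.stat(p).st_mode
            # Directories need both r and x for "other" so tasks can
            # list and enter them.
            if not (mode & stat.S_IROTH and mode & stat.S_IXOTH):
                bad.append(p)
        for fname in filenames:
            p = os.path.join(dirpath, fname)
            # Files only need world-read.
            if not os.stat(p).st_mode & stat.S_IROTH:
                bad.append(p)
    return bad

# Usage on the host:
# print(unreadable_entries('/data/ml-datasets'))
```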
2 Find your GPU worker's LXC container name
# List all LXC containers
lxc list
# Look for containers with "juju" prefix
# Example output:
# +----------------+---------+
# | NAME | STATE |
# +----------------+---------+
# | juju-abc123-4 | RUNNING | <-- This is your GPU worker
# +----------------+---------+
# Alternative: Get it directly from Juju
juju status gpu-worker --format=json | jq -r '.machines | to_entries[] | .value."instance-id"'
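The same lookup can be done in Python if jq is not installed. This sketch assumes the LXD provider reports the container name in each machine's `instance-id` field of the `juju status --format=json` output; `parse_worker_containers` is a hypothetical helper name.

```python
import json

def parse_worker_containers(status_json):
    """Extract LXC container names (Juju instance IDs) from juju status JSON."""
    status = json.loads(status_json)
    return [
        machine.get("instance-id", "")
        for machine in status.get("machines", {}).values()
        if machine.get("instance-id", "").startswith("juju-")
    ]

# Usage on the host:
# import subprocess
# raw = subprocess.check_output(["juju", "status", "gpu-worker", "--format=json"])
# print(parse_worker_containers(raw))
```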
3 Mount the dataset directory into the LXC container
# Replace <container-name> with your actual container name
# Mount your dataset to /srv/datasets (read-only)
lxc config device add <container-name> datasets disk \
source=/data/ml-datasets \
path=/srv/datasets \
readonly=true
# Example:
lxc config device add juju-abc123-4 datasets disk \
source=/data/ml-datasets \
path=/srv/datasets \
readonly=true
Important: the container path must be under /srv/ for automatic discovery. Use /srv/datasets, /srv/models, /srv/data, etc.
4 Verify the mount inside the container
# Check the LXC device configuration
lxc config device show <container-name>
# Expected output:
# datasets:
# path: /srv/datasets
# readonly: "true"
# source: /data/ml-datasets
# type: disk
# Verify the files are visible inside the container
lxc exec <container-name> -- ls -lah /srv/datasets/
# You should see your dataset files listed
Once mounted, the OCI wrapper automatically injects /srv/datasets into every task container. No pipeline changes needed!
Using Datasets in Pipelines
Now that your datasets are mounted, they're automatically available in any task that runs on your GPU worker. Here's a complete example:
Example 1: PyTorch Training Pipeline
jobs:
- name: train-model
  plan:
  - task: training
    tags: [cuda]  # Target GPU workers
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python3
        args:
        - -c
        - |
          import os
          import torch

          # Dataset is automatically available!
          print("Available datasets:")
          print(os.listdir('/srv/datasets'))

          # Check GPU availability
          print(f"\nGPU Available: {torch.cuda.is_available()}")
          print(f"GPU Device: {torch.cuda.get_device_name(0)}")

          # Your training code here
          # data_dir = '/srv/datasets/imagenet'
          # model = train_model(data_dir)
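Training code usually needs to turn the mounted directory into (sample, label) pairs. As a framework-agnostic sketch, assuming an ImageNet-style layout where each class is a subfolder of the dataset root (`index_dataset` is a made-up helper name, not part of the charm or PyTorch):

```python
import os

def index_dataset(root):
    """Build a sorted list of (file_path, class_name) pairs from class subfolders."""
    samples = []
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        for fname in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, fname), class_name))
    return samples

# In a task this would typically feed a DataLoader:
# samples = index_dataset('/srv/datasets/imagenet')
```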
Example 2: Verification Task
Create a simple task to verify dataset access before running expensive training jobs:
jobs:
- name: verify-datasets
  plan:
  - task: check-datasets
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: ubuntu
          tag: latest
      run:
        path: bash
        args:
        - -c
        - |
          echo "=== Dataset Verification ==="

          # Check mount exists
          if [ -d "/srv/datasets" ]; then
            echo "✅ /srv/datasets exists"
          else
            echo "❌ /srv/datasets not found"
            exit 1
          fi

          # Check read access
          echo -e "\nDataset contents:"
          ls -lah /srv/datasets/

          # Count files
          file_count=$(find /srv/datasets -type f | wc -l)
          echo -e "\nTotal files: $file_count"

          # Verify read-only (the write attempt should fail)
          if touch /srv/datasets/write-test 2>&1 | grep -q "Read-only"; then
            echo "✅ Confirmed read-only mount"
          else
            echo "⚠️ Warning: mount may not be read-only"
          fi

          echo -e "\n✅ Dataset verification passed!"
Example 3: Multi-Dataset Pipeline
If you have multiple dataset directories mounted, access them by their paths:
# On host: Mount multiple datasets
lxc config device add juju-abc123-4 training-data disk \
source=/data/training \
path=/srv/datasets/training \
readonly=true
lxc config device add juju-abc123-4 validation-data disk \
source=/data/validation \
path=/srv/datasets/validation \
readonly=true
jobs:
- name: train-and-validate
  plan:
  - task: ml-workflow
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python3
        args:
        - -c
        - |
          import os

          # All datasets automatically available
          train_data = '/srv/datasets/training'
          val_data = '/srv/datasets/validation'
          print(f"Training samples: {len(os.listdir(train_data))}")
          print(f"Validation samples: {len(os.listdir(val_data))}")

          # Your ML workflow here
Advanced: Writable Output Directories
By default, all mounts are read-only for data safety. If you need to write model outputs, checkpoints, or results, use the _writable or _rw suffix:
# Mount a writable directory for model outputs
lxc config device add juju-abc123-4 models disk \
source=/data/model-outputs \
path=/srv/models_writable \
readonly=false
# The _writable suffix tells the wrapper to mount it read-write
jobs:
- name: train-and-save
  plan:
  - task: training
    tags: [cuda]
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: pytorch/pytorch
          tag: latest
      run:
        path: python3
        args:
        - -c
        - |
          import torch

          # Read from the read-only dataset
          train_data = '/srv/datasets/training'

          # Write to the writable directory
          output_dir = '/srv/models_writable'

          # Train and save (train_model is your own training function)
          model = train_model(train_data)
          torch.save(model.state_dict(), f'{output_dir}/model.pth')
          print(f"Model saved to {output_dir}/model.pth")
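If several jobs can write checkpoints to the same _writable mount, it is worth saving atomically so an interrupted or concurrent task never leaves a half-written file behind. A minimal sketch under that assumption (`save_atomically` is an illustrative name, not part of the charm or PyTorch):

```python
import os
import tempfile

def save_atomically(data: bytes, dest_path: str):
    """Write data to dest_path via temp file + rename, so readers never see a partial file."""
    dest_dir = os.path.dirname(dest_path)
    # Create the temp file in the destination directory so the final
    # rename stays on the same filesystem (required for atomicity).
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit the disk
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise

# With torch you would serialize to a buffer first:
# import io, torch
# buf = io.BytesIO()
# torch.save(model.state_dict(), buf)
# save_atomically(buf.getvalue(), '/srv/models_writable/model.pth')
```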
Troubleshooting
Dataset not visible in tasks
Check 1: Verify LXC mount
lxc config device show <container-name>
lxc exec <container-name> -- ls -la /srv/datasets/
Check 2: Ensure path is under /srv
The OCI wrapper only discovers folders under /srv/. Paths like /mnt/datasets or /data will NOT work.
Check 3: Verify OCI wrapper is installed
juju ssh gpu-worker/0 -- cat /usr/local/bin/runc-gpu-wrapper | grep -A5 "discover.*srv"
Permission denied errors
# Make dataset directory readable by all users
chmod -R a+rX /data/ml-datasets/
# Check ownership (files should be readable)
ls -la /data/ml-datasets/
Mount not updating after changes
If you change dataset contents but tasks see old data:
# Restart the worker to refresh mounts
juju ssh gpu-worker/0 -- sudo systemctl restart concourse-worker
# Or restart the entire container
lxc restart <container-name>
What You've Accomplished
- ✅ How the automatic dataset injection system works
- ✅ Setting up LXC disk mounts for your datasets
- ✅ Accessing datasets in your ML pipelines without configuration
- ✅ Using read-only and writable mounts appropriately
- ✅ Verifying and troubleshooting dataset availability
Next Steps
Now that you can mount datasets, explore more advanced topics:
- General Folder Mounting - Mount any type of folder (caches, configs, artifacts) to any worker
- Set up monitoring - Track GPU utilization and training metrics
- Configuration Reference - Learn about all available options