Understanding Container Runtime
Why containerd (not Docker), why workers run as root, and the container isolation model
Why Containerd, Not Docker?
The charm uses containerd as the container runtime, not Docker. This is a deliberate choice made by Concourse CI upstream, not this charm. Understanding why requires understanding the container ecosystem evolution:
The Container Stack
```
┌────────────────────────┐
│ Concourse Worker       │ ← Our charm installs this
├────────────────────────┤
│ containerd             │ ← Container runtime (charm installs)
├────────────────────────┤
│ runc                   │ ← OCI runtime (spawns containers)
├────────────────────────┤
│ Linux Kernel           │ ← Namespaces, cgroups
│ (namespaces, cgroups)  │
└────────────────────────┘
```
Docker, by contrast, adds additional layers:
```
┌────────────────────────┐
│ docker CLI             │ ← User-facing tool
├────────────────────────┤
│ dockerd                │ ← Docker daemon (image management, networking)
├────────────────────────┤
│ containerd             │ ← Actually runs containers
├────────────────────────┤
│ runc                   │ ← OCI runtime
└────────────────────────┘
```
Why Containerd is Better for Concourse
| Requirement | containerd | Docker |
|---|---|---|
| Lightweight | ✅ ~50MB binary, minimal daemon | ❌ ~200MB, heavyweight daemon with many features Concourse doesn't need |
| OCI-compliant | ✅ Native OCI support | ✅ Via containerd backend |
| No unnecessary features | ✅ Just container lifecycle management | ❌ Docker Swarm, Docker Compose, legacy image formats |
| Kubernetes-compatible | ✅ Standard CRI runtime for Kubernetes | ❌ dockershim deprecated in K8s 1.20, removed in 1.24 |
| Direct control | ✅ Concourse talks directly to containerd | ❌ Extra layer of indirection |
What Concourse Doesn't Need from Docker
- docker CLI: Workers are API-driven, not user-interactive
- Docker Compose: Pipelines define multi-container workflows, not Compose files
- Docker Swarm: Concourse is its own orchestrator
- BuildKit: pipelines that build images can drive BuildKit directly (e.g. via the `oci-build-task`) rather than going through dockerd
- Legacy image formats: Modern images are OCI-compliant
Why Workers Run as Root
The concourse-worker systemd service runs as root, not an unprivileged user. This seems surprising given security best practices, but it's necessary for several reasons:
Technical Requirements
| Capability Needed | Why Root is Required |
|---|---|
| Create user namespaces | Containers need isolated UID/GID spaces. Requires CAP_SYS_ADMIN. |
| Mount filesystems | Task containers need bind mounts for caches, inputs, outputs. Requires CAP_SYS_ADMIN. |
| Manage cgroups | Resource limits (CPU, memory) enforced via cgroups. Requires root or CAP_SYS_ADMIN. |
| Network namespaces | Isolated networking per container. Requires CAP_NET_ADMIN. |
| Device access | GPUs, block devices need /dev access. Requires root or device ownership. |
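To make the cgroup row concrete: enforcing a task's resource limits means writing values into files under the task's cgroup directory, which is why the worker needs root (or `CAP_SYS_ADMIN`) on the host. A minimal sketch of that file layout, assuming cgroup v2 and a hypothetical cgroup name; the real worker delegates this to containerd/runc:

```python
# Sketch: the files a runtime writes to enforce cgroup v2 limits.
# Hypothetical helper -- the real worker delegates this to containerd/runc.

def cgroup_limit_files(cgroup_name, cpu_cores, memory_bytes):
    """Return {path: value} pairs for CPU and memory limits (cgroup v2)."""
    base = f"/sys/fs/cgroup/{cgroup_name}"
    period = 100_000                     # scheduler period in microseconds
    quota = int(cpu_cores * period)      # CPU time per period
    return {
        f"{base}/cpu.max": f"{quota} {period}",   # "200000 100000" = 2 cores
        f"{base}/memory.max": str(memory_bytes),  # hard memory ceiling
    }

# A task capped at 2 CPUs and 1 GiB:
limits = cgroup_limit_files("concourse/task-abc", 2, 1024**3)
```

Writing to these files is rejected for unprivileged processes unless the cgroup subtree has been explicitly delegated, which is part of why rootless mode (below) enforces limits poorly.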
The Rootless Containers Myth
You might have heard of "rootless Docker" or "rootless Podman." These technologies allow launching containers without root, but with significant trade-offs:
| Feature | Rootless Mode | Impact on Concourse |
|---|---|---|
| Network modes | ❌ No bridge networking | Concourse tasks need network isolation |
| Port binding <1024 | ❌ Privileged ports blocked | Some tasks need to bind port 80/443 |
| Cgroup limits | ⚠️ Limited enforcement | Can't reliably limit task resources |
| GPU passthrough | ❌ No device access | GPU workers would be impossible |
| Overlay filesystems | ⚠️ Fuse-overlay (slow) | Performance degradation for image layers |
Verdict: Rootless mode sacrifices too many features Concourse relies on. The security benefits don't outweigh the functional limitations.
Security Mitigations
Running as root doesn't mean "no security." The charm implements several layers of protection:
- Container isolation: Task containers run in namespaces with limited capabilities
- AppArmor/SELinux profiles: Kernel-level MAC (Mandatory Access Control)
- Seccomp filters: Restrict syscalls available to containers
- Network policies: Firewall rules limit worker attack surface
- Read-only root filesystem: Worker binary directories mounted read-only
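To illustrate how a seccomp filter narrows the attack surface, here is a toy allowlist model in Python. The syscall lists are illustrative, not Concourse's actual profile; real filters are BPF programs installed via `seccomp(2)`:

```python
# Toy model of a default-deny seccomp allowlist: syscalls outside the
# list fail with EPERM. Real filters are BPF programs; this only
# illustrates the decision logic, not any actual Concourse profile.

ALLOWED_SYSCALLS = {"read", "write", "openat", "close", "mmap", "execve", "exit_group"}

def seccomp_decide(syscall):
    """Return the action a default-deny filter would take for this syscall."""
    return "ALLOW" if syscall in ALLOWED_SYSCALLS else "ERRNO(EPERM)"

# Ordinary I/O is permitted; host-altering syscalls are not:
assert seccomp_decide("openat") == "ALLOW"
assert seccomp_decide("reboot") == "ERRNO(EPERM)"
assert seccomp_decide("init_module") == "ERRNO(EPERM)"
```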
Container Isolation Model
Concourse uses Linux namespaces and cgroups to isolate task containers. Understanding this model explains what tasks can and cannot do:
Namespace Isolation
| Namespace | What's Isolated | Concourse Usage |
|---|---|---|
| PID | Process IDs (containers see own PID 1) | Tasks can't see other task processes |
| Mount | Filesystem mounts | Each task has own rootfs from image |
| Network | Network stack (IP, routes, firewall) | Tasks have isolated networking (bridge mode) |
| UTS | Hostname and domain name | Each task has unique hostname |
| IPC | Inter-process communication (shared memory, semaphores) | Tasks can't IPC with other tasks |
| User | UID/GID mappings | Task UID 0 maps to host UID 100000+ (non-privileged) |
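The User row above is what makes "root in the container" safe: the kernel translates every container UID through `/proc/<pid>/uid_map`, whose entries are `(uid inside the namespace, uid outside, range length)`. A small sketch of that translation, assuming the 100000+ offset mentioned in the table:

```python
# Sketch of /proc/<pid>/uid_map translation. Each entry maps a
# contiguous UID range inside the user namespace onto a host range.

UID_MAP = [(0, 100000, 65536)]  # container 0..65535 -> host 100000..165535

def host_uid(container_uid, uid_map=UID_MAP):
    """Translate a UID inside the user namespace to the host UID."""
    for inside, outside, count in uid_map:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    raise ValueError("UID not mapped")

# "root" inside the task is an unprivileged UID on the host:
assert host_uid(0) == 100000
assert host_uid(1000) == 101000
```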
What Tasks CAN Do
- Bind to any port: Network namespace allows full port range (task's port 80 doesn't conflict with other tasks)
- Install packages: Tasks have full `apt`/`yum`/`apk` access within their rootfs
- Run as root: Inside the container, UID 0 is root (but mapped to an unprivileged UID on the host)
- Create files: Task has full write access to its own filesystem (ephemeral)
- Use full CPU: Unless limited by cgroups
What Tasks CANNOT Do
- See other tasks: PID namespace prevents `ps aux` from showing other tasks
- Access host filesystem: Mount namespace isolates `/` unless bind-mounted explicitly
- Modify kernel: Seccomp blocks syscalls like `reboot` and `init_module`
- Escape to host: User namespace remaps root to an unprivileged UID
- Persist data: Task filesystem is ephemeral (deleted after task completes)
Privileged Containers: The Exception
Concourse supports privileged containers via the `privileged: true` flag in task configs. This disables most isolation:
```yaml
task: build-docker-image
privileged: true  # ⚠️ Dangerous!
config:
  platform: linux
  image_resource:
    type: registry-image
    source: {repository: docker}
  run:
    path: docker
    args: [build, -t, myimage, .]
```
What Privileged Mode Grants
| Access | Security Impact |
|---|---|
| All Linux capabilities | ❌ Container can load kernel modules, change network config |
| Host device access | ❌ Can access /dev/sda (host disks), /dev/mem (physical memory) |
| AppArmor/SELinux bypass | ❌ MAC policies not enforced |
| Cgroup manipulation | ❌ Can escape resource limits |
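The difference can be sketched as two capability sets. The unprivileged list below mirrors common OCI runtime defaults; treat it as illustrative rather than Concourse's exact configuration:

```python
# Illustrative capability sets. The unprivileged list mirrors common
# OCI runtime defaults, not necessarily Concourse's exact set; the
# privileged set is abridged (real privileged containers get them all).

DEFAULT_CAPS = {
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID",
    "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW",
    "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP",
    "CAP_SYS_CHROOT", "CAP_AUDIT_WRITE",
}
ALL_CAPS = DEFAULT_CAPS | {"CAP_SYS_ADMIN", "CAP_NET_ADMIN", "CAP_SYS_MODULE"}

def effective_caps(privileged):
    """Privileged containers get every capability; others get the default set."""
    return ALL_CAPS if privileged else DEFAULT_CAPS

assert "CAP_SYS_ADMIN" not in effective_caps(False)
assert "CAP_SYS_MODULE" in effective_caps(True)  # can load kernel modules
```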
When Privileged Mode is Necessary
- Docker-in-Docker (DinD): Building Docker images requires nested Docker daemon
- Kernel module testing: Loading/testing kernel modules
- Low-level device access: Flashing firmware, direct block device manipulation
Containerd Configuration in the Charm
The charm configures containerd with Concourse-specific settings:
`/etc/containerd/config.toml`:

```toml
# Snapshotter for image layers
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"  # Faster than fuse-overlay

# DNS configuration
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.10"

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"

# GPU support (when compute-runtime=cuda)
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```
Key Configuration Choices
- overlayfs snapshotter: Efficient copy-on-write for image layers
- Custom DNS: the `containerd-dns-server` config option sets task container DNS
- CNI plugins: bridge networking for task isolation
- Runtime switching: GPU workers use `nvidia-container-runtime` instead of stock `runc`
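How a DNS option like `containerd-dns-server` might flow into runtime configuration can be sketched by templating a CNI network config. The network name, file layout, and subnet below are illustrative assumptions, not the charm's actual values:

```python
import json

# Hypothetical sketch: template a DNS option into a CNI conflist.
# Network name, bridge name, and subnet are illustrative, not the
# charm's actual configuration.

def render_cni_conflist(dns_server):
    conf = {
        "cniVersion": "1.0.0",
        "name": "concourse-bridge",  # illustrative network name
        "plugins": [{
            "type": "bridge",
            "bridge": "cni0",
            "ipam": {"type": "host-local", "subnet": "10.88.0.0/16"},
            "dns": {"nameservers": [dns_server]},
        }],
    }
    return json.dumps(conf, indent=2)

rendered = render_cni_conflist("10.0.0.53")
```

Containerd's CRI plugin loads such files from the `conf_dir` shown above (`/etc/cni/net.d`).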
Comparison with Other CI Systems
| CI System | Container Runtime | Isolation Model |
|---|---|---|
| Concourse CI | containerd + runc | Full namespace isolation, user namespacing |
| GitLab Runner | Docker (default) or Kubernetes | Docker-in-Docker or Kubernetes pods |
| Jenkins | Docker plugin (optional) | Varies (can run without containers) |
| GitHub Actions | Docker (self-hosted) or VM (cloud) | Full VMs for cloud runners |
| Drone CI | Docker | Docker containers |
Concourse's advantage: By using containerd directly, Concourse avoids Docker's overhead while maintaining strong isolation. This makes workers lighter and more efficient.
LXD Compatibility: Nested Containers
When workers run inside LXD containers (common for Juju localhost deployments), we have nested containerization:
```
Host (bare metal)
  ↓
LXD Container (Juju unit)
  ↓
containerd (Concourse worker)
  ↓
Task Container (Concourse task)
```
This works because:
- LXD supports nesting: `security.nesting=true` is set on the LXD container
- User namespace delegation: LXD allows inner containers to remap UIDs
- AppArmor profiles: LXD's `lxc-container-default-cgns` profile permits nested cgroups

When deploying with Juju, `security.nesting=true` is set automatically; no manual LXD configuration is needed.
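A quick way to check whether a process is itself running inside an LXD container is to look for LXD's guest API socket. This is a heuristic sketch, not something the charm relies on:

```python
import os

# Heuristic: LXD exposes a guest API socket at /dev/lxd/sock inside
# its containers. Presence is a strong signal of an LXD guest;
# absence does not prove bare metal.

def inside_lxd():
    return os.path.exists("/dev/lxd/sock")

nested = inside_lxd()  # expected True on a Juju localhost (LXD) unit
```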
Related Topics
- Tutorial: Complete Deployment Guide - See container runtime in action
- Explanation: Understanding GPU Architecture - How GPUs integrate with containerd
- Reference: Configuration Options - See `containerd-dns-*` options