Understanding Container Runtime
Why containerd (not Docker), why workers run as root, and the container isolation model
Why Containerd, Not Docker?
The charm uses containerd as the container runtime, not Docker. This is a deliberate choice made by Concourse CI upstream, not this charm. Understanding why requires understanding the container ecosystem evolution:
The Container Stack
```
┌────────────────────────┐
│ Concourse Worker       │ ← Our charm installs this
├────────────────────────┤
│ containerd             │ ← Container runtime (charm installs)
├────────────────────────┤
│ runc                   │ ← OCI runtime (spawns containers)
├────────────────────────┤
│ Linux Kernel           │ ← Namespaces, cgroups
│ (namespaces, cgroups)  │
└────────────────────────┘
```
Docker, by contrast, adds additional layers:
```
┌────────────────────────┐
│ docker CLI             │ ← User-facing tool
├────────────────────────┤
│ dockerd                │ ← Docker daemon (image management, networking)
├────────────────────────┤
│ containerd             │ ← Actually runs containers
├────────────────────────┤
│ runc                   │ ← OCI runtime
└────────────────────────┘
```
Why Containerd is Better for Concourse
| Requirement | containerd | Docker |
|---|---|---|
| Lightweight | ✅ ~50MB binary, minimal daemon | ❌ ~200MB, heavyweight daemon with many features Concourse doesn't need |
| OCI-compliant | ✅ Native OCI support | ✅ Via containerd backend |
| No unnecessary features | ✅ Just container lifecycle management | ❌ Docker Swarm, Docker Compose, legacy image formats |
| Kubernetes-compatible | ✅ Standard CRI runtime for Kubernetes | ❌ dockershim deprecated in K8s 1.20, removed in 1.24 |
| Direct control | ✅ Concourse talks directly to containerd | ❌ Extra layer of indirection |
What Concourse Doesn't Need from Docker
- docker CLI: Workers are API-driven, not user-interactive
- Docker Compose: Pipelines define multi-container workflows, not Compose files
- Docker Swarm: Concourse is its own orchestrator
- BuildKit: pipelines that build images can drive BuildKit directly (e.g. via the `oci-build-task`) rather than going through dockerd
- Legacy image formats: Modern images are OCI-compliant
Why Workers Run as Root
The concourse-worker systemd service runs as root, not an unprivileged user. This seems surprising given security best practices, but it's necessary for several reasons:
Technical Requirements
| Capability Needed | Why Root is Required |
|---|---|
| Create user namespaces | Containers need isolated UID/GID spaces. Requires CAP_SYS_ADMIN. |
| Mount filesystems | Task containers need bind mounts for caches, inputs, outputs. Requires CAP_SYS_ADMIN. |
| Manage cgroups | Resource limits (CPU, memory) enforced via cgroups. Requires root or CAP_SYS_ADMIN. |
| Network namespaces | Isolated networking per container. Requires CAP_NET_ADMIN. |
| Device access | GPUs, block devices need /dev access. Requires root or device ownership. |
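To make the cgroup row concrete: enforcing a task's resource limits means writing values into files under the task's cgroup directory, which is why the worker needs root (or `CAP_SYS_ADMIN`) on the host. A minimal sketch of that file layout, assuming cgroup v2 and a hypothetical cgroup name; the real worker delegates this to containerd/runc:

```python
# Sketch: the files a runtime writes to enforce cgroup v2 limits.
# Hypothetical helper -- the real worker delegates this to containerd/runc.

def cgroup_limit_files(cgroup_name, cpu_cores, memory_bytes):
    """Return {path: value} pairs for CPU and memory limits (cgroup v2)."""
    base = f"/sys/fs/cgroup/{cgroup_name}"
    period = 100_000                     # scheduler period in microseconds
    quota = int(cpu_cores * period)      # CPU time per period
    return {
        f"{base}/cpu.max": f"{quota} {period}",   # "200000 100000" = 2 cores
        f"{base}/memory.max": str(memory_bytes),  # hard memory ceiling
    }

# A task capped at 2 CPUs and 1 GiB:
limits = cgroup_limit_files("concourse/task-abc", 2, 1024**3)
```

Writing to these files is rejected for unprivileged processes unless the cgroup subtree has been explicitly delegated, which is part of why rootless mode (below) enforces limits poorly.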
The Rootless Containers Myth
You might have heard of "rootless Docker" or "rootless Podman." These technologies allow launching containers without root, but with significant trade-offs:
| Feature | Rootless Mode | Impact on Concourse |
|---|---|---|
| Network modes | ❌ No bridge networking | Concourse tasks need network isolation |
| Port binding <1024 | ❌ Privileged ports blocked | Some tasks need to bind port 80/443 |
| Cgroup limits | ⚠️ Limited enforcement | Can't reliably limit task resources |
| GPU passthrough | ❌ No device access | GPU workers would be impossible |
| Overlay filesystems | ⚠️ Fuse-overlay (slow) | Performance degradation for image layers |
Verdict: Rootless mode sacrifices too many features Concourse relies on. The security benefits don't outweigh the functional limitations.
Security Mitigations
Running as root doesn't mean "no security." The charm implements several layers of protection:
- Container isolation: Task containers run in namespaces with limited capabilities
- AppArmor/SELinux profiles: Kernel-level MAC (Mandatory Access Control)
- Seccomp filters: Restrict syscalls available to containers
- Network policies: Firewall rules limit worker attack surface
- Read-only root filesystem: Worker binary directories mounted read-only
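To illustrate how a seccomp filter narrows the attack surface, here is a toy allowlist model in Python. The syscall lists are illustrative, not Concourse's actual profile; real filters are BPF programs installed via `seccomp(2)`:

```python
# Toy model of a default-deny seccomp allowlist: syscalls outside the
# list fail with EPERM. Real filters are BPF programs; this only
# illustrates the decision logic, not any actual Concourse profile.

ALLOWED_SYSCALLS = {"read", "write", "openat", "close", "mmap", "execve", "exit_group"}

def seccomp_decide(syscall):
    """Return the action a default-deny filter would take for this syscall."""
    return "ALLOW" if syscall in ALLOWED_SYSCALLS else "ERRNO(EPERM)"

# Ordinary I/O is permitted; host-altering syscalls are not:
assert seccomp_decide("openat") == "ALLOW"
assert seccomp_decide("reboot") == "ERRNO(EPERM)"
assert seccomp_decide("init_module") == "ERRNO(EPERM)"
```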
Container Isolation Model
Concourse uses Linux namespaces and cgroups to isolate task containers. Understanding this model explains what tasks can and cannot do:
Namespace Isolation
| Namespace | What's Isolated | Concourse Usage |
|---|---|---|
| PID | Process IDs (containers see own PID 1) | Tasks can't see other task processes |
| Mount | Filesystem mounts | Each task has own rootfs from image |
| Network | Network stack (IP, routes, firewall) | Tasks have isolated networking (bridge mode) |
| UTS | Hostname and domain name | Each task has unique hostname |
| IPC | Inter-process communication (shared memory, semaphores) | Tasks can't IPC with other tasks |
| User | UID/GID mappings | Task UID 0 maps to host UID 100000+ (non-privileged) |
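The User row above is what makes "root in the container" safe: the kernel translates every container UID through `/proc/<pid>/uid_map`, whose entries are `(uid inside the namespace, uid outside, range length)`. A small sketch of that translation, assuming the 100000+ offset mentioned in the table:

```python
# Sketch of /proc/<pid>/uid_map translation. Each entry maps a
# contiguous UID range inside the user namespace onto a host range.

UID_MAP = [(0, 100000, 65536)]  # container 0..65535 -> host 100000..165535

def host_uid(container_uid, uid_map=UID_MAP):
    """Translate a UID inside the user namespace to the host UID."""
    for inside, outside, count in uid_map:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    raise ValueError("UID not mapped")

# "root" inside the task is an unprivileged UID on the host:
assert host_uid(0) == 100000
assert host_uid(1000) == 101000
```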
What Tasks CAN Do
- Bind to any port: Network namespace allows full port range (task's port 80 doesn't conflict with other tasks)
- Install packages: Tasks have full `apt`/`yum`/`apk` access within their rootfs
- Run as root: Inside the container, UID 0 is root (but mapped to an unprivileged UID on the host)
- Create files: Task has full write access to its own filesystem (ephemeral)
- Use full CPU: Unless limited by cgroups
What Tasks CANNOT Do
- See other tasks: PID namespace prevents `ps aux` from showing other tasks
- Access host filesystem: Mount namespace isolates `/` unless bind-mounted explicitly
- Modify kernel: Seccomp blocks syscalls like `reboot` and `init_module`
- Escape to host: User namespace remaps root to an unprivileged UID
- Persist data: Task filesystem is ephemeral (deleted after task completes)
Privileged Containers: The Exception
Concourse supports privileged containers via the `privileged: true` flag in task configs. This disables most isolation:
```yaml
task: build-docker-image
privileged: true  # ⚠️ Dangerous!
config:
  platform: linux
  image_resource:
    type: registry-image
    source: {repository: docker}
  run:
    path: docker
    args: [build, -t, myimage, .]
```
What Privileged Mode Grants
| Access | Security Impact |
|---|---|
| All Linux capabilities | ❌ Container can load kernel modules, change network config |
| Host device access | ❌ Can access /dev/sda (host disks), /dev/mem (physical memory) |
| AppArmor/SELinux bypass | ❌ MAC policies not enforced |
| Cgroup manipulation | ❌ Can escape resource limits |
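The difference can be sketched as two capability sets. The unprivileged list below mirrors common OCI runtime defaults; treat it as illustrative rather than Concourse's exact configuration:

```python
# Illustrative capability sets. The unprivileged list mirrors common
# OCI runtime defaults, not necessarily Concourse's exact set; the
# privileged set is abridged (real privileged containers get them all).

DEFAULT_CAPS = {
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID",
    "CAP_KILL", "CAP_MKNOD", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW",
    "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP",
    "CAP_SYS_CHROOT", "CAP_AUDIT_WRITE",
}
ALL_CAPS = DEFAULT_CAPS | {"CAP_SYS_ADMIN", "CAP_NET_ADMIN", "CAP_SYS_MODULE"}

def effective_caps(privileged):
    """Privileged containers get every capability; others get the default set."""
    return ALL_CAPS if privileged else DEFAULT_CAPS

assert "CAP_SYS_ADMIN" not in effective_caps(False)
assert "CAP_SYS_MODULE" in effective_caps(True)  # can load kernel modules
```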
When Privileged Mode is Necessary
- Docker-in-Docker (DinD): Building Docker images requires nested Docker daemon
- Kernel module testing: Loading/testing kernel modules
- Low-level device access: Flashing firmware, direct block device manipulation
Containerd Configuration in the Charm
The charm configures containerd with Concourse-specific settings:
`/etc/containerd/config.toml`:

```toml
# Snapshotter for image layers
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"  # Faster than fuse-overlay

# DNS configuration
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.10"

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"

# GPU support (when compute-runtime=cuda)
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```
Key Configuration Choices
- overlayfs snapshotter: Efficient copy-on-write for image layers
- Custom DNS: the `containerd-dns-server` config option sets task container DNS
- CNI plugins: bridge networking for task isolation
- Runtime switching: GPU workers use `nvidia-container-runtime` instead of stock `runc`
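How a DNS option like `containerd-dns-server` might flow into runtime configuration can be sketched by templating a CNI network config. The network name, file layout, and subnet below are illustrative assumptions, not the charm's actual values:

```python
import json

# Hypothetical sketch: template a DNS option into a CNI conflist.
# Network name, bridge name, and subnet are illustrative, not the
# charm's actual configuration.

def render_cni_conflist(dns_server):
    conf = {
        "cniVersion": "1.0.0",
        "name": "concourse-bridge",  # illustrative network name
        "plugins": [{
            "type": "bridge",
            "bridge": "cni0",
            "ipam": {"type": "host-local", "subnet": "10.88.0.0/16"},
            "dns": {"nameservers": [dns_server]},
        }],
    }
    return json.dumps(conf, indent=2)

rendered = render_cni_conflist("10.0.0.53")
```

Containerd's CRI plugin loads such files from the `conf_dir` shown above (`/etc/cni/net.d`).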
Comparison with Other CI Systems
| CI System | Container Runtime | Isolation Model |
|---|---|---|
| Concourse CI | containerd + runc | Full namespace isolation, user namespacing |
| GitLab Runner | Docker (default) or Kubernetes | Docker-in-Docker or Kubernetes pods |
| Jenkins | Docker plugin (optional) | Varies (can run without containers) |
| GitHub Actions | Docker (self-hosted) or VM (cloud) | Full VMs for cloud runners |
| Drone CI | Docker | Docker containers |
Concourse's advantage: By using containerd directly, Concourse avoids Docker's overhead while maintaining strong isolation. This makes workers lighter and more efficient.
LXD Compatibility: Nested Containers
When workers run inside LXD containers (common for Juju localhost deployments), we have nested containerization:
```
Host (bare metal)
  ↓
LXD Container (Juju unit)
  ↓
containerd (Concourse worker)
  ↓
Task Container (Concourse task)
```
This works because:
- LXD supports nesting: `security.nesting=true` is set on the LXD container
- User namespace delegation: LXD allows inner containers to remap UIDs
- AppArmor profiles: LXD's `lxc-container-default-cgns` profile permits nested cgroups

When deploying with Juju, `security.nesting=true` is set automatically; no manual LXD configuration is needed.
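A quick way to check whether a process is itself running inside an LXD container is to look for LXD's guest API socket. This is a heuristic sketch, not something the charm relies on:

```python
import os

# Heuristic: LXD exposes a guest API socket at /dev/lxd/sock inside
# its containers. Presence is a strong signal of an LXD guest;
# absence does not prove bare metal.

def inside_lxd():
    return os.path.exists("/dev/lxd/sock")

nested = inside_lxd()  # expected True on a Juju localhost (LXD) unit
```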
Related Topics
- Tutorial: Complete Deployment Guide - See container runtime in action
- Explanation: Understanding GPU Architecture - How GPUs integrate with containerd
- Reference: Configuration Options - See `containerd-dns-*` options