A Brief Introduction to Linux Containers (LXC)

Shih-Yuan Lee (FourDollars)

@ COSCUP 2014.07.19

OK, thank you, Penk, for the introduction. Let's get started.


Free software worker, currently at Canonical, developing Ubuntu OEM derivative versions. Regularly attends the TOSSUG and Hacking Thursday community meetups in Taipei.


Canonical is hiring


Taipei 101, 46-47F



Virtualization on Linux

This part is background. First, a brief introduction to the virtualization technologies used on Linux, focusing on x86.


Uses CPU virtualization support such as Intel Virtualization Technology (Intel VT) and AMD Virtualization (AMD-V) to run an OS that normally runs on PC hardware directly inside a virtual machine.

Full virtualization

Almost complete simulation of the actual hardware to allow software, which typically consists of a guest operating system, to run unmodified.

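Full virtualization emulates an entire computer, so an OS that is normally installed on a physical machine can be installed in the virtual environment without modification. To make it run more smoothly, however, you need to install additional drivers. VirtualBox users will know this: after installing a Linux system in VirtualBox, you still need to install some extra drivers.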


This image is from Wikipedia. It looks roughly like this: each guest OS has its own virtual hardware.


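The guest OS needs to know that it is running in a virtualized environment, so its kernel and drivers must be modified. A paravirtualized guest OS is called a PV guest, and a paravirtualized driver is called a PV driver.

Paravirtualization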

A hardware environment is not simulated; however, the guest programs are executed in their own isolated domains, as if they are running on a separate system. Guest programs need to be specifically modified to run in this environment.

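Paravirtualization, by contrast, requires installing a modified Linux kernel and drivers. Well... I am not very familiar with this area; I am only mentioning that it exists.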


This image is from Fedora.


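Uses features provided by the operating system to isolate each guest OS's execution environment while sharing the host OS kernel. From inside the guest, it looks like an independent environment.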

Operating system-level virtualization

The same OS kernel is used to implement the "guest" environments. Applications running in a given "guest" environment view it as a stand-alone system.

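This is today's topic. It is neither full virtualization nor paravirtualization; it simply creates a special container. The Linux kernel used inside the container is the same one as outside, and the system environment is merely isolated by features that the Linux kernel provides.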

Linux Containers

Official website: https://linuxcontainers.org

"LXC is often considered as something in the middle between a chroot on steroids and a full fledged virtual machine. The goal of LXC is to create an environment as close as possible as a standard Linux installation but without the need for a separate kernel."

Now for today's topic, Linux Containers. The description above is quoted from the official website. (Read the quote aloud.)


Statistics as of 2014.07.16, from git shortlog -sne

551  Stéphane Graber <stgraber [at] ubuntu.com>
529  Serge Hallyn <serge.hallyn [at] ubuntu.com>
243  Dwight Engen <dwight.engen [at] oracle.com>
200  Daniel Lezcano <daniel.lezcano [at] free.fr>
190  dlezcano <dlezcano>
140  Daniel Lezcano <dlezcano [at] fr.ibm.com>
116  Michel Normand <normand [at] fr.ibm.com>
 80  KATOH Yasufumi <karma [at] jazz.email.ne.jp>
 77  S.Çağlar Onur <caglar [at] 10ur.org>
 65  Christian Seiler <christian [at] iwakd.de>
 59  Natanael Copa <ncopa [at] alpinelinux.org>
 47  Serge Hallyn <serge.hallyn [at] canonical.com>
 29  Michael H. Warfield <mhw [at] WittsEnd.com>
 26  Qiang Huang <h.huangqiang [at] huawei.com>

Let's start with the developers. The list above is the output of running that command on the git repository on 2014.07.16.



576  Serge Hallyn <serge.hallyn [at] ubuntu.com>
551  Stéphane Graber <stgraber [at] ubuntu.com>
530  Daniel Lezcano <dlezcano [at] fr.ibm.com>
243  Dwight Engen <dwight.engen [at] oracle.com>
116  Michel Normand <normand [at] fr.ibm.com>
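The merged totals above combine entries that belong to the same person (529 + 47 for Serge Hallyn, 200 + 190 + 140 for Daniel Lezcano). A minimal sketch of how that merge could be automated follows; the function name `merge_shortlog` and the alias table are hypothetical, and it assumes the raw `git shortlog -sne` format of a count, an author name, and a final `<email>` field without spaces.

```shell
# Merge `git shortlog -sne` lines that belong to the same person.
# Input lines look like: "   529<TAB>Serge Hallyn <serge.hallyn@ubuntu.com>"
merge_shortlog() {
    awk '
    BEGIN {
        # Hypothetical alias table: map an alternate author name to the canonical one.
        alias["dlezcano"] = "Daniel Lezcano"
    }
    NF >= 3 {
        # Field 1 is the commit count, the last field is the <email>,
        # everything in between is the author name.
        name = $2
        for (i = 3; i < NF; i++) name = name " " $i
        if (name in alias) name = alias[name]
        total[name] += $1
    }
    END {
        for (name in total) printf "%d\t%s\n", total[name], name
    }' | sort -rn
}

# Usage, inside a checkout of the LXC repository:
#   git shortlog -sne | merge_shortlog
```

Git itself can do this kind of merging through a `.mailmap` file; the awk version is only meant to make the arithmetic behind the slide visible.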

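The original author, Daniel Lezcano, came from IBM

The main commercial backing comes from Canonical, IBM, and Oracle

Next, once the duplicate entries are merged, it becomes clear that it is mainly these three companies that employ full-time developers to contribute. So why does Canonical, my employer, invest in LXC development?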

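Ubuntu-related applications


First, Ubuntu Juju

Ubuntu Juju is a tool and platform for rapidly building cloud deployments; its goal is to let users set up a website easily and painlessly. If you install and use it on your local machine, it uses LXC. There is a YouTube video that you can watch after the talk, but first let's look at a demo.

Next, let's look at Ubuntu Touch

Ubuntu Touch is a system that Canonical developed for phones and tablets. It shares all of its software packages with regular Ubuntu but adds some extra software distribution mechanisms. Let's take a quick look at the Ubuntu Touch internal design documents to see where LXC is used.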

Features provided by the Linux kernel

man lxc

    * General setup
      * Control Group support
        -> Namespace cgroup subsystem
        -> Freezer cgroup subsystem
        -> Cpuset support
        -> Simple CPU accounting cgroup subsystem
        -> Resource counters
          -> Memory resource controllers for Control Groups
      * Group CPU scheduler
        -> Basis for grouping tasks (Control Groups)
      * Namespaces support
        -> UTS namespace
        -> IPC namespace
        -> User namespace
        -> Pid namespace
        -> Network namespace
    * Device Drivers
      * Character devices
        -> Support multiple instances of devpts
      * Network device support
        -> MAC-VLAN support
        -> Virtual ethernet pair device
    * Networking
      * Networking options
        -> 802.1d Ethernet Bridging
    * Security options
      -> File POSIX Capabilities

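Let's see which features the Linux kernel provides. If you run man lxc, you will find a section listing Linux kernel build options. If you look up the help text for those options in the Linux kernel source tree, you will see what follows.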


Control Group support

This option adds support for grouping sets of processes together, for use with process control subsystems such as Cpusets, CFS, memory controls or device isolation.


- Documentation/scheduler/sched-design-CFS.txt   (CFS)
- Documentation/cgroups/ (features for grouping, isolation
                          and resource control)

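Control Group, also called cgroup, is the main option. The many cgroup subsystems that follow, also called controllers, all depend on it.

cgroups let processes be placed into separate groups, and each group can then be operated on differently through the controllers.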
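To see which group a process has been placed in, the kernel exposes `/proc/<pid>/cgroup`, where each line has the form `hierarchy-ID:controller-list:cgroup-path`. The sketch below is one way to read it; the helper name `cgroup_path` is hypothetical, and it assumes the cgroup v1 layout in which each controller has its own hierarchy.

```shell
# Print the cgroup path of a process for one controller.
#   $1 = controller name (for example: freezer, cpuset, memory)
#   $2 = file to read (defaults to /proc/self/cgroup)
cgroup_path() {
    awk -F: -v want="$1" '
    {
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        n = split($2, ctl, ",")
        for (i = 1; i <= n; i++) {
            if (ctl[i] == want) {
                path = $0
                sub(/^[^:]*:[^:]*:/, "", path)   # keep everything after the second colon
                print path
                exit
            }
        }
    }' "${2:-/proc/self/cgroup}"
}

# Usage: cgroup_path freezer              (current shell)
#        cgroup_path memory /proc/1/cgroup
```

Run against a process inside a container, the path would typically name the container's own group, while a process on the host usually reports the root group `/` or a session group.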


Namespace cgroup subsystem

Provides a simple namespace cgroup subsystem to provide hierarchical naming of sets of namespaces, for instance virtual servers and checkpoint/restart jobs.


The namespace controller lets cgroups make use of the namespace feature.

Namespaces are the other main feature. They are covered in more detail later, so we skip them for now.


Freezer cgroup subsystem

Provides a way to freeze and unfreeze all tasks in a cgroup.

As the description suggests, this is used to freeze all the processes in a group.
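In cgroup v1 the freezer controller exposes a `freezer.state` file per group, which reads THAWED, FREEZING, or FROZEN. The sketch below reads it; the helper name `freezer_state`, the `CGROUP_ROOT` variable, and the `lxc/sid` group path are assumptions for illustration, since the mount point and group naming depend on the distribution and LXC version.

```shell
# Print the freezer state (THAWED, FREEZING or FROZEN) of a cgroup.
#   $1 = cgroup path relative to the freezer hierarchy, for example lxc/sid
#   CGROUP_ROOT = mount point of the cgroup v1 hierarchies (assumed /sys/fs/cgroup)
freezer_state() {
    cat "${CGROUP_ROOT:-/sys/fs/cgroup}/freezer/$1/freezer.state"
}

# Usage: freezer_state lxc/sid
```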


Cpuset support

This option will let you create and manage CPUSETs which allow dynamically partitioning a system into sets of CPUs and Memory Nodes and assigning tasks to run only within those sets. This is primarily useful on large SMP or NUMA systems.

Put simply, it specifies which CPUs a process is allowed to run on.


Simple CPU accounting cgroup subsystem

Provides a simple Resource Controller for monitoring the total CPU consumed by the tasks in a cgroup.

Accounts for the CPU time consumed by the processes in each cgroup.


Resource counters

This option enables controller independent resource accounting infrastructure that works with cgroups.



Memory resource controllers for Control Groups

Provides a memory resource controller that manages both anonymous memory and page cache. (See Documentation/cgroups/memory.txt)

Note that setting this option increases fixed memory overhead associated with each page of memory in the system. By this, 8(16)bytes/PAGE_SIZE on 32(64)bit system will be occupied by memory usage tracking struct at boot. Total amount of this is printed out at boot.

Only enable when you're ok with these trade offs and really sure you need the memory resource controller. Even when you enable this, you can set "cgroup_disable=memory" at your boot option to disable memory resource controller and you can avoid overheads. (and lose benefits of memory resource controller)

This config option also selects MM_OWNER config option, which could in turn add some fork/exit overhead.



Group CPU scheduler

This feature lets CPU scheduler recognize task groups and control CPU bandwidth allocation to such task groups. It uses cgroups to group tasks.

Controls how CPU time is scheduled among groups of processes.


Namespaces support

Provides the way to make tasks work with different objects using the same id. For example same IPC id may refer to different objects or same user id or pid may refer to different tasks when used in different namespaces.

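Lets a container use the same IDs as the outside world, such as process IDs, user IDs, and IPC IDs.


For example, the init inside the container has PID 1, and the init outside the container also has PID 1. Seen from outside, however, the container's init is not PID 1; it has some other number.

Let's look at the init example in practice.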
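One way to check from the host whether two processes live in different namespaces is to compare the links under `/proc/<pid>/ns/`, which resolve to identifiers such as `pid:[...]`. This is a minimal sketch, assuming a kernel new enough to expose those links; the helper name `same_ns` is hypothetical.

```shell
# Succeed when two processes share the same namespace of the given type.
#   $1, $2 = process IDs as seen from the host
#   $3     = namespace type (pid, uts, ipc, net, mnt, user)
# Returns 2 when a link cannot be read (no such process, or no permission).
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}

# Usage (as root, so other users' processes are readable):
#   same_ns 1 13843 pid || echo "different PID namespace, or unreadable"
```

Here 13843 stands for the host-side PID of the container's init, as reported by lxc-info; comparing it with the host's PID 1 would be expected to fail for a running container.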


UTS namespace

In this namespace tasks see different info provided with the uname() system call

Lets uname inside the container report different results. (Example: uname -m inside sudo lxc-start -n wheezy-sh4.)


IPC namespace

In this namespace tasks work with IPC ids which correspond to different IPC objects in different namespaces.

Makes IPC IDs independent inside the container.


User namespace

This allows containers, i.e. vservers, to use user namespaces to provide different user info for different servers.

When user namespaces are enabled in the kernel it is recommended that the MEMCG and MEMCG_KMEM options also be enabled and that user-space use the memory control groups to limit the amount of memory a memory unprivileged users can use.

Makes user IDs independent inside the container, and allows memory usage limits to be applied to unprivileged users.


Pid namespace

Support process id namespaces. This allows having multiple processes with the same pid as long as they are in different pid namespaces. This is a building block of containers.

Makes process IDs independent inside the container.


Network namespace

Allow user space to create what appear to be multiple instances of the network stack.

Lets user space create multiple instances of the network stack, so that each container appears to have its own set of Ethernet interfaces.


Support multiple instances of devpts

Enable support for multiple instances of devpts filesystem. If you want to have isolated PTY namespaces (eg: in containers), say Y here. Otherwise, say N. If enabled, each mount of devpts filesystem with the '-o newinstance' option will create an independent PTY namespace.

Creates things like /dev/tty1 inside the container. The lxc-console command, covered later, uses this feature.


MAC-VLAN support

This allows one to create virtual interfaces that map packets to or from specific MAC addresses to a particular interface.

Macvlan devices can be added using the "ip" command from the iproute2 package starting with the iproute2-2.6.23 release:

"ip link add link <real dev> [ address MAC ] [ NAME ] type macvlan"

To compile this driver as a module, choose M here: the module will be called macvlan.



Virtual ethernet pair device

This device is a local ethernet tunnel. Devices are created in pairs. When one end receives the packet it appears on its pair and vice versa.

Connects the network inside a Linux Container to the network outside, a bit like plugging both ends of a virtual network cable together.


802.1d Ethernet Bridging

If you say Y here, then your Linux box will be able to act as an Ethernet bridge, which means that the different Ethernet segments it is connected to will appear as one Ethernet to the participants. Several such bridges can work together to create even larger networks of Ethernets using the IEEE 802.1 spanning tree algorithm. As this is a standard, Linux bridges will cooperate properly with other third party bridge products.

In order to use the Ethernet bridge, you'll need the bridge configuration tools; see <file:Documentation/networking/bridge.txt> for location. Please read the Bridge mini-HOWTO for more information.

If you enable iptables support along with the bridge support then you turn your bridge into a bridging IP firewall. iptables will then see the IP packets being bridged, so you need to take this into account when setting up your firewall rules. Enabling arptables support when bridging will let arptables see bridged ARP traffic in the arptables FORWARD chain.

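Joins several Ethernet interfaces, such as the host-side ends of the containers' veth pairs and a physical NIC, into what looks like a single Ethernet, much like ports on one switch.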
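The veth pair and the bridge come together in the container's configuration. The sketch below prints a network stanza using the LXC 1.0 key names (`lxc.network.*`); the function name `lxc_veth_config` is hypothetical, and `lxcbr0` is assumed to be the bridge that Ubuntu's lxc package sets up.

```shell
# Print an LXC 1.0 network stanza that attaches a container to a host bridge
# through a veth pair.
#   $1 = host bridge name (lxcbr0 is assumed here; any existing bridge works)
#   $2 = optional fixed MAC address for the container-side interface
lxc_veth_config() {
    printf 'lxc.network.type = veth\n'
    printf 'lxc.network.link = %s\n' "$1"
    printf 'lxc.network.flags = up\n'
    if [ -n "${2:-}" ]; then
        printf 'lxc.network.hwaddr = %s\n' "$2"
    fi
}

# Usage: lxc_veth_config lxcbr0
```

Templates normally write lines like these into the container's config file already, so the output is best used for comparison rather than appended blindly.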


File POSIX Capabilities

This enables filesystem capabilities, allowing you to give binaries a subset of root's powers without using setuid 0.

(This option was removed in Linux kernel 2.6.33; file capabilities are always enabled from that version on.)


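Using Ubuntu 14.04 as the example

Install lxc

$ sudo apt-get install lxc lxc-templates

Next, a quick introduction to a few lxc commands. First, of course, lxc has to be installed on the system before it can be used.

Check whether the system supports LXC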

$ lxc-checkconfig
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-3.13.0-32-generic
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled
Multiple /dev/pts instances: enabled

--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
File capabilities: enabled

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
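When the report is long, it can help to list only the features that are not enabled. The following is a minimal sketch that filters the output format shown above; the function name `missing_features` is hypothetical, and the color-code stripping assumes the tool emits ordinary ANSI escape sequences on a terminal.

```shell
# List the features that lxc-checkconfig does not report as "enabled".
# Reads the report on standard input and prints one feature name per line.
missing_features() {
    esc=$(printf '\033')
    # Strip ANSI color sequences, then keep "name: status" lines whose
    # status is something other than "enabled".
    sed "s/${esc}\[[0-9;]*m//g" |
    awk -F': ' 'NF == 2 && $1 !~ /^(Note|usage)/ && $2 !~ /^enabled/ { print $1 }'
}

# Usage: lxc-checkconfig | missing_features
```

On the Ubuntu 14.04 system shown above, this would be expected to print nothing, since every feature is reported as enabled.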

See which templates are available

$ tree /usr/share/lxc/templates
├── lxc-alpine
├── lxc-altlinux
├── lxc-archlinux
├── lxc-busybox
├── lxc-centos
├── lxc-cirros
├── lxc-debian
├── lxc-download
├── lxc-fedora
├── lxc-gentoo
├── lxc-openmandriva
├── lxc-opensuse
├── lxc-oracle
├── lxc-plamo
├── lxc-sshd
├── lxc-ubuntu
└── lxc-ubuntu-cloud

0 directories, 17 files

Example: creating Debian sid (amd64)

Each template has its own usage help

$ sudo lxc-create -t debian -h

Create

$ sudo lxc-create -t debian -n sid -- -r sid -a amd64

Destroy

$ sudo lxc-destroy -n sid

Operating a Linux Container

Start

$ sudo lxc-start -d -n sid

Freeze

$ sudo lxc-freeze -n sid

Unfreeze

$ sudo lxc-unfreeze -n sid

Stop

$ sudo lxc-stop -n sid

Querying Linux Containers


$ sudo lxc-ls -f
NAME            STATE    IPV4       IPV6  AUTOSTART
sid             FROZEN  -     NO


$ sudo lxc-info -n sid
Name:           sid
State:          FROZEN
PID:            13843
CPU use:        0.59 seconds
Memory use:     24.69 MiB
KMem use:       0 bytes
Link:           vethL2RL9Y
 TX bytes:      2.49 KiB
 RX bytes:      24.61 KiB
 Total bytes:   27.09 KiB
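For scripting, a single field can be pulled out of this report. The sketch below parses the `label: value` layout shown above; the function name `lxc_info_field` is hypothetical, and it assumes labels are matched exactly, without the trailing colon.

```shell
# Extract one field from `lxc-info` output read on standard input.
#   $1 = field label without the colon, for example State or PID
lxc_info_field() {
    awk -F': *' -v key="$1" '
    {
        label = $1
        sub(/^ +/, "", label)        # nested fields such as "TX bytes" are indented
        if (label == key) { print $2; exit }
    }'
}

# Usage: sudo lxc-info -n sid | lxc_info_field PID
```

The PID reported here is the container's init as seen from the host, which is the "other number" mentioned in the namespace discussion.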

Entering a Linux Container

$ sudo lxc-console -n sid

This is where the devpts kernel build option mentioned earlier comes in. The command emulates tty1 of a plain console environment, and you can run it again to get tty2, tty3, and so on.



Some applications built on top of LXC

Steam for Linux


Running in a LXC container on Ubuntu

Demo on Ubuntu 12.04: http://youtu.be/IorxJsw09vY

sudo apt-add-repository ppa:ubuntu-lxc/stable
sudo apt-get update
sudo apt-get install steam-lxc
sudo mkdir -p /var/lib/lxc /var/cache/lxc
sudo steam-lxc create
sudo steam-lxc run

LXC Web Panel


LXC provider for Vagrant




Docker running under Juju


Project Atomic




Attribution 4.0 International (CC BY 4.0)



These slides were made with Hovercraft

