Understanding Linux Container Scheduling

At Squarespace we are currently in the process of migrating our many Java microservices from traditional virtual machines to Docker containers running on Kubernetes. Docker containers are complete, shippable packages of software and its dependencies, and they can often be thought of as lightweight VMs. While this is a convenient simplification, it is important to understand how containers are implemented using Linux control groups (cgroups) and namespaces. Understanding these features and their limitations has helped us dramatically improve the performance of our Java services, especially under stressful scenarios.

All containers running on a host ultimately share the same kernel and resources. In fact, Docker containers are not a first-class concept in Linux, but instead just a group of processes that belong to a combination of Linux namespaces and control groups (cgroups). System resources such as CPU, memory, disk, and network bandwidth can be restricted by these cgroups, providing mechanisms for resource isolation. Namespaces are then used to limit the visibility of a process into the rest of the system through the ipc, mnt, net, pid, user, cgroup, and uts namespace subsystems. Note that the cgroup namespace only limits a process's view of the cgroup hierarchy; cgroups themselves are not namespaces.

Any process not explicitly assigned to a cgroup is automatically included in the root cgroup. On the CentOS Linux distribution, the root cgroup and any children are mounted as a mutable filesystem at /sys/fs/cgroup. (You can check with mount if you are on a different Linux distribution.) A user with sufficient privileges can easily create cgroups, modify them, or move tasks into them using basic shell commands or the higher-level utilities provided by the libcgroup-tools package. Of particular interest are the cpu and cpuacct cgroup subsystems, which are mounted together at /sys/fs/cgroup/cpu,cpuacct; the symlink /sys/fs/cgroup/cpu can also be used for simplicity. The cpuacct subsystem is simple: it solely collects CPU runtime information. The cpu subsystem, however, schedules CPU access for each cgroup using either the Completely Fair Scheduler (CFS), the default on Linux and Docker, or the Real-Time Scheduler (RT).
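
For illustration, a cgroup can be created and tuned by hand with nothing more than the filesystem interface described above. This is a minimal sketch assuming a cgroups v1 system with the cpu controller mounted as shown; the group name test-group is arbitrary:

# mkdir /sys/fs/cgroup/cpu/test-group
# echo 512 > /sys/fs/cgroup/cpu/test-group/cpu.shares
# echo $$ > /sys/fs/cgroup/cpu/test-group/tasks

The libcgroup-tools utilities (cgcreate, cgset, and cgclassify) achieve the same result.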

When we run the Docker container image quay.io/klynch/java-simple-http-server, the Docker daemon creates a container and spawns a single Java process within it. The daemon assigns the container a unique identifier, 31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267, which is used to label and identify the various components that constitute the container. The identifier has no real significance to the kernel, however.
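
For reference, the container in this example can be started with a command along these lines; the container name java-http (used in the docker exec example later) and the exact flags are our choices rather than anything required by the image:

# docker run -d --name java-http quay.io/klynch/java-simple-http-server
31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267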

By default, Docker creates a new pid namespace for this container, isolating its processes from those in other namespaces; the Java process is attached to this new pid namespace before execution and is assigned PID 1 by the Linux kernel. However, the process is not entirely hidden from the rest of the system. Because pid namespaces are nested, every namespace except the initial root namespace has a parent namespace, and a process in a parent namespace can see all processes in its child namespaces. This means that a process running in the root namespace, such as our shell, can see every process running on the system. In our example, we can see that the java process has PID 30968 on the host. We can also see the cgroups and namespaces our process is assigned to:

# cat /proc/30968/cgroup
11:cpuacct,cpu:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
10:net_prio,net_cls:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
9:freezer:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
8:memory:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
7:pids:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
6:perf_event:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
5:devices:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
4:blkio:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
3:cpuset:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
2:hugetlb:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
1:name=systemd:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267

# ls -l /proc/30968/ns/*
lrwxrwxrwx 1 root root 0 Jun  7 14:16 ipc -> ipc:[4026532461]
lrwxrwxrwx 1 root root 0 Jun  7 14:16 mnt -> mnt:[4026532459]
lrwxrwxrwx 1 root root 0 Jun  7 15:41 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Jun  7 14:16 pid -> pid:[4026532462]
lrwxrwxrwx 1 root root 0 Jun  7 15:41 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Jun  7 14:16 uts -> uts:[4026532460]

We can also verify which container a process belongs to by grepping for our pid in /sys/fs/cgroup/cpu,cpuacct/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267/tasks, or by running systemd-cgls and searching for the process in question. However, this does not tell us what our process is mapped to inside of the container! That mapping is exposed by the NSpid field of the process status file. Unfortunately, this field was introduced in a kernel patch that has not yet been backported to the CentOS 7.3 kernel; in practice, though, it is usually simple to identify the appropriate process inside a container. On a kernel that exposes the field, the following command shows that our process maps to PID 1 inside its namespace.

# grep NSpid /proc/30968/status
NSpid:  30968    1

We can verify that the view from inside our process's namespaces is a little different. The docker exec command lets us run an interactive shell, provided our container image includes a shell binary, and is a much simpler solution for most cases than the nsenter utility. After running the exec, we get a shell prompt that shares the same namespaces as our java process, including the pid namespace.

# docker exec -it java-http bash

# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  5.8 18.6 4680796 724080 ?      Ssl  05:10  41:51 java SimpleHTTPServer

Likewise, the view of the cgroup hierarchy inside the container is restricted to the container's own cgroups, further isolating our process from other processes running on the system. This gives us direct access to any constraints placed on our container; for example, we can read CPU scheduling and usage information directly from within the container. This information allows us to tune the JVM based on the resources actually allocated to it.

# ls -l /sys/fs/cgroup/cpuacct,cpu
-rw-r--r-- 1 root root 0 Jun  7 05:10 cgroup.clone_children
--w--w--w- 1 root root 0 Jun  7 05:10 cgroup.event_control
-rw-r--r-- 1 root root 0 Jun  7 17:04 cgroup.procs
-rw-r--r-- 1 root root 0 Jun  7 05:10 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Jun  7 05:10 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Jun  7 05:10 cpu.rt_period_us
-rw-r--r-- 1 root root 0 Jun  7 05:10 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 Jun  7 05:10 cpu.shares
-r--r--r-- 1 root root 0 Jun  7 05:10 cpu.stat
-r--r--r-- 1 root root 0 Jun  7 05:10 cpuacct.stat
-rw-r--r-- 1 root root 0 Jun  7 05:10 cpuacct.usage
-r--r--r-- 1 root root 0 Jun  7 05:10 cpuacct.usage_percpu
-rw-r--r-- 1 root root 0 Jun  7 05:10 notify_on_release
-rw-r--r-- 1 root root 0 Jun  7 05:10 tasks

Scheduling

When we think of containers as lightweight VMs, it is natural to think in terms of discrete resources such as a number of processors. However, the Linux kernel schedules processes dynamically, much as a hypervisor schedules virtual CPUs onto physical hardware. The default scheduler used in Linux and Docker is CFS, the Completely Fair Scheduler. Scheduling cgroups with CFS requires us to think in terms of time slices instead of processor counts. The cpu cgroup subsystem is in charge of scheduling and can be tuned to provide a relative minimum share of CPU time as well as a hard ceiling that caps tasks from using more resources than provisioned. The two tunables behave differently and may appear confusing at first.

CPU Shares

CPU shares provide tasks in a cgroup with a relative amount of CPU time, giving those tasks an opportunity to run. The file cpu.shares defines the number of shares allocated to the cgroup. The amount of time allocated to a given cgroup is its number of shares divided by the total number of shares available, and this proportional allocation is calculated at each level of the cgroup hierarchy. In CentOS, this begins with the root / cgroup, which holds 1024 shares and 100% of CPU resources. The root cgroup itself typically contains only a small number of critical system processes and the initial systemd process. The rest of the resources are then offered to the groups /system.slice (system services), /user.slice (user processes), and /docker (Docker containers), each with an equal default weight of 1024.

On minimal CentOS installations, we can typically ignore the impact of system services and user processes, which lets the scheduler offer nearly all of the CPU time to the /docker group, divided proportionally to each container's share. For example, if there are three containers with weights of 2048, 1024, and 1024 on a four-core system, the first cgroup will be allocated the equivalent of two cores and the two remaining cgroups will each be given the equivalent of one core. If the tasks in a cgroup are idle and not waiting to run, their unused shares are placed in a global pool for other tasks to consume. Thus, if the first cgroup has only a single runnable task, the shares it cannot use are returned to the global pool.
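
To see how shares are wired up in practice, Docker exposes cpu.shares through the --cpu-shares flag and writes the value into the container's cgroup. A brief sketch, with an illustrative container name:

# docker run -d --name big-shares --cpu-shares 2048 quay.io/klynch/java-simple-http-server
# cat /sys/fs/cgroup/cpu/docker/$(docker inspect --format '{{.Id}}' big-shares)/cpu.shares
2048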

CPU Quotas

While CPU shares cannot guarantee a minimum amount of CPU time without complete control of the system, it is much easier to enforce a hard limit on the CPU time allocated to processes. CPU bandwidth control for CFS was introduced to prevent tasks from exceeding the total CPU time allocated to their cgroup. By default, quotas are disabled, with cpu.cfs_quota_us set to -1. If enabled, CFS quotas permit a group to execute for up to cpu.cfs_quota_us microseconds within a period of cpu.cfs_period_us microseconds (100 ms by default). If a group's tasks are unconstrained, they are permitted to use as many unused resources as are available on the host. By adjusting a cgroup's quota relative to its period, we can effectively assign entire cores to a group: a quota of 100 ms allows the tasks in the group to run for a total of 100 ms during each 100 ms window, the equivalent of one full core. Within the container we can easily calculate the number of cores available to us by dividing cpu.cfs_quota_us by cpu.cfs_period_us, allowing us to scale our processes appropriately!
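
For example, from inside a container we can read both values and derive the effective core count; the numbers below are illustrative and assume the cgroup filesystem is visible at the path described earlier:

# cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
200000
# cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
# echo $(( $(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us) / $(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us) ))
2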

If two tasks in the same cgroup are executing on different cores, each contributes to the quota. If the entire quota is exhausted while there are still tasks waiting to execute, the group is throttled, even if the host has unused processor capacity. The number of periods in which a cgroup was throttled and the accumulated throttled time in nanoseconds are reported in the cpu.stat file as the nr_throttled and throttled_time statistics. Likewise, if enough tasks are left in the waiting state long enough, we may see the load average increase. Performance tools like Netflix Vector can help easily identify throttled containers, and the systemd-cgtop utility can be used to show how many resources each cgroup is consuming.
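
A quick way to check whether a container is being throttled is to read cpu.stat from its cgroup; the values below are illustrative:

# cat /sys/fs/cgroup/cpu/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267/cpu.stat
nr_periods 18435
nr_throttled 1207
throttled_time 23479374827

A steadily climbing nr_throttled relative to nr_periods is a good sign that the cgroup's quota is too small for its workload.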

When scheduling containers using quotas, it is important that processes are given an appropriate window of time to execute. If a cgroup is consistently being throttled, it likely is not being allocated enough resources. This is particularly true when running a complex system like the JVM, which makes many assumptions about the machine it is running on. Because the JVM can still see every core on the host, it sizes its pool of garbage collector threads to the number of physical cores, regardless of any quota limit. This can have disastrous consequences when running the JVM on a 64-core machine while limiting it to the equivalent of 2 cores, as the garbage collector can cause longer-than-expected application pauses. Additionally, use of the Java 8 parallel streams feature can cause similar issues, since its common thread pool is sized the same way.

We prevent our container from being throttled prematurely, and give our application threads more opportunities to execute, by limiting the number of JVM threads to at most the number of cores available to the container. Our base Docker image automatically detects the resources available to the container and tunes the JVM accordingly at start time. Setting the flags -XX:ParallelGCThreads, -XX:ConcGCThreads, and -Djava.util.concurrent.ForkJoinPool.common.parallelism prevents many unnecessary pauses. However, many JVM components rely on Runtime.getRuntime().availableProcessors(), which still returns the number of physical cores available. To overcome this, we compile and load a C library that overrides the JVM native function JVM_ActiveProcessorCount and returns our calculated value instead. This gives us complete control to limit all dynamically scalable aspects of the JVM without performance penalties.
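
The following is a minimal sketch of the kind of entrypoint logic such an image might perform; it is not our actual base image, and it assumes the cgroup files are readable at the path shown earlier and that the quota is rounded up to at least one core:

#!/bin/bash
# Derive an effective core count from the CFS quota; fall back to the host
# core count when no quota is set (cpu.cfs_quota_us is -1 in that case).
QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
PERIOD=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [ "$QUOTA" -gt 0 ]; then
    CORES=$(( (QUOTA + PERIOD - 1) / PERIOD ))
else
    CORES=$(nproc)
fi

# Size the GC and common fork/join pools to the cores we were actually given.
exec java \
    -XX:ParallelGCThreads="$CORES" \
    -XX:ConcGCThreads="$CORES" \
    -Djava.util.concurrent.ForkJoinPool.common.parallelism="$CORES" \
    SimpleHTTPServer

Overriding Runtime.getRuntime().availableProcessors() itself still requires the native JVM_ActiveProcessorCount override described above.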

Conclusion

In this post we have looked at how Linux cgroups assign and schedule resources for Docker containers. Because resource requirements are highly variable, it is typically not possible to partition resources predictably. However, cgroups allow us to sanely partition resources and easily schedule our container-based processes using the Completely Fair Scheduler. While this post focused on cgroups v1, it is important to know that this will be changing in the future. A second version of cgroups was introduced in the 4.5 kernel to simplify the complexities of the first version; it replaces the multiple per-controller hierarchies with a single unified hierarchy, a model that is simpler to implement and understand. However, its scheduling features are still being worked out and will likely not land in the RHEL kernel for a very long time. For now, we must rely on the first version to limit and schedule our containers. By understanding how cgroups operate, we are able to tune the JVM appropriately and schedule many Java microservice instances on a single machine without any performance loss. This will allow us to continue to rapidly convert our microservices to containers and drastically simplify our deployment process.

If these types of challenges interest you, the Squarespace SRE team would love to hear from you!
