This is getting a little long in the tooth, but I created a few videos and did some tests around cgroups back in 2013-2014. Both of these help explain the underpinnings of Docker, and I guess Kubernetes now too. Turtles all the way up. I'm just mentioning this since these are still some of my most popular videos, and this is still the core tech under the hood.
A personal note here too. cgroups were invented at Google in the early 2000's. If you're using Search, Gmail, Docs, Maps, etc. today, you're using cgroups. It sounds simple, but this tech really powered a whole wave of innovation/startups/projects that almost everyone interacts with on a daily basis, either through touching a Google service or through interacting with anyone using Docker or Kubernetes (running on cgroups/namespaces). Pretty impressive.
https://sysadmincasts.com/episodes/14-introduction-to-linux-...
https://sysadmincasts.com/episodes/24-introduction-to-contai...
_cgroups were invented at Google in the early 2000's._
In my understanding, most of the initial commits (at least for LXC, which was the initial userspace) came from IBM, who funded it out of interest in kernel-level resource balkanization for their largest mainframes. Google's Kubernetes only appeared post facto, after Docker, which itself was basically an LXC wrapper for a long time.
Source: I corresponded with two of the authors ~2009, was an early lxc user, and provided some security fixes for early docker.
I don't believe so. "Engineers at Google (primarily Paul Menage and Rohit Seth) started the work on this feature in 2006 under the name "process containers"." [1] Here's the kernel patch from Google, from 2006, that was merged into the Linux kernel [2]. Google used this tech in their Borg cluster manager, which way pre-dates Docker/Kubernetes/etc. I'm sure other folks jumped in and helped out after the fact, but this was created at Google.
[1] https://en.wikipedia.org/wiki/Cgroups
[2] https://lwn.net/Articles/199643/
That's interesting. OpenVZ (CC'd on the patch) and FreeBSD jails also pre-date mainstream containers on Linux. However, having been using Linux since the mid 1990s, I feel confident in saying that public use and development of container-like technologies on mainstream Linux definitely dates from circa the LXC userspace and not before. This was ~2009-2010.
There's also Linux-VServer (2001), which used to be included in Debian and is still in use today.
But if you're talking about system virtualization, IBM's System/370 had that way before FreeBSD invented jails... and speaking of jails, even chroot on BSD came before that. But he is talking about cgroups.
If you're talking about what I think you're talking about, IBM's VM is a hypervisor, which is a world away from cgroups, chroot, jails, or, say, the Java VM, before anyone brings _that_ up.
Yeah, I think you're right about the BSD jails point. This totally pre-dates the Linux piece. I was only chatting about the Linux piece.
BSD jails date back to FreeBSD 4, which was released in 2000.
Edit: I started questioning my memory of this, so I poked around some. Poul-Henning Kamp did the initial work in April 1999; here is the commit:
https://svnweb.freebsd.org/base?view=revision&revision=46155
More details from him about it:
http://phk.freebsd.dk/sagas/jails/
Awesome, thanks _jal!
I am pretty sure that OpenVZ also predates the Google work in a Linux context.
Basically, saying that 'containers on Linux' were 'made at Google' is not true.
But the OP doesn't state that containers on Linux were invented at Google. It says cgroups were.
In the mid-90s I ran some VM system for MS-DOS. It added threading and was used for phone systems.
Indeed from the OpenVZ wiki's history section:
>"Nov 1999: Alexander Tormasov visited Singapore and proposed a new direction to Sergey Beloussov: container virtualization. He formulated three main components: containers as a set of processes with namespace isolation, file system to share code/ram and isolation in resources.
Indeed it was 1999 when our engineers started adding bits and pieces of containers technology to Linux kernel 2.2. Well, not exactly "containers", but rather "virtual environments" at that time -- as it often happens with new technologies, the terminology was different (the term "container" was coined by Sun only five years later, in 2004)."[1]
[1] https://wiki.openvz.org/History
Interesting Linux history. Do you by any chance have a link to the commit?
I picked a file from the patch index mail linked above, mm_inline.h, and went scrolling through its history at
https://github.com/torvalds/linux/commits/master/include/lin...
but didn't see a corresponding change there. I guess the patches might have gotten refactored before merging or something, but it would be nice to have a pointer in the Linux tree history that would work as a reference.
edit: tried looking for linux/container.h too, but that just brings up a newer ACPI-related container.h:
https://github.com/torvalds/linux/commits/master/include/lin...
edit 2: history for linux/cgroup.c starts a year later, in 2007, with this commit:
https://github.com/torvalds/linux/commit/ddbcc7e8e50aefe467c...
- it has mentions of people from OpenVZ, IBM and a bunch of unaffiliated domains in addition to the signed-off-by line from a Googler (Paul Menage).
I think the patch never made it through to mainline, and the work on the cgroups framework started with this patch.
It basically says:
    Based originally on the cpuset system, extracted by Paul Menage
    Copyright (C) 2006 Google, Inc

    Copyright notices from the original cpuset code:
    --------------------------------------------------
    Copyright (C) 2003 BULL SA.
    Copyright (C) 2004-2006 Silicon Graphics, Inc.

    Portions derived from Patrick Mochel's sysfs code.
    sysfs is Copyright (c) 2001-3 Patrick Mochel

    2003-10-10 Written by Simon Derr.
    2003-10-22 Updates by Stephen Hemminger.
    2004 May-July Rework by Paul Jackson.
just wow how old this is.
I believe this is the original whitepaper from 2007.
"Adding Generic Process Containers to the Linux Kernel"
https://www.kernel.org/doc/ols/2007/ols2007v2-pages-45-58.pd...
Although the paper was written by Paul B. Menage, it looks like Balbir Singh and Srivatsa Vaddagiri from IBM also made contributions to it.
Kubernetes is just another layer on top of Docker. It can use other container execution environments, but Kubernetes itself doesn't manage the container runtime; it just orchestrates containers at a higher level. If you're running Kubernetes, you're almost always still running Docker.
Yeah, funny how it all worked out. cgroups/namespaces powered simple LXC containers. Docker comes along and makes a nice workflow and package management system (wrapping cgroups/namespaces). Kubernetes comes along and makes a nice workflow/cluster management layer (wrapping Docker). Cloud providers come along and make a nice Kubernetes management layer (wrapping Kubernetes). Pretty crazy to see the evolution over the past few years.
This tech has completely changed the sysadmin landscape/job descriptions and sort of threw tons of gas on the whole devops movement.
Disclaimer: I worked at both Docker & Google, although not on this tech specifically. Opinions are my own here.
_"threw tons of gas on the whole devops movement"_
I'm happy to find that I agree with you on at least one area of this new ecosystem :)
This is literally my job and I kind of hate how unnecessarily complex it all is. Not in the sense of "distributed systems are hard", but that we don't actually need the onion to have 20 layers, we really only need 2 or 3.
The reason we need an orchestration system for our orchestration system, is the orchestration system is a snowflake. We need an orchestration system because the container system is a snowflake. We need a container system because containers are a snowflake. None of it is really standard or easy to implement, because none of it was designed in the spirit of Unix. It was all just some random companies who threw some crap together so they could start making money selling ads or hosting services. (I don't mean cgroups, I mean the tools that use them, although cgroups are kind of warty in themselves)
All most people need to do is run a couple processes on a couple hosts, send packets between services (and networks), and have something start/stop/restart those processes. You can have all of that with, say, systemd (ugh), or three different standard interfaces that any program can use with standard libraries/ABIs. Notice I didn't say "with 5 different daemons running on 3 hosts that need complex configuration, constant maintenance, and a migration effort every 3 months".
The Open Container Initiative seems close to getting the first thing done. If we get the other two standardized, we can lose a whole bunch of the onion layers. Consequently a lot of people will need to find a new way to make money, because there'll be not nearly as much need to pay people to deal with the onion.
The way you described it, one has to wonder what the tech landscape would look like had Plan 9 received adoption instead of Linux, so you wouldn't have had to invent all those layers of complexity to manage services on multiple hosts. Just mount the network interfaces, CPU, and RAM, and start multiple processes...
Nah, we'd still find ways to make it complex. I mean, yes, I think Plan 9 would've been a nicer base to work from and maybe reduced it a bit, but people are _excellent_ at making things more complex, sometimes for legitimate reasons. Off the top of my head, I don't _think_ Plan 9 has a good "orchestration" story itself (ex. "run 17 instances of this process wherever there's enough free resources in our 100-server cluster"), and while the "Plan 9 k8s" of this alternate universe would be simpler than the k8s we have in reality, it would still exist and add one or more layers of abstraction, and then people would stick management layers in front of that, and...
> I don't think Plan 9 has a good "orchestration" story itself
I'm not super familiar with plan 9, but given what I do know, I'm pretty sure that could be done with an rc script. I'm pretty sure that a) you can fork processes after starting them on remote systems, and b) you can query remote systems to figure out which ones have free resources. Given that, it's just a shell / rc script away. Then you just need another script to do the management. ;)
> All most people need to do is run a couple processes on a couple hosts … You can have all of that with, say, systemd (ugh)
Why ugh? I do have all of that with systemd, and it’s great. Rather than running each service in a separate filesystem and requiring them to communicate over a virtual network, I use ProtectSystem and ProtectHome to keep services as isolated as they need to be. Rather than creating a network bridge and doing NAT inside my server, if services require network isolation, I use NetworkNamespacePath to assign them to namespaces that contain different VLAN interfaces, which are configured in my router just like physical interfaces. Rather than building a huge container image for every application, I use my OS’s package manager to install apps and manage their dependencies. The skills required to do this can be applied to any systemd-based Linux distribution, and are needed to properly understand and troubleshoot systemd-based Linux distributions running in containers anyway. I’d need some new tools (eg. Ansible) to scale this setup past one or two hosts, but I’m baffled by the people setting up complex Docker/Kubernetes systems just to self-host a few web apps.
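For anyone who hasn't run into those directives, here is a rough sketch of a unit file along those lines. The service name, binary, and namespace path are made up for illustration, and this is only a sketch; see systemd.exec(5) for the exact semantics of each option.

    [Unit]
    Description=Illustrative self-hosted web app

    [Service]
    ExecStart=/usr/bin/mywebapp --port 8080
    # Mount most of the file system read-only for this service.
    ProtectSystem=strict
    # Make /home, /root and /run/user appear empty to the service.
    ProtectHome=true
    # Join a pre-created network namespace (e.g. set up with "ip netns add")
    # that only contains the VLAN interface this app is allowed to use.
    NetworkNamespacePath=/run/netns/webapp-vlan

    [Install]
    WantedBy=multi-user.target

No bridges, no NAT, no image builds: the distribution package provides the binary, and systemd provides the isolation.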
I'd hazard a guess that the 'ugh' has to do with the general assessment that systemd falls prey to the same flaws the GP identified in the onion layers of orchestration and containerization systems: it's unnecessarily complicated.
Nitpick... saying "LXC containers" is the equivalent of saying "the HIV virus" or, more recently, "the SARS-CoV-2 virus".
Nobody cares.
> If you're running Kubernetes, you're almost always still running Docker.
Red Hat is trying to base all their stuff on Podman, which is Docker rewritten to be a standalone tool instead of a client/server (but it can be renamed and used as a drop-in Docker replacement with the same command API). If they have their way, that statement will no longer be quite as correct.
Podman is meant as a replacement for the docker command as it's typically used by developers, and for bespoke services that can use systemd for lifecycle management. Podman is an ergonomic runc wrapper.
Red Hat's answer to capital-D Docker is CRI-O.
To add to this, cri-o and libpod/podman are increasingly sharing code.
These videos were absolutely amazing. Thank you.
Fantastic videos! Thanks for the links!
Yes, Kubernetes uses cgroups too.
Systemd units are also based on cgroups.
Dude, those videos are awesome!
Thank you.
Man your site is GREAT. Super clear videos. Good stuff.
This is Part I of a series.
Part II: https://www.redhat.com/sysadmin/cgroups-part-two
Part III: https://www.redhat.com/sysadmin/cgroups-part-three
Part IV: https://www.redhat.com/sysadmin/cgroups-part-four
These things should strive for accuracy and this article is not accurate. The cgroups facility does not control "the number of CPU shares per process." Although you can put such a thing into effect with control groups, it's more accurate to say that a control group limits the resources of a set of tasks. Those tasks may be from one or several processes, and it's also the case that a single process can divide its own tasks into several cgroups.
While I agree that we should strive for accuracy, you're really splitting hairs there. Plus the author puts a disclaimer in there "NOTE: This is a gross simplification and is NOT technically accurate should you be looking to get involved with the underlying cgroup code. The above explanation is meant for clarity of understanding."
But the explanation does _not_ contribute to clarity of understanding. This has nothing to do with hacking on cgroups itself. As a user of cgroups you should know that one group may control multiple processes, and one process may be controlled by multiple groups! There certainly is not a 1:1 correspondence.
It’s not an internal detail, it’s fundamental to understanding it from a user perspective.
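To make that distinction concrete, here's a minimal cgroup-v1 sketch in Python. It assumes a v1 "cpu" hierarchy mounted at /sys/fs/cgroup/cpu, root privileges, and Python 3.8+; the group name and share value are arbitrary. The point is that membership is tracked per task: writing a PID to cgroup.procs moves a whole process, while writing a thread ID to tasks moves just that one thread.

    import os
    import threading

    cg = "/sys/fs/cgroup/cpu/demo"   # hypothetical group under the v1 cpu hierarchy
    os.makedirs(cg, exist_ok=True)

    # Relative CPU weight for everything placed in this group (the default is 1024).
    with open(os.path.join(cg, "cpu.shares"), "w") as f:
        f.write("256")

    # Move the whole current process (all of its threads) into the group...
    with open(os.path.join(cg, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))

    # ...or move only the calling thread (a single "task"), leaving any sibling
    # threads of the same process wherever they already were.
    with open(os.path.join(cg, "tasks"), "w") as f:
        f.write(str(threading.get_native_id()))

So "CPU shares per process" is a reasonable first approximation, but the kernel really accounts per task, and a multi-threaded process can straddle several groups.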
Hey Webster, what's a task?
Ah yes, "cgroups", according to notable no-nonsense kernel hacker Al Viro:
"it's not just badly written kernel/cgroup.c - the interfaces on _both_ sides (userland and the rest of kernel) are seriously misdesigned. As far as I'm concerned, configuring it out solves my problem nicely."
That was in 2011, so things might have improved. What remains, however, is that cgroups were added to the kernel, by Googlers, for easier maintenance, but with an implicit understanding that no sane person would actually make use of them to do something important.
... enter SystemD.
The whole of Google is built on cgroups v1, which seems to stand as an existence proof that these interfaces are not useless.
"bad design" and "useful" can be overlapping groups
so when someone is using Gmail on a macOS system, how are they using cgroups in any way? how is "the whole google" built on this? perhaps you mean "Google's web-facing servers use cgroups extensively"?
That is a somewhat weird and, dare I say, pointless take. Thousands of machines are involved in serving every request from a web browser to Gmail. Of those, only the web browser might not be in Linux cgroups.
I wrote a blog post on cgroups a couple years ago that's still accurate and goes further into depth and gives workable examples of using them both inside and outside of containers:
https://www.grant.pizza/blog/understanding-cgroups/
Found an interesting ongoing blog post series that also tries to explain all the other low-level kernel mechanisms that make up Docker and other container technologies on Linux. They haven't reached the topic of control groups yet, though.
https://www.schutzwerk.com/en/43/posts/linux_container_intro...
Still a work in progress, and the topics are not too in-depth... but it's interesting, thanks.
systemd uses cgroups so nearly every Linux box is using them.
so great info
Conspiracy theory: our Kubernetes clusters-as-a-service are not really running on VMs.