1 Virtual memory

As discussed on its own page, virtual memory is an almost-universally-implemented isolation tool that causes each process to have its own address space. Coupled with a few related techniques for separating register values between processes, this means that each process’s local variables, data structures, and running code are isolated from those of other processes.

Virtual memory does not require total isolation. Its core operation is mapping between virtual and physical addresses, with a separate OS-managed map, called a page table, for each process. The OS can put the same physical address in multiple processes’ page tables, creating shared memory.
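As a toy model of this mapping, the following sketch uses one Python dictionary per process as a stand-in for its page table. The 4 KiB page size and all page and frame numbers are assumptions made up for illustration; real page tables are hardware-defined structures, not dictionaries.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical per-process page tables: virtual page number -> physical frame number.
page_table_a = {0: 7, 1: 3}
page_table_b = {0: 9, 5: 3}   # frame 3 appears in both tables: shared memory


def translate(page_table, virtual_address):
    """Return the physical address for virtual_address, or raise if unmapped."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault: virtual page {page} is not mapped")
    return page_table[page] * PAGE_SIZE + offset
```

In this model, virtual address 4112 in process A and virtual address 20496 in process B both translate to physical address 12304, while virtual address 0 translates to a different physical address in each process.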

Read-only shared memory is commonly used to allow many applications to use the same code for common library functions like malloc and printf without requiring that many copies of that code be present in memory. This works very well if all the processes want the same version of that code, but if multiple versions exist then picking the right one for each process becomes a challenge. Files containing code designed to be shared in this way are called shared object files (.so) in most operating systems except those created by Microsoft, where they are called dynamic-link library files (.dll) instead. Getting the right versions of these shared pieces of code linked together often just works, but when it does not, it can be quite challenging to resolve.

Read/write shared memory is sometimes also used to communicate between processes. Setting up such memory requires special system calls and some way for the communicating processes to tell the OS which other processes they want to share the memory with; multiple approaches to resolving these challenges exist and are all outside the scope of this class.

2 Permissions

In common operation, each process has its own virtual memory, but all processes share the same file system. A limited degree of isolation is supported by the file system itself using a facility known as permissions. Several variants of the permission system exist, but the Unix model has become the most widespread and is what this section discusses.

The operating system keeps track of a set of users and a set of groups. Each process runs as one user and one group, and each file is owned by one user and one group. Each file also has a nine-bit permission, treated as nine separate Boolean flags. Three of these flags specify what processes belonging to the user may do with the file; three what processes belonging to the group may do with the file; and three what other processes may do with the file. For each, the three flags are r, w, and x, with the following meanings:

Flag	On a file	On a directory
`r`	read the contents of the file, viewing its bytes	read the contents of the directory, listing the files and directories it contains
`w`	write the contents of the file (i.e. change its bytes)	write the contents of the directory, adding or removing files or directories from it
`x`	execute the file, running it as a program	traverse the directory, gaining access to the files and directories it contains

There are some nuances about these permissions to note:

directory w enables modifying the contents of the directory, but the contents are only visible if the directory also has x permission; thus directory w is meaningless unless paired with x
file x enables running the file, but a script can only be run if the interpreter that runs it can also read it; thus file x is almost always paired with r
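The way the nine flags are consulted can be sketched as a small function. This is a simplified model rather than any kernel's actual code: the function name and parameters are hypothetical, and it ignores the super user and other refinements. It assumes the usual Unix rule that exactly one of the three flag groups is consulted, chosen by whether the process matches the file's user, then its group.

```python
def allowed(mode, file_uid, file_gid, proc_uid, proc_gids, want):
    """Return True if a process may perform `want` ('r', 'w', or 'x').

    mode is the nine-bit permission as an integer, e.g. 0o754.
    """
    if proc_uid == file_uid:
        shift = 6          # user flags are the top three bits
    elif file_gid in proc_gids:
        shift = 3          # group flags are the middle three bits
    else:
        shift = 0          # other flags are the bottom three bits
    bit = {"r": 4, "w": 2, "x": 1}[want]
    return bool((mode >> shift) & bit)
```

For example, with mode 0o754 the owning user may write, a process in the owning group may execute but not write, and any other process may only read.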

Directory r vs x

If you have r permissions on directory f/ but not x permissions, ls f will list the files and directories inside f, but cd f or ls f/a will fail with a permission denied error.

If you have x permissions on directory f/ but not r permissions, ls f will fail with a permission denied error, but cd f and ls f/a will succeed (assuming f/a exists).

Separating these two permissions seems strange and some non-Unix-like file systems merge the two into a single permission.

Creating vs modifying

Setting the contents of a file needs w permissions on that file.

Creating a file is adding something to the directory that contains it, and thus needs w permissions on that directory, not that file.

If I have r permissions on a file and w permissions on its directory, I can (1) create a new file, (2) copy the file’s contents into the new file, (3) remove the old file, and (4) rename the new file to have the same name as the old file, thus simulating w permission on the file. However, this is not quite the same as writing to the file: if some other process had the old file open when we did this 4-step process, that process would not see any of this happen. Removing a file does not invalidate open file handles against it, and the operating system doesn’t actually reclaim the storage space used by the file until all such file handles are closed.
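The 4-step process and its effect on an already-open handle can be sketched as follows, assuming a POSIX-like system. The function names, the temporary file name, and the use of an `edit` callback to change the copied contents are all choices made for this illustration.

```python
import os
import tempfile


def replace_by_copy(path, edit):
    """Simulate writing to `path` using only w permission on its directory."""
    directory = os.path.dirname(path)
    tmp_path = os.path.join(directory, "replacement.tmp")  # hypothetical name
    with open(path) as old_file, open(tmp_path, "w") as new_file:  # (1) create
        new_file.write(edit(old_file.read()))  # (2) copy the contents, edited
    os.remove(path)                            # (3) remove the old file
    os.rename(tmp_path, path)                  # (4) give the new file the old name


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "notes.txt")
        with open(path, "w") as f:
            f.write("old")
        with open(path) as old_handle:         # a reader already has the file open
            replace_by_copy(path, str.upper)
            seen_by_old_handle = old_handle.read()
        with open(path) as new_handle:         # opened after the replacement
            seen_by_new_open = new_handle.read()
    return seen_by_old_handle, seen_by_new_open
```

Calling demo() should show the handle opened before the replacement still reading the old contents, while a fresh open of the same name reads the new ones.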

Traditionally, these permissions are ordered [user, group, other] with each being ordered [read, write, execute]. They are sometimes presented as a bitvector in octal (base-8) and sometimes as a letter for present permissions and a hyphen for missing permissions.

Both rwxr-xr-- and 0754 refer to the same permission set:

The user may read, write, and execute
The group may read and execute but not write
Others may only read, not write or execute
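The correspondence between the two notations can be sketched with a pair of small conversion functions. The function names are hypothetical, and the sketch handles only the nine permission flags described here.

```python
def to_letters(mode):
    """Convert a nine-bit permission such as 0o754 to a string such as 'rwxr-xr--'."""
    letters = ""
    for shift in (6, 3, 0):                 # user, group, other
        triplet = (mode >> shift) & 0o7
        for bit, letter in ((4, "r"), (2, "w"), (1, "x")):
            letters += letter if triplet & bit else "-"
    return letters


def to_octal(letters):
    """Convert a string such as 'rwxr-xr--' back to an integer such as 0o754."""
    mode = 0
    for character in letters:
        mode = (mode << 1) | (character != "-")   # one bit per letter or hyphen
    return mode
```

Each octal digit is one [read, write, execute] triplet: 7 is rwx (4+2+1), 5 is r-x (4+1), and 4 is r-- (4).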

These file permissions, coupled with various permissions for changing a process’s user and group, create fairly coarse-grained but quite reliable isolation of parts of a file system. For example, the /usr directory and the directories and files inside it (including most installed programs and libraries) are typically owned by root with permissions rwxr-xr-x, meaning only root can change this part of the file system (i.e. install or uninstall programs) but everyone can list and run those programs.

The Super User

Most operating systems that handle users and permissions in any way also flag one or a few user accounts as super users. In Unix-derived OSes the only super-user account is called root. In some other OSes any user account can be a super-user account by being marked as an administrator account. A super-user account can ignore most or all permissions, doing things that other accounts cannot.

On the course VM, we give each of you a non-root account but also give you permissions to run a special program sudo, short for super-user do, which will launch processes as root instead of as you. This gives you near-total control over the system, letting you bypass most permissions, but also requires that you explicitly set out to do so by typing sudo in front of commands that you want to violate normal permissions, hopefully preventing you from accidentally doing something you’ll regret.

3 Chroot and friends

User accounts and file system permissions can be seen as a way to limit certain system calls, notably those handling files, to provide more isolation between applications than virtual memory alone provides. This idea of adding more constraints to specific system calls to add more isolation between processes in how those calls are handled can be extended in various ways.

One very successful example of modifying just a few system calls to provide much more isolation is the chroot system call. The goal of this call is to let specific processes have much more limited file access than the usual permission system normally allows, limiting all of the process’s activities to just a single directory and its subdirectories. It does this by changing what the root of the file system (i.e. directory /) is for that process.

After running chroot("/tmp/jail"), a process can only access files in the "/tmp/jail" directory tree. If it tries to fopen("/usr/bin/python3", "r") it will instead get what all other processes call "/tmp/jail/usr/bin/python3".
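The renaming in this example can be modeled as a pure path computation. This sketch does not call chroot itself (which needs special privileges); the function name is hypothetical, and it assumes relative paths start at the jail's root and that no symbolic links are involved.

```python
import posixpath


def host_path(jail_root, path_in_jail):
    """Return the path other processes would use for a path named inside the jail."""
    # Normalize inside the jail first, so ".." cannot climb above the jail's "/".
    inside = posixpath.normpath("/" + path_in_jail.lstrip("/"))
    return posixpath.normpath(posixpath.join(jail_root, inside.lstrip("/")))
```

In this model, host_path("/tmp/jail", "/usr/bin/python3") gives "/tmp/jail/usr/bin/python3", and a path like "/../../etc/passwd" still lands inside the jail, at "/tmp/jail/etc/passwd".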

Most isolation techniques need some way to make exceptions; chroot can do this using hard links, single files that appear at multiple paths within the file system. Other approaches to sharing some parts of a file system within a chroot jail have also been added to more recent directory isolation tools.
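A hard link can be sketched as follows, using a temporary directory in place of a real jail. The directory and file names are made up for illustration, and the sketch assumes a file system that supports hard links.

```python
import os
import tempfile


def hard_link_demo():
    with tempfile.TemporaryDirectory() as directory:
        outside = os.path.join(directory, "shared.txt")
        jail = os.path.join(directory, "jail")   # stands in for a chroot jail
        os.mkdir(jail)
        inside = os.path.join(jail, "shared.txt")

        with open(outside, "w") as f:
            f.write("hello")
        os.link(outside, inside)                 # second path for the same file

        same_file = os.stat(outside).st_ino == os.stat(inside).st_ino
        link_count = os.stat(outside).st_nlink   # number of paths to the file
        with open(inside) as f:
            contents = f.read()
    return same_file, link_count, contents
```

Because both paths name the same underlying file, data written through the path outside the jail directory is readable through the path inside it.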

While chroot is one of the most popular of these techniques, it is not the only one. The ability to open sockets can be disabled or replaced by some non-socket stub; the amount of CPU time or memory that can be accessed can be limited; and so on. Adding a new isolation option to a system call requires changing the system call code in the operating system kernel, which is a nontrivial process, so the set of isolations is somewhat limited, but as needs for new isolation options are recognized operating systems tend to respond by adding new options.
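As one illustration of limiting CPU time, the sketch below uses the Unix resource-limit facility as exposed by Python's resource module. The helper names are hypothetical, and it assumes a Unix-like system where a process that exceeds its soft CPU limit is sent a signal that terminates it by default.

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(code, seconds):
    """Run Python source `code` in a child process limited to `seconds` of CPU time."""
    def limit_cpu():
        # Runs in the child just before the program starts; keeps the hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, hard))

    finished = subprocess.run([sys.executable, "-c", code],
                              preexec_fn=limit_cpu, timeout=60)
    return finished.returncode


def exceeded_cpu_limit(returncode):
    """A negative return code means the child was killed by that signal number."""
    return returncode in (-signal.SIGXCPU, -signal.SIGKILL)
```

A child running an endless loop with a one-second limit should be stopped by the operating system, while a child that finishes quickly is unaffected.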

4 Containers

The word container is used for many different kinds of isolation tools, but the most common meaning is for OS-level virtualization, which basically means a streamlined system for using a combination of all of the chroot-like isolated system call tools to create something that almost looks like an entirely new computer. The best-known container platform as of 2024 is Docker but there are many other container platforms with similar feature sets, each of which is tailored to appeal to a slightly different set of use cases.

Common to many of these container tools are the following:

Many containers can be opened on the same machine.
Most system calls are isolated per container, to the degree that most code can’t tell if it is running in a container or on a physical machine.
The container tool supports sharing components between containers (but generally not between the host machine and the containers) such that if a dozen containers all have the same read-only file, fewer than a dozen copies need to reside on the machine.
Instead of a file system living on a disk, each container’s file system lives in a file or directory tree called an image.
Images contain an entry point, some program that is started when the image is run.
The container tool supports sharing images, uploading them to and downloading them from various websites.

Among the many things that differ between container tools are:

How isolated the containers are. Can they put pixels on a screen or only send text to a terminal? Can they request permission to access the host OS’s file system or is that entirely hidden from them? Can they open TCP ports directly, or only through some kind of port renaming managed by the container setup, or not at all? And so on.
What OSes the tools work on. Part of Docker’s success is its effort to support Linux, Windows, and macOS. Part of what it gives up by doing that is access to advanced features that only some of those OSes support isolating.
How other processes can interact with running containers. Docker uses an on-machine client-server socket-based communication model. Podman is similar to Docker in many ways, but uses a Linux-specific superprocess/subprocess model instead. And so on.
What permissions are needed to create or run a container. Can regular users create images? Start containers? Connect to running containers and interact with them? If not, which users can and how are they limited?

As of 2024, I would characterize the container space as mature enough to use in production but still exploring and expanding, with new options added frequently; it has not yet reached the contracting and standardizing stage.

Kubernetes is a widely deployed tool for managing a large number of containers, possibly running on many different computers, and distributing work between them. Kubernetes also has other features and is out of scope for this class, but the name is often used in Container as a Service advertising and purchasing.

5 Virtual machines

Containers operate by isolating the behavior of specific system calls. Virtual machines operate by isolating every machine code instruction that would engage the operating system in the first place: system calls, failed virtual memory lookups, divide-by-zero exceptions, and so on. Normally, these events each cause a special function called a handler in the operating system to run. In virtualized mode, they are instead routed to a separate virtual OS’s handler, allowing a fully-fledged guest OS to be installed as a virtual machine running inside a host OS.

The host OS usually intervenes by pretending to be the various peripherals that a computer uses to connect with the world: screens and keyboards and mice and networks and disks and so on. The guest OS sees itself as if it were running on its own machine, but when it connects to anything outside the processor and memory, the host OS has the ability to see that connection and handle it however it wishes: forwarding it to actual hardware, translating it into some other kind of operation, ignoring it entirely, etc.

A number of virtual machines can share the same hardware, allowing fuller utilization of hardware resources. A single virtual machine can be transferred from one piece of hardware to another, allowing easier replacement of failing hardware and upgrades to new hardware. For these reasons, if you rent a computer or server that is not physically located in your building it is likely that what you are actually getting is a virtual machine.

6 Emulators

The most complete way to isolate a process from others is to not actually run it at all, instead running a process that pretends to be a computer, parsing the machine code and updating process state to emulate computer state. Emulators let me run code compiled for x86-64 on an Arm chip or vice versa, or even run code compiled for x86-64 on an x86-64 chip without actually running it, instead running different instructions with side effects, behaviors, and limitations selected by the emulator designer.
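To make the idea concrete, here is a minimal emulator for a made-up ISA. The opcodes, the 3-byte instruction format, the four 8-bit registers, and the step limit are all invented for this sketch and do not correspond to any real processor.

```python
HALT, LOADI, ADD, SUBI, JNZ = range(5)   # opcodes of a made-up ISA


def emulate(machine_code, max_steps=10_000):
    """Interpret 3-byte instructions (opcode, a, b); return the register values."""
    registers = [0, 0, 0, 0]
    pc = 0                                   # emulated program counter
    for _ in range(max_steps):               # a limitation chosen by the emulator
        opcode, a, b = machine_code[pc:pc + 3]
        pc += 3
        if opcode == HALT:
            return registers
        elif opcode == LOADI:                # register a = constant b
            registers[a] = b
        elif opcode == ADD:                  # register a += register b
            registers[a] = (registers[a] + registers[b]) & 0xFF
        elif opcode == SUBI:                 # register a -= constant b
            registers[a] = (registers[a] - b) & 0xFF
        elif opcode == JNZ:                  # jump to byte b if register a != 0
            if registers[a] != 0:
                pc = b
        else:
            raise ValueError(f"illegal instruction {opcode}")
    raise RuntimeError("step limit reached")


program = bytes([
    LOADI, 0, 0,    # byte 0:  r0 = 0   (running total)
    LOADI, 1, 5,    # byte 3:  r1 = 5   (counter)
    ADD,   0, 1,    # byte 6:  r0 += r1
    SUBI,  1, 1,    # byte 9:  r1 -= 1
    JNZ,   1, 6,    # byte 12: if r1 != 0, jump back to byte 6
    HALT,  0, 0,    # byte 15: stop
])
```

The emulated program never runs on the real processor: emulate(program) only updates a Python list standing in for registers, and the step limit is an example of a limitation the emulator designer can impose on the emulated code.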

Emulators can be incorporated into a full virtual machine system as part of an isolation system, but they can also be designed to translate one program into the most similar program for another ISA and run it directly, providing no isolation at all. I include them on this page not because they are always, or even often, used for strong isolation but because some other isolation tools, notably virtual machines, are often configured to use emulation if the ISA of the virtual machine and the ISA of the host machine are not the same.