In more than 70 years of software design, we have found only four useful ways to store data.
You learned about reference storage (via pointers or linked data structures) in CS 225 and earlier courses. Let’s focus on the other types.
Most digital computers use something that has two distinguishable states. That might be positive vs negative magnetic polarity, or high vs low voltage, or smooth vs rough optical surface, or bright vs dark light, or any number of other distinguishable pairs.
These two values can be mapped, via enumeration, to any 2-value set, most commonly either {False, True} or {0, 1}.
A sequence of n values from a b-value set has b^n possible values. A convenient way to structure those is using place-value numbers: the numeric value of a sequence of digits d(n−1) … d(1) d(0) is d(n−1)·b^(n−1) + ⋯ + d(1)·b + d(0). You’ve likely been using this model with the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} (which has b = 10) since you were a child: it’s currently the dominant number system in almost every nation worldwide.
Computers have ready access to a set of bits, which can be treated as meaning {0, 1}, and thus place-value numbers with b = 2 are dominant in computers. Everything you know about base-ten (decimal) numbers also applies to base-two (binary) numbers, provided you replace any part of a decimal algorithm that references ten with a reference to two instead.
You’ve likely learned to handle negative numbers using an explicit sign, with computation algorithms that have different cases for different signs. But that’s not necessary if we have a fixed number of digits.
The decimal number −1 is defined to be the number which, when added to 1, gives 0. If we only have six digits, that number is 999999: no negative sign needed. The 7-digit sum 999999 + 1 = 1000000 which, when truncated to 6 digits, is 000000.
For example, the eight-bit binary number 00000011’s 8-bit negation is 11111101. Finding this is as simple as flipping all bits, then adding 1; or equivalently leaving all low-order 0s and the lowest-order 1 unchanged and flipping everything else. This approach to storing negative binary integers is called two’s complement.
Left out of this definition is how we tell if a given binary number like 1101 is supposed to be a positive number (13) or a negative number (−3). The tradition is to base this on the highest-order bit in the number representation (1 for negative, 0 for non-negative); if 1101 is stored in a 4-bit slot it’s negative, but if 00001101 is stored in an 8-bit slot it’s positive.
It is unusual to call a binary digit a digit; usually it is called a bit instead.
Many programming languages allow binary numbers to be written as literals by preceding them with `0b`, as in `0b101010100`. This notation was mostly added to languages in the 2010s and is uncommon in older code.
Humans have difficulty making sense of long sequences of digits, so we tend to group them. For example, many decimal numbers are written with groups of 3 digits and some kind of separator between groups. Each group has 1000 possible values, making the separated number a type of base-thousand place-value number with digits 000 through 999.
For binary, groups of 3 digits form a base-8 (octal) number; instead of writing those digits as 000 through 111 we write them 0 through 7. Groups of 4 digits are more common, forming base-16 (hexadecimal) numbers, where instead of writing 0000 through 1111 we use the digits {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F}. Both upper- and lower-case hex digits are common; 1A = 1a.
Many programming languages allow octal numbers to be written as literals by preceding them with `0`, as in `0524`; this notation is confusing, and `0o` is becoming more common in recent languages, as in `0o524`. Every programming language I’ve used allows hexadecimal numbers to be written as literals by preceding them with `0x`, as in `0x154`.
Groups of 3 bits are often called octal digits. Groups of 4 bits are called either hex digits or nibbles (there is no standard spelling; nybble, nyble, and nybl are all also used), with hex digit being more common when writing it down and nibble being more common when referring to it as part of data.
In practice it’s uncommon to use values as small as a bit or even a nibble; computers usually handle groups of 8 bits, called bytes or octets (octet is more precise, but byte is far more common). The resulting base-256 number system has no standard name and no common programming-language literal syntax.
Power-of-two numbers show up in many places in hardware and hardware-interfacing software. It is worth learning the vocabulary used to express them.
Value | base-10 | Suffix | Pronounced |
---|---|---|---|
2^10 | 1,024 | Ki | Kilo |
2^20 | 1,048,576 | Mi | Mega |
2^30 | 1,073,741,824 | Gi | Giga |
2^40 | 1,099,511,627,776 | Ti | Tera |
2^50 | 1,125,899,906,842,624 | Pi | Peta |
2^60 | 1,152,921,504,606,846,976 | Ei | Exa |
In all cases above, the i is sometimes dropped to just say (e.g.) G instead of Gi. The i clarifies that we mean powers of 2 like 1024, not powers of 10 like 1000. G could mean either 2^30 or 10^9, numbers that differ by about 7%; which one is meant can only be guessed from context unless the clarifying i is present.
Most software developers I know have memorized the powers of two between 2^0 and 2^10. This allows them to efficiently recognize and work with all powers of two using a trick to split larger powers into a suffix and a smaller power. For example, 2^27 = 2^7 · 2^20 = 128Mi. This pattern works for any power of 2: the 1’s digit of the exponent becomes a number, the 10’s digit becomes a suffix. Thus
Value | Split | Written |
---|---|---|
2^27 | 2^7 · 2^20 | 128Mi |
2^3 | 2^3 · 2^0 | 8 |
2^39 | 2^9 · 2^30 | 512Gi |
Logarithms with base 2 (written log2(n) or lg(n) or sometimes just log(n)) do the inverse of this: lg(64Gi) = lg(2^6 · 2^30) = lg(2^36) = 36.
Fill in the rest of the following table.
Exponent | Written As |
---|---|
17 | 128Ki |
3 | |
38 | |
11 | |
| 256Mi |
| 16Gi |
32 | |
Structures store several distinct values sequentially.
Most structures are understood statically, meaning that the type of value tells us everything we need to know about what values are stored in what order. For example,
```c
struct {
    int id;
    float income;
    char *name;
};
```
tells us there’s 4 bytes of `id`, then 4 bytes of `income`, then 8 bytes of pointer to `name`.
Some structures are actually discriminated unions: depending on what value you find first, you might expect different values after it. For example,
```c
struct {
    int kind;
    union {
        double number;
        char *string;
    };
};
```
always has a 4-byte `kind`, but after that might be either an 8-byte `number` or an 8-byte pointer to a `string`; we’d need to look at `kind` and compare it to some design to know which second value the structure has.
Arrays are lists stored sequentially. There are three common ways of indicating how many things are in an array.
Some arrays have a static length, meaning you know how many elements are in it based on some external knowledge, not based on the stored data itself. For example, the `int` datatype tells the compiler to look for a list of 32 bits (i.e. 4 bytes).
Some arrays have a stored length, meaning they are actually made of two parts: one part is a number indicating the length of the other part, which is the actual array.
Some arrays have a sentinel value or values: something in the array which indicates either “you’ve passed the end” or “this is the end.”
`char *`

In C, strings are stored as `char *`: a pointer to an array of bytes. The array’s length is indicated with a sentinel value at the end of the array; in particular, the byte 0.
The bytes in the array are themselves two-part structures, each structure being one byte in size. The two parts split at the highest-order 0 bit in the byte. The number of bytes in the array is indicated with both a length and a sentinel. The length is the number of bits in the first part of the first byte’s structure, or 1 if that part has no bits (the first structure will never have a one-bit first part). The sentinel is that every structure except the first has a one-bit first part, so if a structure without a one-bit first part is encountered, the array has ended.
The bits in the second parts encode a larger number; concatenate them to find that number, with the most-significant bits coming from the first struct.
The number is a code point in the very large enumerated value set called Unicode.
Each code point is part of an array of code points that collectively define one grapheme, meaning a visual component of typewritten language (this isn’t quite true: some code points describe other things like auditory bells or character deletions, but they are rarely used). These arrays are handled by sentinel values, where most code points mark the end of a list of code points contributing to a grapheme but a few don’t.
Most numbers we use in programming, including both signed and unsigned `short`, `int`, `long`, and `long long`, as well as `float` and `double`, have more bits than can fit in a single byte. We break those bits into bytes and put them in adjacent spots in memory, but there are two orders we could do that in.
Big-endian numbers put the big end of the number (the most-significant place value) first, meaning at the smallest address. This is similar to how decimal numbers are written in left-to-right languages like English and is the order used in most network standards and protocols.
Little-endian numbers put the little end of the number (the least-significant place value) first, meaning at the smallest address. This is similar to how decimal numbers are written in right-to-left languages like Arabic and is the order used in most desktop and notebook computer hardware.
Suppose I store `int date = 20240118` in memory. I’m running a little-endian computer, so it will be stored in little-endian byte order.
An `int` needs 4 bytes, so let’s store it in addresses `0xffff80` through `0xffff83`. First we convert to hexadecimal (20240118 == `0x134d6f6`), then break it into the bytes `01 34 d6 f6`, and then put those in memory little-end first:
Address | Value | Note |
---|---|---|
ffff80 | 0xf6 | lowest-order byte |
ffff81 | 0xd6 | |
ffff82 | 0x34 | |
ffff83 | 0x01 | highest-order byte |
It’s common to display memory either with small addresses at the bottom or with small addresses at the top; feel free to use either orientation without changing the table’s meaning.
Suppose I store `short date[3] = {0x2024, 0x1, 0x18};` in memory. I’m running a little-endian computer, so it will be stored in little-endian byte order, but the order of elements in an array is always smallest index at smallest address.

A `short` needs 2 bytes and there are three `short`s in the array, so let’s store it in addresses `0xffff80` through `0xffff85`.
Address | Value | Notes |
---|---|---|
ffff80 | 0x24 | Low-order byte of first value |
ffff81 | 0x20 | High-order byte of first value |
ffff82 | 0x01 | Low-order byte of second value |
ffff83 | 0x00 | High-order byte of second value |
ffff84 | 0x18 | Low-order byte of third value |
ffff85 | 0x00 | High-order byte of third value |
It’s common to display memory either with small addresses at the bottom or with small addresses at the top; feel free to use either orientation without changing the table’s meaning.
Suppose I send `short date[3] = {0x2024, 0x1, 0x18};` over a network. Networks assume big-endian byte order, but the order of elements in an array is always smallest index first. I would send the bytes in the following order: `20 24 00 01 00 18`.
Suppose each of the following `x`s is stored in memory starting at address `0x10000`. What is the fifth byte of each, that is, the byte stored at address `0x10004`?
Thing to store | Fifth byte |
---|---|
`uint64_t x = 0x0102030405060708uL` | 0x |
`int x[3] = {1,2,3}` | 0x |
`short x[5] = {1,2,3,4,5}` | 0x |
`short x[5] = {0x110,0x220,0x330,0x440,0x550}` | 0x |
`char x[8] = {1,2,3,4,5,6,7,8}` | 0x |
`struct { char a; char b; short c; int e; } x; x = {1, 2, 0x304, 0x5060708}` | 0x |
Both memory (RAM) and files (disk) present themselves to us as an array of bytes. In memory, we call these indices addresses or pointers, as follows:
The address of a byte in memory is the index of that byte in the array of bytes that is memory. (We’ll see later in the course that there are two different addresses, physical and virtual, but both fit this same definition.)
A pointer is a value we intend to use as the address of something.
The address of a multi-byte value comprised of multiple adjacent bytes is the address of its first byte, meaning the byte with the numerically-smallest address.
Pointers are themselves multi-byte numbers, and hence stored with endianness. The number of bytes per pointer is a fundamental design decision when making a new computer, and is sometimes referenced when describing the processor itself; for example, a 64-bit processor is one that stores 64-bit (8-byte) pointers. (Nuance: it is pointers, not addresses, that these numbers describe. I am writing this on a 64-bit computer, meaning it has 64-bit pointers. However, it has 48-bit addresses: the remaining 16 bits of the pointers are ignored by the processor.)
You’ve programmed with pointers extensively in previous classes, so we won’t add much more about them here.
Sometimes we store data that has to pass through some kind of involved function to retrieve the data we actually want. Three types of functions are particularly common: serialization, compression, and encryption.
It is common to store data in computer memory in a somewhat scattered way, with gaps where nothing is stored and cached values that can speed up future work mixed in with the core information we are storing. When sharing the data over a network or storing it in a file, it is common to extract just the essential information and bundle it up into a compact sequence of bytes. There are multiple names for such a sequence of bytes, but a serialization of the data is a relatively common one.
Functions that serialize data have many forms. Common objectives in their design include human readability, compactness, and speed of serializing and deserializing.
These objectives are generally in conflict with one another, meaning many functions are in use each with its own balance of these objectives.
Suppose my code contains a doubly-linked list of four 16-bit integers with pointers to both the head and tail node of the list. It is likely that the nodes are scattered around memory somewhat based on the order in which I allocated them and other data structures.
A human-readable serialization of this list might be the 20-byte string `[124, 128, 225, 340]`, or 5b 31 32 34 2c 20 31 32 38 2c 20 32 32 35 2c 20 33 34 30 5d.
A compact serialization of this list might be the 8 bytes of the values in a little-endian array, or 7c 00 80 00 e1 00 54 01.
A fast-to-run serialization of this list might move the nodes together but keep all ten 32-bit pointers (as well as the four 16-bit values) in the serialization, using at least 48 bytes.
Lossless compression functions apply an invertible function to data when storing it and the function’s inverse when reading the stored data. They pick the function with the hope that common values will result in smaller results than uncommon values will. The quality of lossless compression depends on how accurately it estimates what values will be common; values it expects will be uncommon tend to get bigger, not smaller, when compressed.
Suppose I expect most bytes to be small numbers like 0, 1, and 2, with only rare larger values. I might then compress a string of bytes into a string of bits as follows: for each byte, write as many `1` bits as the byte has significant bits, then a `0` bit, then the bits from the byte less significant than the most-significant 1 bit.

byte | bits |
---|---|
---|---|
00000000 | 0 |
00000001 | 10 |
00000010 | 1100 |
00000011 | 1101 |
00000100 | 111000 |
00000101 | 111001 |
… | … |
11111101 | 1111111101111101 |
11111110 | 1111111101111110 |
11111111 | 1111111101111111 |
If we’re correct and most bytes are 0 or 1, the resulting bit string could be less than a quarter as long as the uncompressed bytes. If we’re wrong and most bytes are larger numbers, this compression could double the length of the value.
There are also functions called lossy compression. Conceptually these are the composition of two functions: one changes the input to a different but similar value that will take less space to store, and the other is a lossless compression function. Similarity is tricky to define, and most successful lossy compression functions are each limited to a single domain (such as audio or video) where research has established a good model of similarity.
Encryption functions are invertible functions that take an input byte sequence and produce an output byte sequence that generally has the same or similar size but that looks like a nonsensical string of random bits. Only people who know the secret of how that particular encryption function worked can invert the function and get back meaningful data.
The most common encryption functions are not themselves secret; instead, they have a special parameter (called a key) that is secret and cannot be readily guessed: that is, it is long enough that brute-force trying all combinations takes too long to be practical, and the function uses it in a way that makes deriving the key from the function and the encrypted output infeasible.