In more than 70 years of software design, we have found only four useful ways to store data.
You learned about reference storage (via pointers or linked data structures) in CS 225 and earlier courses. Let’s focus on the other types.
Most digital computers use something that has two distinguishable states. That might be positive vs negative magnetic polarity, or high vs low voltage, or smooth vs rough optical surface, or bright vs dark light, or any number of other distinguishable pairs.
These two values can be mapped, via enumeration, to any 2-value set, most commonly either {False, True} or {0, 1}.
A sequence of n values from a b-value set has b^n possible values. A convenient way to structure those is using place-value numbers: the numeric value of a sequence of digits d(n−1) … d(1) d(0) is d(n−1)·b^(n−1) + ⋯ + d(1)·b + d(0). You’ve likely been using this model with the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} (which has b = 10) since you were a child: it’s currently the dominant number system in almost every nation worldwide.
Computers have ready access to a set of bits, which can be treated as meaning {0, 1}, and thus place-value numbers with b = 2 are dominant in computers. Everything you know about base-ten (decimal) numbers also applies to base-two (binary) numbers, provided you replace any part of a decimal algorithm that references ten with a reference to two instead.
You’ve likely learned to handle negative numbers using an explicit sign, with computation algorithms that have different cases for different signs. But that’s not necessary if we have a fixed number of digits.
The decimal number −1 is defined to be the number which, when added to 1, gives 0. If we only have six digits, that number is 999999: no negative sign needed. The 7-digit sum 999999 + 1 = 1000000 which, when truncated to 6 digits, is 000000.
For example, the eight-bit binary number 00000011’s 8-bit negation is 11111101. Finding this is as simple as flipping all bits, then adding 1; or equivalently leaving all low-order 0s and the lowest-order 1 unchanged and flipping everything else. This approach to storing negative binary integers is called two’s complement.
Left out of this definition is how we tell if a given binary number like 1101 is supposed to be a positive number (13) or a negative number (−3). The tradition is to base this on the highest-order bit in the number representation (1 for negative, 0 for non-negative); if 1101 is stored in a 4-bit slot it’s negative, but if 00001101 is stored in an 8-bit slot it’s positive.
It is unusual to call a binary digit a digit; usually it is called a bit instead.
Many programming languages allow binary numbers to be written as literals by preceding them with `0b`, as in `0b101010100`. This notation was mostly added to languages in the 2010s and is uncommon in older code.
Humans have difficulty making sense of long sequences of digits, so we tend to group them. For example, many decimal numbers are written with groups of 3 digits and some kind of separator between groups. Each group has 1000 possible values, making the separated number a type of base-thousand place-value number with digits 000 through 999.
For binary, groups of 3 digits form a base-8 (octal) number; instead of writing those digits as 000 through 111 we write them 0 through 7. Groups of 4 digits are more common, forming base-16 (hexadecimal) numbers, where instead of writing 0000 through 1111 we use the digits {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F}. Both upper- and lower-case hex digits are common; 1A = 1a.
Many programming languages allow octal numbers to be written as literals by preceding them with `0`, as in `0524`; this notation is confusing, and `0o` is becoming more common in recent languages, as in `0o524`. Every programming language I’ve used allows hexadecimal numbers to be written as literals by preceding them with `0x`, as in `0x154`.
Groups of 3 bits are often called octal digits. Groups of 4 bits are called either hex digits or nibbles (there is no standard spelling; nybble, nyble, and nybl are all also used), with hex digit being more common when writing it down and nibble being more common when referring to it as part of data.
In practice it’s uncommon to use values as small as a bit or even a nibble; computers usually handle groups of 8 bits, called bytes or octets (octet is more precise, but byte is far more common). The resulting base-256 number system has no standard name and no common programming-language literal syntax.
Power-of-two numbers show up in many places in hardware and hardware-interfacing software. It is worth learning the vocabulary used to express them.
Value | base-10 | Suffix | Pronounced |
---|---|---|---|
2^10 | 1,024 | Ki | Kilo |
2^20 | 1,048,576 | Mi | Mega |
2^30 | 1,073,741,824 | Gi | Giga |
2^40 | 1,099,511,627,776 | Ti | Tera |
2^50 | 1,125,899,906,842,624 | Pi | Peta |
2^60 | 1,152,921,504,606,846,976 | Ei | Exa |
In all cases above, the i is sometimes dropped to just say (e.g.) G instead of Gi. The i clarifies that we mean powers of 2 like 1024, not powers of 10 like 1000. G could mean either 2^30 or 10^9, numbers that differ by about 7%; which one is meant can only be guessed from context unless the clarifying i is present.
Most software developers I know have memorized the powers of two between 2^0 and 2^10. This allows them to efficiently recognize and work with all powers of two using a trick to split larger powers into a suffix and a smaller power. For example, 2^27 = 2^7 · 2^20 = 128Mi. This pattern works for any power of 2: the 1’s digit of the exponent becomes a number, the 10’s digit becomes a suffix. Thus
Value | Split | Written |
---|---|---|
2^27 | 2^7 · 2^20 | 128Mi |
2^3 | 2^3 · 2^0 | 8 |
2^39 | 2^9 · 2^30 | 512Gi |
Logarithms with base 2 (written log2(n) or lg(n) or sometimes just log(n)) do the inverse of this: lg(64Gi) = lg(2^6 · 2^30) = lg(2^36) = 36.
Fill in the rest of the following table.
Exponent | Written As |
---|---|
17 | 128Ki |
3 | |
38 | |
11 | |
| 256Mi |
| 16Gi |
32 | |
Structures store several distinct values sequentially.
Most structures are understood statically, meaning that the type of value tells us everything we need to know about what values are stored in what order. For example,
```c
struct {
    int id;
    float income;
    char *name;
};
```
tells us there’s 4 bytes of `id`, then 4 bytes of `income`, then 8 bytes of pointer to `name`.
Some structures are actually discriminated unions: depending on what value you find first, you might expect different values after it. For example,
```c
struct {
    int kind;
    union {
        double number;
        char *string;
    };
};
```
always has a 4-byte `kind`, but after that might be either an 8-byte `number` or an 8-byte pointer to a `string`; we’d need to look at `kind` and compare it to some design to know which second value the structure has.
Arrays are lists stored sequentially. There are three common ways of indicating how many things are in an array.
Some arrays have a static length, meaning you know how many elements are in it based on some external knowledge, not based on the stored data itself. For example, the `int` datatype tells the compiler to look for a list of 32 bits (i.e. 4 bytes).
Some arrays have a stored length, meaning they are actually made of two parts: one part is a number indicating the length of the other part, which is the actual array.
Some arrays have a sentinel value or values: something in the array which indicates either “you’ve passed the end” or “this is the end.”
`char *`

In C, strings are stored as `char *`: a pointer to an array of bytes. The array’s length is indicated with a sentinel value at the end of the array; in particular, the byte 0.
The bytes in the array are themselves two-part structures, each structure being one byte in size. The two parts split at the highest-order 0 bit in the byte. The number of bytes in the array is indicated with both a length and a sentinel. The length is the number of bits in the first part of the first byte’s structure, or 1 if that part has no bits (the first structure will never have a one-bit first part). The sentinel is that every structure except the first has a one-bit first part, so if a structure without a one-bit first part is encountered, the array has ended.
The bits in the second parts encode a larger number; concatenate them to find that number, with the most-significant bits coming from the first struct.
The number is a code point in the very large enumerated value set called Unicode.
Each code point is part of an array of code points that collectively define one grapheme, meaning a visual component of typewritten language (this isn’t quite true: some code points describe other things like auditory bells or character deletions, but they are rarely used). These arrays are handled by sentinel values, where most code points mark the end of a list of code points contributing to a grapheme but a few don’t.
Most numbers we use in programming, including both signed and unsigned `short`, `int`, `long`, and `long long`, as well as `float` and `double`, have more bits than can fit in a single byte. We break those bits into bytes and put them in adjacent spots in memory, but there are two orders we could do that in.
Big-endian numbers put the big end of the number (the most-significant place value) first, meaning at the smallest address. This is similar to how decimal numbers are written in left-to-right languages like English and is the order used in most network standards and protocols.
Little-endian numbers put the little end of the number (the least-significant place value) first, meaning at the smallest address. This is similar to how decimal numbers are written in right-to-left languages like Arabic and is the order used in most desktop and notebook computer hardware.
Suppose I store `int date = 20240118` in memory. I’m running a little-endian computer, so it will be stored in little-endian byte order.
An `int` needs 4 bytes, so let’s store it in addresses `0xffff80` through `0xffff83`. First we convert to hexadecimal (20240118 == `0x134d6f6`), then break it into the bytes `01 34 d6 f6`, and then put those in memory little-end first:
Address | Value | Note |
---|---|---|
ffff80 | 0xf6 | lowest-order byte |
ffff81 | 0xd6 | |
ffff82 | 0x34 | |
ffff83 | 0x01 | highest-order byte |
It’s common to display memory either with small addresses at the bottom or with small addresses at the top; feel free to use either orientation without changing the table’s meaning.
Suppose I store `short date[3] = {0x2024, 0x1, 0x18};` in memory. I’m running a little-endian computer, so it will be stored in little-endian byte order, but the order of elements in an array is always smallest index at smallest address.

A `short` needs 2 bytes and there are three `short`s in the array, so let’s store it in addresses `0xffff80` through `0xffff85`.
Address | Value | Notes |
---|---|---|
ffff80 | 0x24 | Low-order byte of first value |
ffff81 | 0x20 | High-order byte of first value |
ffff82 | 0x01 | Low-order byte of second value |
ffff83 | 0x00 | High-order byte of second value |
ffff84 | 0x18 | Low-order byte of third value |
ffff85 | 0x00 | High-order byte of third value |
It’s common to display memory either with small addresses at the bottom or with small addresses at the top; feel free to use either orientation without changing the table’s meaning.
Suppose I send `short date[3] = {0x2024, 0x1, 0x18};` over a network. Networks assume big-endian byte order, but the order of elements in an array is always smallest index first. I would send the bytes in the following order: `20 24 00 01 00 18`.
Suppose each of the following `x`s is stored in memory starting at address `0x10000`. What is the fifth byte of each, that is, the byte stored at address `0x10004`?
Thing to store | Fifth byte |
---|---|
`uint64_t x = 0x0102030405060708uL` | 0x |
`int x[3] = {1,2,3}` | 0x |
`short x[5] = {1,2,3,4,5}` | 0x |
`short x[5] = {0x110,0x220,0x330,0x440,0x550}` | 0x |
`char x[8] = {1,2,3,4,5,6,7,8}` | 0x |
`struct { char a; char b; short c; int e; } x; x = {1, 2, 0x304, 0x5060708}` | 0x |
Both memory (RAM) and files (disk) present themselves to us as an array of bytes. In memory, we call these indices addresses or pointers, as follows:
The address of a byte in memory is the index of that byte in the array of bytes that is memory. (We’ll see later in the course that there are two different addresses, physical and virtual, but both fit this same definition.)
A pointer is a value we intend to use as the address of something.
The address of a multi-byte value comprised of multiple adjacent bytes is the address of its first byte, meaning the byte with the numerically-smallest address.
Pointers are themselves multi-byte numbers, and hence stored with endianness. The number of bytes per pointer is a fundamental design decision when making a new computer, and is sometimes referenced when describing the processor itself; for example, a 64-bit processor is one that stores 64-bit (8-byte) pointers. (Nuance: it is pointers, not addresses, that these numbers describe. I am writing this on a 64-bit computer, meaning it has 64-bit pointers. However, it has 48-bit addresses: the remaining 16 bits of the pointers are ignored by the processor.)
You’ve programmed with pointers extensively in previous classes, so we won’t add much more about them here.
Sometimes we store data that has to pass through some kind of involved function to retrieve the data we actually want. Three types of functions are particularly common: serialization, compression, and encryption.
It is common to store data in computer memory in a somewhat scattered way, with gaps where nothing is stored and cached values that can speed up future work mixed in with the core information we are storing. When sharing the data over a network or storing it in a file, it is common to extract just the essential information and bundle it up into a compact sequence of bytes. There are multiple names for such a sequence of bytes, but a serialization of the data is a relatively common one.
Functions that serialize data have many forms. Common objectives in their design include human readability, compactness, and speed of serializing and deserializing.
These objectives are generally in conflict with one another, meaning many functions are in use each with its own balance of these objectives.
Suppose my code contains a doubly-linked list of four 16-bit integers with pointers to both the head and tail node of the list. It is likely that the nodes are scattered around memory somewhat based on the order in which I allocated them and other data structures.
A human-readable serialization of this list might be the 20-byte string `[124, 128, 225, 340]`, or 5b 31 32 34 2c 20 31 32 38 2c 20 32 32 35 2c 20 33 34 30 5d.
A compact serialization of this list might be the 8 bytes of the values in a little-endian array, or 7c 00 80 00 e1 00 54 01.
A fast-to-run serialization of this list might move the nodes together but keep all ten 32-bit pointers (as well as the four 16-bit values) in the serialization, using at least 48 bytes.
Lossless compression functions apply an invertible function to data when storing it and the function’s inverse when reading the stored data. They pick the function with the hope that common values will result in smaller results than uncommon values will. The quality of lossless compression depends on how accurately it estimates what values will be common; values it expects will be uncommon tend to get bigger, not smaller, when compressed.
Suppose I expect most bytes to be small numbers like 0, 1, and 2, with only rare larger values. I might then compress a string of bytes into a string of bits as follows: for each byte, write as many `1` bits as the byte has significant bits, then a `0` bit, then the bits from the byte less significant than the most-significant 1 bit.

byte | bits |
---|---|
---|---|
00000000 | 0 |
00000001 | 10 |
00000010 | 1100 |
00000011 | 1101 |
00000100 | 111000 |
00000101 | 111001 |
… | … |
11111101 | 1111111101111101 |
11111110 | 1111111101111110 |
11111111 | 1111111101111111 |
If we’re correct and most bytes are 0 or 1, the resulting bit string could be less than a quarter as long as the uncompressed bytes. If we’re wrong and most bytes are larger numbers, this compression could double the length of the value.
There are also functions called lossy compression. Conceptually these are the composition of two functions: one changes the input to a different but similar value that will take less space to store, and the other is a lossless compression function. Similarity is tricky to define, and most successful lossy compression functions are each limited to a single domain (such as audio or video) where research has established a good model of similarity.
Encryption functions are invertible functions that take an input byte sequence and produce an output byte sequence that generally has the same or similar size but that looks like a nonsensical string of random bits. Only people who know the secret of how that particular encryption function worked can invert the function and get back meaningful data.
The most common encryption functions are not themselves secret; instead, they have a special parameter (called a key) that is secret and cannot be readily guessed: that is, it is long enough that brute-force trying all combinations takes too long to be practical, and the function uses it in a way that makes deriving the key from the function and the encrypted output infeasible.