Bits, Bytes, and Composite Values

In more than 70 years of software design, we have only found four useful ways to store data:

  1. As enumerated values, encoded in bits
  2. As composite values, stored through adjacency
  3. As references, via pointers or linked data structures
  4. As functions that must be applied to recover the data we want

You learned about reference storage (via pointers or linked data structures) in CS 225 and earlier courses. Let’s focus on the other types.

1 Bits

Most digital computers use something that has two distinguishable states. That might be positive vs negative magnetic polarity, or high vs low voltage, or smooth vs rough optical surface, or bright vs dark light, or any number of other distinguishable pairs.

These two values can be mapped, via enumeration, to any 2-value set, most commonly either {False, True} or {0, 1}.

2 Binary integers

A sequence of n values from a b-value set has b^n possible values. A convenient way to structure those is using place-value numbers. The numeric value of an element of the sequence (s_0, s_1, s_2, ...) is s_0·b^0 + s_1·b^1 + s_2·b^2 + .... You’ve likely been using this model with the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} (which has b = 10) since you were a child: it’s currently the dominant number system in almost every nation worldwide.

Computers have ready access to a set of bits, which can be treated as meaning {0, 1}, and thus place-value numbers with b = 2 are dominant in computers. Everything you know about base-ten decimal numbers also applies to base-two binary numbers, provided you replace any part of a decimal algorithm that references ten with a reference to two instead.
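
As a sketch, the place-value formula can be computed directly in C; place_value is a made-up name, and digits are given lowest-order first:

```c
#include <stddef.h>

/* Evaluate a place-value number: digits[0] is the least-significant
   digit s_0, so the result is s_0*b^0 + s_1*b^1 + s_2*b^2 + ... */
unsigned long place_value(const unsigned *digits, size_t n, unsigned base) {
    unsigned long value = 0;
    unsigned long weight = 1;              /* holds base^i, starting at 1 */
    for (size_t i = 0; i < n; i++) {
        value += digits[i] * weight;
        weight *= base;
    }
    return value;
}
```

The same function handles binary and decimal: only the base argument changes.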

Negative numbers in a computer

You’ve likely learned to handle negative numbers using an explicit sign, with computation algorithms that have different cases for different signs. But that’s not necessary if we have a fixed number of digits.

The decimal number −340 is defined to be the number which, when added to 340, gives 0. If we only have six digits, that number is 999660: no negative sign needed. The 7-digit sum 340 + 999660 = 1000000 which, when truncated to 6 digits, is 000000 = 0.

The eight-digit binary number 00101100’s 8-bit negation is 11010100. Finding this is as simple as flipping all bits, then adding 1; or equivalently leaving all low-order 0s and the lowest-order 1 unchanged and flipping everything else. This approach to storing negative binary integers is called two’s complement.

Left out of this definition is how we tell if a given binary number like 1101 is supposed to be a positive number (13) or a negative number (−3). The tradition is to base this on the highest-order bit in the number representation (1 for negative, 0 for non-negative); if 1101 is stored in a 4-bit slot it’s negative, but if 00001101 is stored in an 8-bit slot it’s positive.
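
A minimal sketch of both rules in C; negate8 and as_signed8 are made-up names for illustration:

```c
#include <stdint.h>

/* Two's-complement negation of an 8-bit value: flip all bits, add 1.
   The uint8_t return type truncates to 8 bits, mirroring the
   fixed-number-of-digits idea in the text. */
uint8_t negate8(uint8_t x) {
    return (uint8_t)(~x + 1);
}

/* Read an 8-bit pattern as signed two's complement: a high-order bit
   of 1 means the value is negative. */
int as_signed8(uint8_t x) {
    return (x & 0x80) ? (int)x - 256 : (int)x;
}
```

For example, negate8(0x2C) gives 0xD4 (00101100 becomes 11010100), and the same pattern 1101 padded to 11111101 reads as −3 while 00001101 reads as +13.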

It is unusual to call a binary digit a digit; usually it is called a bit instead.

Many programming languages allow binary numbers to be written as literals by preceding them with 0b, as 0b101010100. This notation was mostly added to languages in the 2010s and is uncommon in older code.

3 Base 16 and Base 256

Humans have difficulty making sense of long sequences of digits, so we tend to group them. For example, many decimal numbers are written with groups of 3 digits and some kind of separator between groups. These groups have 10^3 possible values, making the separated number a type of base-thousand place-value number with digits 000 through 999.

For binary, groups of 3 digits form a base-8 octal number, and instead of writing those digits as 000 through 111 we write them 0 through 7 instead. Groups of 4 digits are more common, forming base-16 hexadecimal numbers, where instead of writing 0000 through 1111 we use the digits {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F} (both upper- and lower-case hex digits are common; 1A = 1a).

Many programming languages allow octal numbers to be written as literals by preceding them with 0, as 0524; this notation is confusing, and 0o is becoming more common in recent languages, as 0o524. Every programming language I’ve used allows hexadecimal numbers to be written as literals by preceding them with 0x, as 0x154.
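
For example, the C literals below all denote the same value, 340 (C uses a bare leading 0 for octal; binary 0b literals are only standard as of C23, so they are omitted here):

```c
/* The same value, 340, written with different literal notations. */
int in_decimal = 340;
int in_octal   = 0524;    /* 5*64 + 2*8 + 4 */
int in_hex     = 0x154;   /* 1*256 + 5*16 + 4 */
```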

Groups of 3 bits are often called octal digits. Groups of 4 bits are called either hex digits or nibbles (no standard spelling: nybble, nyble, and nybl are all also used), with hex digit being more common when writing it down and nibble being more common when referring to it as part of data.

In practice it’s uncommon to use values as small as a bit or even a nibble; computers usually handle groups of 8 bits, called bytes or octets (octet is more precise, but byte is far more common). The resulting base-256 number system has no standard name and no common programming-language literal syntax.

4 Power-of-1024 suffixes

Power-of-two numbers show up in many places in hardware and hardware-interfacing software. It is worth learning the vocabulary used to express them.

Value  base-10                     Suffix  Pronounced
2^10   1024                        Ki      Kilo
2^20   1,048,576                   Mi      Mega
2^30   1,073,741,824               Gi      Giga
2^40   1,099,511,627,776           Ti      Tera
2^50   1,125,899,906,842,624       Pi      Peta
2^60   1,152,921,504,606,846,976   Ei      Exa

In all cases above, the i is sometimes dropped to just say (e.g.) G instead of Gi. The i clarifies that we mean powers of 2 like 1024, not powers of 10 like 1000. G could mean either 2^30 or 10^9, numbers that differ by about 7%; which one is meant can only be guessed from context unless the clarifying i is present.

Most software developers I know have memorized the powers of two between 2^0 and 2^10. This allows them to efficiently recognize and work with all powers of two using a trick to split larger powers into a suffix and a smaller power. For example, 2^27 = 2^7 · 2^20 = 128Mi. This pattern works for any power of 2: the 1’s digit of the exponent becomes a number, the 10’s digit becomes a suffix. Thus

Value  Split        Written
2^27   2^7 · 2^20   128Mi
2^3    2^3 · 2^0    8
2^39   2^9 · 2^30   512Gi

Logarithms with base 2 (written log2(n) or lg(n) or sometimes just log(n)) do the inverse of this: lg(64G) = lg(2^6 · 2^30) = lg(2^36) = 36.
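
Both tricks can be sketched in code; multiplier, suffix, and lg_exact are made-up names, and lg_exact assumes its input is an exact power of two:

```c
#include <stdint.h>

/* Split 2^e into a small multiplier and a power-of-1024 suffix:
   the 1's digit of the exponent gives the multiplier, the 10's
   digit picks the suffix, e.g. 2^27 = 2^7 * 2^20 = 128Mi. */
static const char *suffixes[] = {"", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei"};

uint64_t multiplier(unsigned e) { return 1ULL << (e % 10); }
const char *suffix(unsigned e)  { return suffixes[e / 10]; }

/* lg of an exact power of two: count the halvings back to 1. */
unsigned lg_exact(uint64_t n) {
    unsigned e = 0;
    while (n > 1) { n >>= 1; e++; }
    return e;
}
```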

Fill in the rest of the following table.

Exponent  Written As
17        128Ki
3
38
11
          256Mi
          16Gi
          32

Answers: 8, 256Gi, 2Ki, 28, 34, 5

5 Composition through adjacency

5.1 Structures

Structures store several distinct values sequentially.

Most structures are understood statically, meaning that the type of value tells us everything we need to know about what values are stored in what order. For example,

struct {
  int id;
  float income;
  char *name;
}

tells us there’s 4 bytes of id, then 4 bytes of income, then (on a typical 64-bit machine) 8 bytes of pointer to name.

Some structures are actually discriminated unions: depending on what value you find first, you might expect different values after it. For example,

struct {
  int kind;
  union {
    double number;
    char *string;
  };
}

always has a 4-byte kind, but after that might be either an 8-byte number or an 8-byte pointer to a string; we’d need to look at kind and compare it to some design to know which second value the structure has.
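
A sketch of how such a structure might be declared and used; the kind codes and helper names here are invented for illustration:

```c
/* A discriminated union: kind records which union member is live. */
enum { NUMBER_KIND = 1, STRING_KIND = 2 };

struct value {
    int kind;                /* discriminant: says which member is live */
    union {
        double number;
        const char *string;
    } as;
};

struct value make_number(double d) {
    struct value v;
    v.kind = NUMBER_KIND;
    v.as.number = d;
    return v;
}

struct value make_string(const char *s) {
    struct value v;
    v.kind = STRING_KIND;
    v.as.string = s;
    return v;
}

/* Check the discriminant before trusting the union's contents. */
double number_or(struct value v, double fallback) {
    return v.kind == NUMBER_KIND ? v.as.number : fallback;
}
```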

5.2 Arrays

Arrays are lists stored sequentially. There are three common ways of indicating how many things are in an array.

Some arrays have a static length, meaning you know how many elements are in it based on some external knowledge, not based on the stored data itself. For example, the int datatype tells the compiler to look for a list of 32 bits (i.e. 4 bytes).

Some arrays have a stored length, meaning they are actually made of two parts: one part is a number indicating the length of the other part, which is the actual array.

Some arrays have a sentinel value or values: something in the array which indicates either you’ve passed the end or this is the end.
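
The three conventions can be sketched in C; the type and function names here are invented for illustration:

```c
#include <stddef.h>

/* Static length: the type alone says there are exactly 4 elements. */
typedef int quad[4];

/* Stored length: a count travels alongside the data. */
struct counted { size_t len; int data[8]; };

/* Sentinel: scan until a reserved value (here 0) marks the end. */
size_t sentinel_length(const int *a) {
    size_t n = 0;
    while (a[n] != 0) n++;
    return n;
}
```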

5.3 Case study: UTF-8 char *

In C, strings are stored as char *; that is, a pointer to an array of bytes.

The array’s length is indicated with a sentinel value at the end of the array; in particular, the byte 0.

The bytes in the array are themselves two-part structures, each one byte in size; the two parts split at the highest-order 0 bit in the byte. The number of bytes in this inner array is indicated both with a length and a sentinel. The length is the number of bits in the first part of the first byte’s structure, or 1 if that part has no bits (the first byte will never have a 1-bit first part). The sentinel is that every byte except the first has a 1-bit first part, so encountering a byte whose first part is not exactly one bit means the inner array has ended.

The bits in the second parts encode a larger number; concatenate them to find that number, with the most-significant bits coming from the first struct.

The number is a code point in the very large enumerated value set called Unicode.
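
That decoding process can be sketched in C, assuming well-formed input; decode_utf8 and first_code_point are made-up names:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one code point from UTF-8, assuming well-formed input.
   The byte count comes from the number of leading 1 bits in the
   first byte (or 1 if there are none); the payload bits below each
   byte's highest-order 0 bit are concatenated, with the first byte's
   bits most significant.  Returns the number of bytes consumed. */
size_t decode_utf8(const unsigned char *s, uint32_t *out) {
    if (s[0] < 0x80) {                     /* 0xxxxxxx: one byte */
        *out = s[0];
        return 1;
    }
    size_t len = 2;                        /* count the leading 1 bits */
    while (s[0] & (0x80 >> len)) len++;
    uint32_t cp = s[0] & (0x7F >> len);    /* payload bits after the 0 */
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);    /* 10xxxxxx continuation bits */
    *out = cp;
    return len;
}

/* Convenience wrapper: the code point of a string's first character. */
uint32_t first_code_point(const char *s) {
    uint32_t cp;
    decode_utf8((const unsigned char *)s, &cp);
    return cp;
}
```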

Each code point is part of an array of code points that collectively define one grapheme, meaning a visual component of typewritten language (this isn’t quite true: some code points describe other things like auditory bells or character deletions, but they are rarely used). These arrays use sentinel values: most code points mark the end of a list of code points contributing to a grapheme, but a few don’t.

5.4 Multi-byte numbers and Endianness

Most numbers we use in programming, including both signed and unsigned short, int, long, and long long, as well as float and double, have more bits than can fit in a single byte. We break those bits into bytes and put them in adjacent spots in memory, but we have two orders we could do that in.

Big-endian numbers put the big end of the number (the most-significant place value) first, meaning at the smallest address. This is similar to how decimal numbers are written in left-to-right languages like English and is the order used in most network standards and protocols.

Little-endian numbers put the little end of the number (the least-significant place value) first, meaning at the smallest address. This is similar to how decimal numbers are written in right-to-left languages like Arabic and is the order used in most desktop and notebook computer hardware.

Suppose I store int date = 20240118 in memory. I’m running a little-endian computer so it will be stored in little-endian byte order.

An int needs 4 bytes, so let’s store it in addresses 0xffff80 through 0xffff83. First we convert to hexadecimal (20240118 == 0x134d6f6), then break it into bytes 01 34 d6 f6, and then put those in memory little-end first

Address  Value  Note
ffff80   0xf6   lowest-order byte
ffff81   0xd6
ffff82   0x34
ffff83   0x01   highest-order byte

It’s common to display memory either with small addresses at the bottom or at the top; feel free to use either order without changing the table’s meaning.
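
One way to see a value’s byte order is to copy its in-memory representation into a byte array; byte_at is a made-up helper, and which byte comes first depends on the machine you run it on:

```c
#include <stdint.h>
#include <string.h>

/* Copy a value's in-memory representation into a byte array and
   return byte i of it.  On a little-endian machine byte 0 is the
   lowest-order byte; on a big-endian machine it is the highest. */
unsigned byte_at(uint32_t value, int i) {
    unsigned char bytes[4];
    memcpy(bytes, &value, sizeof bytes);
    return bytes[i];
}
```

On the little-endian machines described above, byte_at(20240118, 0) would be 0xf6, matching the first row of the table.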

Suppose I store short[3] date = {0x2024, 0x1, 0x18}; in memory. I’m running a little-endian computer so it will be stored in little-endian byte order, but the order of elements in an array is always smallest index at smallest address.

A short needs 2 bytes and there are three shorts in the array, so let’s store them in addresses 0xffff80 through 0xffff85.

Address  Value  Notes
ffff80   0x24   Low-order byte of first value
ffff81   0x20   High-order byte of first value
ffff82   0x01   Low-order byte of second value
ffff83   0x00   High-order byte of second value
ffff84   0x18   Low-order byte of third value
ffff85   0x00   High-order byte of third value

It’s common to display memory either with small addresses at the bottom or at the top; feel free to use either order without changing the table’s meaning.

Suppose I send short[3] date = {0x2024, 0x1, 0x18}; over a network. Networks assume big-endian byte order, but the order of elements in an array is always smallest index at smallest address.

I would send the bytes in the following order:

  1. 0x20
  2. 0x24
  3. 0x00
  4. 0x01
  5. 0x00
  6. 0x18
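
Code that sends such values typically converts each one to big-endian byte order explicitly, so the result is the same on any host; put_be16 and be16_byte are made-up helpers sketching what htons-plus-copy accomplishes:

```c
#include <stdint.h>

/* Write a 16-bit value in network (big-endian) byte order, whatever
   the host's endianness. */
void put_be16(unsigned char *out, uint16_t v) {
    out[0] = (unsigned char)(v >> 8);     /* high-order byte first */
    out[1] = (unsigned char)(v & 0xFF);
}

/* Convenience wrapper for inspecting one of the two bytes. */
unsigned be16_byte(uint16_t v, int i) {
    unsigned char buf[2];
    put_be16(buf, v);
    return buf[i];
}
```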

Suppose each of the following xs is stored in memory starting at address 0x10000. What is the fifth byte of each, that is, the byte stored at address 0x10004?

Thing to store                                  Fifth byte
uint64_t x = 0x0102030405060708uL               0x
int x[3] = {1,2,3}                              0x
short x[5] = {1,2,3,4,5}                        0x
short x[5] = {0x110,0x220,0x330,0x440,0x550}    0x
char x[8] = {1,2,3,4,5,6,7,8}                   0x
struct { char a; char b; short c; int e; } x;
x = {1, 2, 0x304, 0x5060708}                    0x

Little-endian answers: 0x04, 0x02, 0x03, 0x30, 0x05, 0x08
Big-endian answers: 0x05, 0x00, 0x00, 0x03, 0x05, 0x05

6 Composition with Pointers

Both memory (RAM) and files (disk) present themselves to us as an array of bytes. In memory, we call these indices addresses, and a value holding an address is called a pointer.

Pointers are themselves multi-byte numbers, and hence stored with endianness. The number of bytes per pointer is a fundamental design decision when making a new computer, and is sometimes referenced when describing the processor itself; for example, a 64-bit processor is one that stores 64-bit (8-byte) pointers (nuance: it is pointers, not addresses, that these numbers describe. I am writing this on a 64-bit computer, meaning it has 64-bit pointers; however, it has 48-bit addresses, and the remaining 16 bits of the pointers are ignored by the processor).

You’ve programmed with pointers extensively in previous classes, so we won’t add much more about them here.

7 Functions

Sometimes we store data that has to pass through some kind of involved function to retrieve the data we actually want. Three types of functions are particularly common: serialization, compression, and encryption.

7.1 Serialization

It is common to store data in computer memory in a somewhat scattered way, with gaps where nothing is stored and cached values that can speed up future work mixed in with the core information we are storing. When sharing the data over a network or storing it in a file it is common to extract just the essential information and bundle it up into a compact sequence of bytes. There are multiple names for such a sequence of bytes, but serialization of the data is a relatively common one.

Functions that serialize data have many forms. Common objectives in their design include

  - being easy for humans to read
  - being compact, taking few bytes
  - being fast to produce and parse

These objectives are generally in conflict with one another, meaning many functions are in use, each with its own balance of these objectives.

Suppose my code contains a doubly-linked list of four 16-bit integers with pointers to both the head and tail node of the list. It is likely that the nodes are scattered around memory somewhat based on the order in which I allocated them and other data structures.

A human-readable serialization of this list might be the 20-byte string [124, 128, 225, 340], or 5b 31 32 34 2c 20 31 32 38 2c 20 32 32 35 2c 20 33 34 30 5d.

A compact serialization of this list might be the 8 bytes of the values in a little-endian array, or 7c 00 80 00 e1 00 54 01.

A fast-to-run serialization of this list might move the nodes together but keep all ten 32-bit pointers (as well as the four 16-bit values) in the serialization, using at least 48 bytes.
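
The compact serialization above can be sketched in C; the node layout and function names are invented for illustration, and each 16-bit value is written little-endian with all pointers dropped:

```c
#include <stdint.h>
#include <stddef.h>

/* A hypothetical doubly-linked node holding one 16-bit value. */
struct node { uint16_t value; struct node *prev, *next; };

/* Compact serialization: walk the list and write each value as two
   little-endian bytes, dropping every pointer.  Returns the byte
   count (2 per node). */
size_t serialize(const struct node *head, unsigned char *out) {
    size_t n = 0;
    for (; head != NULL; head = head->next) {
        out[n++] = (unsigned char)(head->value & 0xFF);  /* low byte */
        out[n++] = (unsigned char)(head->value >> 8);    /* high byte */
    }
    return n;
}

/* Helper: build and serialize the text's list {124, 128, 225, 340}
   and return byte i of the 8-byte result. */
unsigned example_byte(size_t i) {
    static struct node nodes[4];
    static const uint16_t vals[4] = {124, 128, 225, 340};
    static unsigned char out[8];
    for (size_t k = 0; k < 4; k++) {
        nodes[k].value = vals[k];
        nodes[k].prev = k ? &nodes[k - 1] : NULL;
        nodes[k].next = k + 1 < 4 ? &nodes[k + 1] : NULL;
    }
    serialize(nodes, out);
    return out[i];
}
```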

7.2 Compression

Lossless compression functions apply an invertible function to data when storing it and the function’s inverse when reading the stored data. They pick the function with the hope that common values will result in smaller results than uncommon values will. The quality of lossless compression depends on how accurately it estimates what values will be common; values it expects will be uncommon tend to get bigger, not smaller, when compressed.

Suppose I expect most bytes to be small numbers like 0, 1, and 2 with only rare larger values. I might then compress a string of bytes into a string of bits as follows:

  1. Find the place of the most-significant 1 bit in the byte, counting the lowest-order place as place 1; call that place n (with n = 0 if the byte is all zeros)
  2. Output n 1 bits, then a 0 bit, then the bits from the byte less significant than the most-significant 1 bit.
byte      bits
00000000  0
00000001  10
00000010  1100
00000011  1101
00000100  111000
00000101  111001
11111101  1111111101111101
11111110  1111111101111110
11111111  1111111101111111

If we’re correct and most bytes are 0 or 1, this resulting bit string could be less than a quarter as long as the uncompressed bytes. If we’re wrong and most bytes are larger numbers, this compression could double the length of the value.
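The scheme above can be sketched in C; compress_byte and compressed are made-up names, and the bits are written as '0'/'1' characters for readability:

```c
#include <stddef.h>
#include <string.h>

/* Encode one byte with the scheme above.  n is the 1-based place of
   the byte's most-significant 1 bit (0 for the byte 00000000); the
   output is n 1-bits, a 0 bit, then the n-1 bits below that leading
   1.  Returns the number of output bits. */
size_t compress_byte(unsigned b, char *out) {
    unsigned n = 0, t = b;
    while (t) { n++; t >>= 1; }            /* n = bit length of b */
    size_t len = 0;
    for (unsigned i = 0; i < n; i++) out[len++] = '1';
    out[len++] = '0';
    for (unsigned i = n; i >= 2; i--)      /* bits below the leading 1 */
        out[len++] = ((b >> (i - 2)) & 1) ? '1' : '0';
    out[len] = '\0';
    return len;
}

/* Convenience wrapper returning the encoding as a string. */
const char *compressed(unsigned b) {
    static char buf[17];                   /* worst case: 16 bits + NUL */
    compress_byte(b, buf);
    return buf;
}
```

Running it on the bytes in the table reproduces the listed bit strings: 0 becomes "0", 3 becomes "1101", and 255 becomes the 16-bit string "1111111101111111".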

There are also functions called lossy compression. Conceptually these are the composition of two functions: one changes the input to a different but similar value that will take less space to store, and the other is a lossless compression function. Similarity is tricky to define, and most successful lossy compression functions are each limited to a single domain (such as audio or video) where research has established a good model of similarity.

7.3 Encryption

Encryption functions are invertible functions that take an input byte sequence and produce an output byte sequence that generally has the same or similar size but that looks like a nonsensical string of random bits. Only people who know the secret of how that particular encryption function worked can invert the function and get back meaningful data.

The most common encryption functions are not themselves secret; instead, they have a special parameter (called a key) that is secret and cannot be readily guessed: that is, it is long enough that brute-force trying all combinations takes too long to be practical and the function uses it in a way that makes deriving the key from the function and the encrypted output infeasible.