Bits and Beyond

Claude Shannon founded the field of information theory. A core fact in information theory is that there is a basic unit of information, called a bit1 a portmanteau of ”binary digit” or a bit of entropy. Roughly speaking, a bit is an amount of information that is about as surprising as the result of a single coin flip. In the phrase thank you the word you has less than a bit of entropy; most of the time someone says thank the next word they say is you, so adding that word provides very little more information. On the other hand, the word that comes after is in Hello, my name is has many bits of entropy; no matter what the word is, it was quite surprising.

Claude Shannon performed an experiment to determine the bits of entropy of the average letter in English. You can perform a similar experiment to measure the bits of entropy of a human language of your choice.

  1. Find a large corpus of example text (the larger the better)
  2. Write a program that repeatedly
    1. selects a random substring from the corpus, long enough to give a little context (perhaps 50 characters)
    2. prompt the user to type what the next character will be
    3. track the number of correct and incorrect guesses

After guessing correctly on r out of g total tries, an estimate of the bits of entropy per character in the corpus is log2(g ÷ r).

Not all languages have the same information per character. Shannon suggested English had roughly 1 bit per letter2 Shannon, Cluade E. (1950), Prediction and Entropy of Printed English, Bell Systems Technical Journal (3) pp. 50–64., while languages that use ideograms have many more.

It is likely that bits per second of speaking is more constant across languages, which could probably be tested by measuring the bits per character of subtitles and combing it with the timing information the subtitles contain, but I am unaware of any published effort to do this or any related measurement.

1 Digital Information

How much information can we transmit over a wire? If we put voltage on one end, because wire conducts well we’ll very soon see the same voltage at the other end. The more precise our measurement of that voltage, the more information we can collect. If we can distinguish between 1000 voltage levels, for example, we’ll get log2(1000) ≈ 10 bits of information per measurement; whereas a less sensitive voltmeter that can only distinguish between two voltage levels gets only log2(2) = 1 bit per reading. This seems to suggest that the more precise I can make the voltage generator and the more sensitive the voltmeter, the more information I can transmit per reading.

The problem with this assumption is that it takes longer to transmit higher-resolution data. This is partly a consequence of fundamental mathematical and physical laws, like the Heisenberg uncertainty principle, but it is also more tellingly a consequence of living in a noisy world. To tell the difference between 8.35 and 8.34 volts, you need to ensure that the impact of wire quality and the environment through which it passes contributes significantly less than 0.01 voltage error to the measurement; generally this requires watching the line for long enough to see what is the noisy variation and what is the true signal. By contrast, telling the difference between 10 volts and 0 volts is much simpler and much more robust against noise. It is quite possible to make several dozen good 1-bit measurements in the time it’d take to make one 10-bit measurement.

This observation led to advances in digital signals: signals composed of a large number of simple digits3 ”Digit” comes from a Latin word meaning ”finger” and, prior to the computer age, was used to refer to the ten basic numerals 0 through 9. ”Digital” meaning ”a signal categorized into one of a small number of distinct states” became common in the 1960s, though it was used by computing pioneers as early as the late 1930s. instead of one fine-grained analog4 ”Analog” (or ”Analogue” outside the USA) comes from a word meaning ”similar to” or ”along side”, suggesting that an analog signal has some direct correlation to the thing it represents. signal. We can communicate much more information with much less error by transmitting a large number of single-bit signals than we can by transmitting a smaller number of signals with more information in each.

2 Saying Anything with Bits

Information theory asserts that any information can be presented in any a format that contains enough bits. There are many interesting theorems and algorithms about how this can best be done, but for most of computing we only need to consider a handful of simple encodings, described in the subsections below.

2.1 Place-value numbers

2.1.1 Base-10, decimal

When you were still a small child you were taught a place-value based numbering system. Likely at that age you never considered why we called it place-value, but the name is intended to suggest that the value of each digit depends not only on the digit itself, but also on its placement within the string of digits. Thus in the number string 314109, the first 1’s value is a hundred times larger than the second 1’s value. If we write out the full place values

105 104 103 102 101 100
3 1 4 1 0 9

we can see that the number’s total value is 3 × 105 + 104 + 4 × 103 + 102 + 9 × 100 or three hundred fourteen thousand one hundred nine.

2.1.2 Base-2, binary

There is no particular magic to the 10s in the decimal example above. Because it is easier to distinguish two things than ten, we might reasonably try to use 2s instead of 10s. Instead of the ten digits 0 through 9 we use the two digits 0 through 1. Thus in the base-2 number string 110101 the first 1’s value is thirty-two times larger than the last 1’s value. If we write out the full place values

25 24 23 22 21 20
1 1 0 1 0 1

we can see that the number’s total value is 25 + 24 + 22 + 20 or fifty-three.

Base-2 numbers of this sort are widely used to convert sequences of single-bit data into a larger many-bit datum. Indeed, we call a binary digit a bit, the same word information theory uses to describe the fundamental unit of information, and often refer to any single-bit signal as being a 1 or 0 even if it is more accurately a high or low voltage, the presence or absence of light in a fiber, a smooth or pitted region on an optical disc, or any of the wide range of other physical signals.

Math works the same in base-2 as it does in base-10 except it is two, not ten, that causes a carry digit:

            1      11      11    1 11    1 11
  1011    1011    1011    1011    1011    1011
+ 1001  + 1001  + 1001  + 1001  + 1001  + 1001
------  ------  ------  ------  ------  ------
             0      00     100    0100   10100

Binary is useful for hardware because it takes less space and power to distinguish between two voltage states (high and low) than between three or more. However, for most uses we find it more useful to cluster groups of bits together, typically either in 4-bit clusters called nibbles or nybbles or in 8-bit clusters called bytes or octets.

2.1.3 Hexadecimal

Base-16, or hexadecimal, is useful because humans can read it without getting as easily lost in the long sequence of 0s and 1s, but it has a trivial mapping to the binary that hardware actually stores.

Hexadecimal digits are taken from the set of nibbles: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}. Place-value notation is used with base 16. Thus the number d0216 (also written 0xd02) represents the number

162 161 160
d 0 2

d × 162 + 2 × 160 = 13×256 + 2×1 = 3328 + 2 = 3330

Hexadecimal is particularly noteworthy because it is easy to convert to and from binary: each nibble is simply 4 bits. Thus

d 0 2
1101 0000 0010

Math works the same in base-16 as it does in base-10 except it is sixteen, not ten, that causes a carry digit:

            1       1       1    1  1    1  1
  d0b2    d0b2    d0b2    d0b2    d0b2    d0b2
+ 300f  + 300f  + 300f  + 300f  + 300f  + 300f
------  ------  ------  ------  ------  ------
             1      c1     0c1    00c1   100c1

2.1.4 Bytes or octets

Base-256 uses octets or bytes, each being 2 nibbles or 8 bits. Octets are typically written using pairs of nibbles directly; thus the digits are taken from the set of bytes: {00, 01, 02, … fd, fe, ff}. Humans almost never do math directly in base-256.

Octets are noteworthy because, while processor mathematical circuits use base-2, most of the rest of computing is based on octets instead. Memory, disk, network, datatypes, assembly—all used bytes, not bits, as the smallest5 This is a gross over-simplification. A serial connection communicates in bits, a parallel in a fixed number of bits equal to the number of wires, etc. However, most protocols and interfaces present a byte-based interface. unit of communication.

2.2 Base-2 logs and exponents

Power-of-two numbers show up in many places in hardware and hardware-interfacing software. It is worth learning the vocabulary used to express them.

Value base-10 Short form Pronounced
210 1024 Ki Kilo
220 1,048,576 Mi Mega
230 1,073,741,824 Gi Giga
240 1,099,511,627,776 Ti Tera
250 1,125,899,906,842,624 Pi Peta
260 1,152,921,504,606,846,976 Ei Exa

In all cases above, the i is sometimes dropped to just say (e.g.) G instead of Gi. The i clarifies that we mean powers of 2 like 1024, not powers of 10 like 1000. G could mean either 230 or 109, numbers the differ by about 7%; which one is meant can only be guessed from context unless the clarifying i is present.

Most software developers I know have memorized the powers of two between 20 and 210. This allows them to efficiently recognize and work with all powers of two using a trick to split larger powers into a letter and a smaller power. For example, 227 = 27 220 = 128M. This pattern works for any power of 2: the 1’s digit of the exponent becomes a number, the 10’s digit becomes a letter. Thus

Value Split Written
227 27 220 128M
23 23 20 8
239 29 230 512G

Logarithms with base 2 (written log2(n) or lg(n) or sometimes just log(n)) do the inverse of this: lg(64G) = lg(26 230) = lg(236) = 36.

Fill in the rest of the following table.

Exponent Written As
17 128K
3
38
11
256M
16G
32
Answers: 8, 256G, 2K, 28, 34, 5

2.3 Negative numbers

We are accustomed to having a special symbol that tells us if a number is negative. When communicating in bits with just two symbols using a special symbol is not an option, so several other approaches to representing negative numbers are used.

There are three different ways of encoding negative numbers in common use today: one used for integers, one for floating-point numbers, and one for the exponent part of floating-point numbers. A fourth technique (called one’s complement) is sometimes taught but to the best of my knowledge never used.

The technique used for integers is called two’s complement and depends on access to a finite number of bits/digits/nibbles/bytes to store numbers in. Our examples will assume we have four digits.

Two’s complement picks a number equal to one-half of the maximum number we can write, rounded up, and decides that that number and all numbers bigger than it are negative. How negative? To understand that, we need to observe a ring-like behavior of numbers.

What is a 4-digit number 0000 - 0001? Rather than trying to intuit the answer, let’s pad out the 0000 with a fifth digit (lets arbitrarily pick 5) to make it larger than 0001, giving us 50000 - 0001; then subtract as usual to get 49999; then remove that fifth digit we added to get 9999. That may seem strange, but it does satisfy the 4-digit version of the axiom (x+y)y=x(x + y) - y = x: 9999 + 0001 is 10000 which when converted back to 4 digits is 0000.

The exact same process applied to base-2 instead of base-10 is two’s complement, the version of signed integers used is almost all computers today. Two’s complement is nice because the three most common mathematical operators (addition, subtraction, and multiplication) work the same for signed and unsigned values. Division is messier, but division is always messy.

0000 0 0001 +1 0010 +2 0011 +3 0100 +4 0101 +5 0110 +6 0111 +7 1000 −8 1001 −7 1010 −6 1011 −5 1100 −4 1101 −3 1110 −2 1111 −1

Visualization of two’s complement numbers in binary.

Zero is all zero bits. Adding and subtracting 1 makes numbers one larger or smaller, except at the transition between 01...1 and 10...0.

There is one more negative number than there are positive numbers.

Two’s complement is commonly used for integral values. Swapping sign is equivalent to flipping all bits and adding one to the result.

Fill in the rest of the following table. Assume you are using 6-bit numbers. Answers are in footnotes.

Decimal Two’s-C
5 000101
−5 111011
11
−1
−15
110011
011111
010000
Answers: 001011, 111111, 110001, −13, 31, 16

2.4 Floating-point numbers

At one time there were many alternative representations of non-integral numbers, but IEEE standards 754 and 854 define a particular representation that has received very widespread implementation. Numbers in this and related formats are called IEEE-style floating-point numbers or simply floating-point numbers of floats.

Floats are a binary extension to the idea of scientific notation, and will be covered more in CS 357.

2.5 Numbering everything else

Not all data we care about are numeric. There are both non-numeric primitive types like characters and aggregate types made up of two or more values of other types.

When we want to introduce a primitive type, we need to define a sufficiently large number of bits and then create a mapping between each value of that type and a particular bit sequence. Since every bit sequence can be thought of as representing a number, we could call this process enumeration, but there does not need to be any obvious structure or meaning to how the bit patterns or their number meanings relate to the represented concept.

2.5.1 Character encoding

Text is pervasive in today’s world, and is thus often represented in binary. However, there is no one right way to represent characters as bits. Indeed, it is not easy to even decide what is a character and what is not; Unicode, one of the most extensive of the common character enumerations, has decided that R, R, ℛ, ℜ, and ℝ, are all different characters but that R, R, R, R, and R are all the same character.

Character encodings are complicated in part because there are so many of them and they have so many internal rules and special cases. However, each encoding is just a big table of associations like The Angstrom sign (Å) is character number eight thousand four hundred ninety-one, or 10000100101011 in binary. Designing a character encoding is a lot of work; detecting which one is being used can be tricky; and some use different numbers of bits for different characters, which adds an additional level of programming complexity. That said, most uses of characters can simply rely on some other piece of code to keep track of all the various decisions made by the encoding designers.

One particular character encoding we’ll look at in this class is UTF-8.

2.5.2 Custom encodings

It is common for programmers to wish to represent a set of primitive values other than numbers and characters. For example, OpenGL decided that it wanted a type to represent kind of shape and gave it ten values: a group of points is 0, a group of line segments is 1, a ring of contiguous line segments is 2, an open path of contiguous line segments is 3, and so on.

These kinds of custom encodings are very common, and are generally called enumerations or enums. We’ll see more enums later in this course.

3 Which comes first?

Processors handle numbers in binary. A binary number has a high-order bit and a low-order bit, with many bits in between. It does not intrinsically have a first or last bit; Sometimes we think of the high-order bit as first (for example, when we write it in English, we write the high-order bit first) and sometimes as last (for example, when we do math we get to the high-order bit last). Fortunately, we don’t need to care about the first-last distinction because the processor doesn’t interact with numbers in that way.

In general, computer memory handles numbers in bytes, base-256. A multi-byte number has a high-order byte and a low-order byte, generally with several bytes in between. Again, first and last are not intrinsic when describing these numbers. However, memory does represent each byte as having a location, like an index into a list or array. As such, the processor has to decide, when breaking a larger number into bytes, whether it is going to put the high-order or low-order byte in the first spot in memory. Other parts of computers, such as disks and network drivers, also need to be told bytes in a first-to-last order.

Processors (and people) are not consistent in what decision they make. Some, like x86, x86-64, many file standards, addition, and Arabic numerals in the original Arabic, put the low-order byte first and are called little-endian. Others, like MIPS, PowerPC, many network standards, numerical comparison, and Arabic numerals in most European languages, put the low-order byte last and are called big-endian. A few, like ARM and students in classes like this, understand both and can be configured to act in either a big-endian or little-endian mode.

In English (and most other languages today) we use base-10 and Arabic numerals; thus thirty-five is written 35. The high-order digit is on the left, the low-order digit is on the right. When Arabic numbers are placed inside of English text it looks big-endian because English is read left-to-right. However, the same number inside of Arabic text looks little-endian because Arabic is read right-to-left.

It is important to note that endianness applies only when breaking a primitive value into pieces and putting those pieces in some sequential order. Endianness does not apply inside the processor, where numbers are stored in whole without an arbitrary order. Endianness also does not apply to naturally-ordered values like the elements of a list, as these are placed in their natural order on both big- and little-endian computers.

Consider an array (a list stored contiguously in memory) containing two 16-bit numbers, [0x1234, 0x5678]. If this array were placed in memory at address 0x108, we’d find the following in memory:

address little-endian big-endian
108 34 12
109 12 34
10a 78 56
10b 56 78

Observe that the order of bytes within each number changes based on endianness, but that all of the following are unimpacted by endianness:

4 Intentional redundancy

Although digital systems are much less sensitive to noise than are analog systems, sometimes noise still does impact the signal, and when it does the consequences are potentially quite large, particularly if the noise flips a high-order bit. In part because of this, it is common to include intentional redundancy in digital signals. By including information that could have been fully determined without its presence, it becomes possible to compare that information with what would normally have been generated and use that to see if the information has arrived in its intended form.

4.1 Detection versus correction

The simplest form of redundancy to design allows you to detect if a group of bits has been modified in small ways. The small ways caveat is important: there is no form of redundancy that is proof against large changes to the information, as the redundant data itself is encoded in bits and if those bits are changed to match the changes to other bits, the data will look consistent.

A single extra bit can be enough to detect that a bit got flipped due to noise. To be able to automatically correct that flipped bit in an n-bit signal requires at least log2(n) extra bits, as anything less than that has insufficient information content to identify which bit needs correction.

Measure the detection and correction ability of redundancy in a human language of your choice.

  1. Find a large corpus of example text (the larger the better)
  2. Write a program that repeatedly
    1. selects a random substring from the corpus, long enough to give a little context (perhaps 2 dozen words)
    2. randomly either (a) leave it alone or (b) change one word in the central half of what you show
    3. prompt the user to identify the text as either unchanged or changed, and if changed to identify what word was changed and what it was supposed to be
    4. track the number of correct and incorrect identifications and corrections

While correction may seem desirable, often the easier detection criteria is sufficient because in many cases the recipient of detected-bad data can simply request the sender to send it again, repeating as needed until a good copy is obtained.

4.2 Parity, checksums, error-correction codes, and digests

There are many different techniques used to add structured redundancy to sets of bits. This section describes just a few.

Parity

The parity of a set of bits is 0 if an even number of the bits are 1s and 1 if an odd number of the bits are 1. Adding a parity bit (sometimes called a check bit) to a number is a simple way to make single-bit errors detectable.

The following table shows 1-byte values and their check bit. Identify which ones have a parity error.

Byte Check Has parity error
00000000 0
01001101 1
11111000 0
10101010 0
00011100 1
01000010 1
11111111 1
Answers: error, correct, correct, error, error
Multiple parity

If you store the parity of multiple groups of bits, you can use the intersections of the groups that have errors to determine which bit had the wrong value and correct it. For example, given 11001010 we could compute the following parity bits:

Partity of is
11001010 0
1100     0
11  10   1
1 0 1 1  1

Changing any input bit will change a unique set of parity bits; for example, changing the second-most-significant bit will change the first three parity bits but not the fourth.

This is an example of how one could begin to design a correcting code, but it is not fully usable as is because an error in the parity bits themselves cannot be unambiguously identified and corrected. Full correctable redundancy is more complicated to design; common designs include convolution codes (based on Boolean polynomials) and Hamming codes (based on parity bits, but selected more carefully than the example above).

CRC

Many digital formats make use of a family of error detection algorithms collectively known as Cyclic Redundancy Checks or CRC. The IEEE has standardized (in IEEE 802-3) one, commonly called CRC32, that is used in common formats such as the PNG standard.

CRC’s are cyclic in the sense that they can be applied to any length input by repeating a simple procedure. This makes them one type of digest: an algorithm that accepts long inputs and produces small summary outputs. CRCs are similar to the other most common form of digest (called a hash), but are generally optimized for speed and for detecting bit-level changes. Other digests are used in security or memory allocation contexts and have different design goals.

Checksum

Checksum is a generic term for any easily-defined, easily-computed value that depends on a larger message. Parity, CRC, hashes, and many others are all examples of checksums. The word itself suggests another simple approach: treat each byte as a number and sum them all up.

There is not just one kind of checksum. Often when you see documentation refer to a checksum they either mean that they have a custom way of producing the error-detecting redundancy or that the specific method used is unimportant in the context of the documented features.

Although error detection and correction through redundant data is important for the success of all kinds of digital information, and commonly implemented in everything from computer memory chips to archive file formats, it is relatively uncommon for a programmer to need to think about checksums in detail. It is usually enough to know that messages may contain some error-checking information and that recipients may inform providers that a message arrived in a corrupted state.