This is an archived copy of a previous semester's site.

Please see the current semester's site.

Authentication

Don’t write your own security code!

Secure implementations of security protocols involve many techniques we have not and will not cover in this class, including:

This page’s goal is that you understand the idea behind these methods, not that you can build them yourself.

1 Building Blocks

Summary
Cryptographically-secure random number generator
A single operator with no input that generates completely unpredictable bits.
Cryptographically-secure hash

A single function h(m)=Hh(m) = H where

  • mm (the message) is an arbitrary sequence of bytes
  • HH (the hash) is a bitvector of a fixed size determined when hh is designed
  • Given an HH, finding an mm that generates it is very hard.
Symmetric cipher

Two functions, e(p,k)=ce(p,k) = c and d(c,k)=pd(c,k) = p and one key kk where

  • pp is a plaintext (not-encrypted message) of any length
  • cc is a ciphertext (encrypted message) of similar length to pp
  • kk is a shared key, a fixed-size bitvector known to only the communicating parties
  • finding pp from cc without knowing kk is very hard.
Asymmetric cipher

Two functions, u(m,ku)u(m,k_u) and v(m,kv)v(m,k_v) and a pair of keys, public key kuk_u and private key kvk_v, where

  • mm is a fixed-size bitvectors, with a size bounded by the size of the two keys
  • u(m1,ku)=m2    v(m2,kv)=m1u(m_1,k_u) = m_2 \;\Leftrightarrow\; v(m_2,k_v) = m_1
  • finding kvk_v given kuk_u is very hard.
Key Exchange
A protocol by which two agents can speak in public and end up both knowing the same random number without anyone else knowing what random number it is.

1.1 Random numbers

Many security activities depend at some point on a random number being used. These need to be random in the sense that there is no ability to guess them, even if you know a great deal about the code that created them.

Most software random number generators are actually pseudo-random number generators, meaning they are created by a deterministic subroutine with some internal state; if you know the value of that state, you can predict the sequence of random numbers perfectly, and observing a sequence of generated numbers is sufficient to re-construct the state.

Cryptographically-secure random number generators instead rely on some form of entropy harvesting to collect unpredictable bits from sources invisible to outsiders. For example, as I type this page the low-order bit of the microsecond at which I press each key may appear to be pure entropy—that is, without pattern or meaning. We could likewise harvest the low-order bits of each mouse movement, CPU temperature reading, etc. We don’t actually know that such measurements are random; there might be a pattern we haven’t noticed, and if so even secure algorithms may be breakable by attackers that know those patterns. However, it’s almost certainly more secure than using something entirely predictable like the output of a pseudo-random generator seeded with the time of day.

Modern operating systems typically harvest likely sources of entropy and maintain an entropy pool. Since entropy cannot be generated on demand, there is always risk that the entropy pool will not contain enough bytes for a needed cryptographic purpose.

Each operating system has its own way of providing access to its entropy pool. These are accessible in Python through the secrets library. For example,

print(secrets.token_hex(4))

prints an unpredictable 4-byte value such as

f9bf789a

1.2 Hashes

A hash function takes an input of any size (often generalized as a byte stream) and returns an output of a fixed size (often generalized as a large binary number) such that changing any byte of the input would change the output too.

{.aside} The word hash is used to refer to (a) the output of this function, (b) the function itself, and (c) the action of running this function. It is grammatically correct to say I hashed it with my favorite hash to get the following hash: f4ca408dc68ca2776e5d294e8f1166b56ee5af9b.

A crytographically-secure hash function is not feasibly inevitable. That is, given the output of the hash function and the code of the function itself, finding an input that would generate that output requires work equivalent to guessing random inputs until one happens to have the desired output.

It is very difficult to prove that a hash function is crytographically secure, and we have a long history of compromising one hash function (e.g., by figuring out how to rule out 99% of all inputs so we only have to guess 1% as many) and adopting another.

You’ve written insecure hashes in the past. Python has several secure ones in the library hashlib. For example,

print(hashlib.sha256(b'cs340').hexdigest())
print(hashlib.sha256(b'cs 340').hexdigest())

prints

b82754edf4ae29f1f3b3d98bed396722a62038664956f8b6e828d5eef422e09e
041cd9e2bd09e6ecf6eaeec7423df42b85ce2fd289746b5304da73a34dd70058

1.3 Symmetric ciphers

A symmetric cipher or symmetric-key cipher is a pair of functions, one to encrypt and one to decrypt, each taking two inputs: a message of arbitrary length and a limited-size secret key. They should have the following properties:

Getting the first two properties is easy; getting the third is hard, contributing to a history of finding a weakness in one cipher and picking another to replace it.

No symmetric ciphers are available in the Python standard library.

A very insecure example of the idea is

def bad_encode(m,k):
    return bytes((b+k)&0xFF for b in m)
def bad_decode(m,k):
    return bytes((b-k)&0xFF for b in m)

print(bad_encode(b"this is a test", 3))
print(bad_decode(bad_encode(b"this is a test", 3), 3))

which prints

b'wklv#lv#d#whvw'
b'this is a test'

1.4 Asymmetric ciphers

An asymmetric cipher or public-key cipher1 Note that public-key is an overloaded term; for example, Diffie-Hellman does not use the type of asymmetric cipher described here, but is asymmetric in a different way and is also sometimes called a public-key protocol. is a pair of functions that can each both encrypt and decrypt, but only bounded-size inputs. Keys come in pairs. If (f1,f2)(f_1, f_2) is an asymmetric cipher and (k1,k2)(k_1, k_2) is a key pair then

The cipher should also follow the security property:

In practice, only a limited set of asymmetric ciphers are known and they do not offer quite as good security as no better than brute force, but they do have (near) proofs of difficulty to break if

Unlike symmetric ciphers, of which there are many, only a few asymmetric ciphers have ever been popular. RSA2 RSA is short for Rivest-Shamir-Adleman, named after Ron Rivest, Adi Shamir, and Len Adleman, its inventors. It was previously invented by Clifford Cocks, but Cocks’ document describing it was given a top-secret classification by the British government and not made public until 1997. is the most common by far.

One of the best-known asymmetric ciphers is RSA, which uses modular exponentiation to operate. An insecure implementation of modular exponentiation (susceptible to timing attacks) is part of the Python built-ins in the pow(base, exponent, modulus) function.

def pub_f(m,pubkey):
    return pow(m, 0x10001, pubkey)
def priv_f(m, privkey, pubkey):
    return pow(m, privkey, pubkey)
    
pubkey  = 0x839468dfd8e096a36c2af0176ff62a98912a0d412eb40c75ce11de714c27689cb
privkey = 0x35d1e215d19987b43fbdce67dfc97281bfdd2eaa938ae386cf1eb744052a1ec21
plain = int.from_bytes(b'this is a test', 'little')

cipher_pub = pub_f(plain, pubkey)
decipher_priv = priv_f(cipher_pub, privkey, pubkey)
print(hex(cipher_pub))
print(decipher_priv.to_bytes(14, 'little'))

cipher_priv = priv_f(plain, privkey, pubkey)
decipher_pub = pub_f(cipher_priv, pubkey)
print(hex(cipher_priv))
print(decipher_pub.to_bytes(14, 'little'))

prints

0x600969b36a9593c85ba14cd3b1b5ff4cf105d9cc50ec9eb0b9f336bfe61bca94c
b'this is a test'
0x456cd39d4ef8abc9e2e724e20266647ef2cfd8efcdaa286f63a1bc7a1ab1b3038
b'this is a test'

1.5 Key exchange

The Diffie-Hellman key exchange3 There is some difference of opinion about the justice of this name. Whitfield Diffie and Martin Hellman published a paper describing it in 1976, but Hellman later stated the idea in the paper was that of their graduate student Ralph Merkle. is a method of causing two (or more) parties to agree on a single shared random integer without anyone else listening in being able to know what number they agreed upon.

The process requires identifying a cyclic group – that is, a set of values (integers are preferred) and an operator on elements of that set such that op(op(x,y),z)=op(op(x,z),y)op(op(x, y), z) = op(op(x, z), y). To be secure, the set should be large and the operation hard to undo (i.e., knowing both xx and op(x,y)op(x,y) should not make it easy to determine yy).

Modular exponentiation provides a cyclic group with the desired properties if a given modulus and base are chosen. An insecure implementation is available with the Python pow function.

If two people want to create a shared secret, they

  1. Agree (in public) on an initial random number gg
  2. Each (privately, without telling anyone at all) pick a second random number, their private key
  3. Each applies the operator to gg and their key, sharing the result (in public) with the others
  4. Each applies the operator to what they received and their key to attain their shared secret
person A knows shared publicly person B knows
ff, gg
aa bb
f(g,a)f(g, a), f(g,b)f(g, b)
f(f(g,b),a)f(f(g,b), a) f(f(g,a),b)f(f(g,a), b)

There are enough steps that an example is more distracting than helpful in this page. Wikipedia has a detailed example.

Key exchange can be used to implement many other security primitives, but doing so is generally nontrivial. Its simplest practical use is to establish a key for use in a [symmetric cipher].

2 Using the tools

2.1 Signatures

Task
Verify that a document has been approved in its current form by a particular individual.
Solution

Have them encrypt a hash of the document with their private key; the encrypted hash is called their signature.

To check the signature,

  1. get their public key (possibly with a certificate if their key is not one you already know)
  2. use it to decrypt the signature to get some value xx
  3. hash the document yourself to get another value yy
  4. if x=yx = y, you have confidence that they signed the document

With the usual caveat about the insecurity of pow, an RSA-based signature looks like:

def sign(msg,priv,pub):
  assert n >= 1<<256, "Can only sign if the key is larger than a hash"
  h = int.from_bytes(hashlib.sha256(msg).digest(), 'big')
  return (msg, pow(h,priv,n))

def check_signature(msg, sig, pub):
  assert n >= 1<<256, "Can only sign if the key is larger than a hash"
  h = int.from_bytes(hashlib.sha256(msg).digest(), 'big')
  e = pow(sig,0x10001,pub)
  return h == e

2.2 Should I trust you?

Task
Convince someone you don’t know that you are who you say you are.
Solution
  1. Have someone who knows you both sign a document saying this person’s public key is xx.
  2. Give that document, called a certificate, to the skeptic who doubts your identity.
  3. The skeptic then generates a random number yy, encrypts it with your public key, and tells you the encrypted result zz
  4. You decrypt zz with your private key and tell the skeptic it’s really me: your number was yy

The internet today uses a few well-known certificate authorities who build entire businesses on validating website identity and signing certificates.

A key insight about digital certificates is that presenting the certificate is just half of the authentication process; it must be followed with a challenge to verify you can use your private key, and that private key is never given away. It’s a bit like (but much more secure than) an ID card with a picture of you on it: you have to have your card (which others could steal) and your face (which you never intentionally give to anyone) to use it.

There are alternative ways of establishing trust without a centralized set of certificate authorities, but the approach described here is dominant at the time of writing4 2024.

2.3 Storing passwords

Task
I want the computer to recognize my password, but even the root account not to be able to learn what my password is.
Solution

Store only a hash of the password.

Note that a simple implementation of this makes possible exploits like rainbow tables where the hashes of a large number of plausible passwords are precomputed to compare against the hashes stored on a computer. One partial solution to this is to salt the password before hashing: we generate a random value, store it with the password, and include it and the password in the hash.

Passwords on the internet are much less secure than this. We hope servers store only hashes (though verifying this is hard), but we still have to get the password to them to hash so they typically are transmitted in their raw form. Thus, if your browser is remembering your passwords, it cannot be remembering only the hashes because it needs to enter the actual passwords into web forms for you. However, if you are good about using a different password for each site, browser storage of them can be more secure than typing them manually as it prevents you accidentally providing one site with another site’s password.

A better approach to web passwords is to use a password manager. This still has to store your passwords, but it can encrypt them using a symmetric key derived from your master password, which it stores only as a hash, so that compromise of the password manager has limited loss. It can also generate random virtually-unguessable passwords for each site and continue to provide access to your passwords even if you are not on your usual browser.

2.4 Let’s talk privately

Task
Initialize private communication with a new acquaintance in a public place.
Solution

Since communication has patterns that can be used to bypass the security of public key cryptography, we need to establish a shared key for use with a symmetric cipher. Such a key is generally little more than a random number, so we can use Diffie-Hellman to generate it.

Of course, first we need to agree on which symmetric cipher to use, and possibly share certificates to make sure we are who we say we are; and Diffie-Hellman involves several communication steps itself, so establishing this communication requires a multi-step back-and-forth protocol. One popular protocol for this is called TLS.

2.5 Blockchain

Task
Have a group of mutually-mistrustful computers agree on one sequence of events.
Solution

Create an indelible ledge of past events, a linked list you can add to but not remove from. A node in such a list is a simple pair: (new entry, hash of previous node). Any computer can propose a new node to add to the list.

To create agreement, add a protocol for deciding which of a set of competing nodes (i.e. nodes claiming the same node as their previous node) wins. There are many variants of this step, but commonly they include

  • A progress-based rule like the longest chain wins.
  • A competitive rule such whoever invested the most electricity wins or whoever has the biggest account on the chain wins.
  • User identification via signatures to manage the competition.

2.6 Single Sign-On

Task
Have one system (an identity provider) handle authenticating users and another system (a service provider) use that authentication without repeating it.
Solution

There are a surprising variety of competing protocols here, but the common components are

  1. The service provider tells the user agent a list of identity providers it trusts
  2. The user agent picks one of those identity providers, authenticates with them, and asks them for a certificate asserting not their public key but rather the identity information the service provide wanted
  3. The user agent sends that certificate to the service provider
  4. In some protocols (Oauth2 for example) the certificate includes an act on behalf of this user key the service provide can use to talk directly to the identity provider.

2.7 Log in

Task
Send a server my log-in information ensuring only the server and no one else knows it.
Solution

The server uses a certificate to establish its identity and provide its public key (kserverk_{\text{server}}).

The server and client use key exchange to create a temporary key (ktempk_{\text{temp}}) for a symmetric cipher.

The client sends a different key (ksharedk_{\text{shared}}) encrypted by the server’s public key (e(kshared,kserver)e(k_{\text{shared}}, k_{\text{server}})) over the symmetric cipher channel. Only the other machine with the temporary key can read the message and only the server can understand it, meaning the client now has confidence it is talking to the server and only the server when it uses ksharedk_{\text{shared}}.

The server asks the client to authenticate by sharing some secret, such as a password. Because the client knows it is talking to the server, no one else, it is safe to share that password with it. Because only the client knows the password, the server can use it to identify the client.

3 Side channels

Having a secure protocol and entropy source is not sufficient to have a secure system. Humans are often a weak point, picking bad passwords and sharing them too freely. However, side channels are also a major possible weakness.

A side channel is something that can carry information, but was not the communication channel the designer of the system had in mind. There are many side channels and may side channel attacks, but one example will suffice to show the complexity involved in anticipating and preventing them.

Suppose I want to learn the randomly-generated private key you used in Diffie-Hellman. I look through your code and consult the specs of your processor and notice that it takes your code 2 microseconds to process each 0-bit and 3 microseconds to process each 1-bit. There are a few other aspects that are hard for me to time perfectly, like how long it takes me to see your messages, but after watching a few hundred DH key exchanges I get a reasonable estimate on those too.

Now suppose for a particular 32-bit DH session I time your code and determine you have 13 1-bits in your key. To brute-force guess your key, I only need to check the 347,373,600 32-bit numbers that have 13 1-bits, not all 4,294,967,296; a savings of approximately 12×.

If I can add another side channel I might be able to reduce this space further. For example, maybe I learn that the power your CPU draws when processing a 1 bit is higher if the bit was preceded by a 0 than if it was preceded by a 15 This may sound strange, but modern processors try to guess what’s going to happen next to increase CPU performance, which can have this kind of side-effect.. Now I know not just how many 1 bits there are, but how clustered they are as well, further reducing the search space.