This is an archived copy of a previous semester's site.
Please see the current semester's site.
Don’t write your own security code!
Secure implementations of security protocols involve many techniques we have not and will not cover in this class, including:
if
than else
or the like)This page’s goal is that you understand the idea behind these methods, not that you can build them yourself.
A single function where
Two functions, and and one key where
plaintext(not-encrypted message) of any length
ciphertext(encrypted message) of similar length to
shared key, a fixed-size bitvector known to only the communicating parties
Two functions, and and a pair of keys, public key
and private key
, where
Many security activities depend at some point on a random number being used. These need to be random in the sense that there is no ability to guess them, even if you know a great deal about the code that created them.
Most software random number generators
are actually pseudo-random number generators, meaning they are created by a deterministic subroutine with some internal state; if you know the value of that state, you can predict the sequence of random
numbers perfectly, and observing a sequence of generated numbers is sufficient to re-construct the state.
Cryptographically-secure random number generators instead rely on some form of entropy harvesting to collect unpredictable bits from sources invisible to outsiders. For example, as I type this page the low-order bit of the microsecond at which I press each key may appear to be pure entropy
—that is, without pattern or meaning. We could likewise harvest the low-order bits of each mouse movement, CPU temperature reading, etc. We don’t actually know that such measurements are random; there might be a pattern we haven’t noticed, and if so even secure algorithms may be breakable by attackers that know those patterns. However, it’s almost certainly more secure than using something entirely predictable like the output of a pseudo-random generator seeded with the time of day.
Modern operating systems typically harvest likely sources of entropy and maintain an entropy pool. Since entropy cannot be generated on demand, there is always risk that the entropy pool will not contain enough bytes for a needed cryptographic purpose.
Each operating system has its own way of providing access to its entropy pool. These are accessible in Python through the secrets
library. For example,
print(secrets.token_hex(4))
prints an unpredictable 4-byte value such as
f9bf789a
A hash function takes an input of any size (often generalized as a byte stream) and returns an output of a fixed size (often generalized as a large binary number) such that changing any byte of the input would change the output too.
{.aside} The word hash
is used to refer to (a) the output of this function, (b) the function itself, and (c) the action of running this function. It is grammatically correct to say I hashed it with my favorite hash to get the following hash:
f4ca408dc68ca2776e5d294e8f1166b56ee5af9b
.
A crytographically-secure hash function is not feasibly inevitable. That is, given the output of the hash function and the code of the function itself, finding an input that would generate that output requires work equivalent to guessing random inputs until one happens to have the desired output.
It is very difficult to prove that a hash function is crytographically secure, and we have a long history of compromising one hash function (e.g., by figuring out how to rule out 99% of all inputs so we only have to guess 1% as many) and adopting another.
You’ve written insecure hashes in the past. Python has several secure ones in the library hashlib
. For example,
print(hashlib.sha256(b'cs340').hexdigest())
print(hashlib.sha256(b'cs 340').hexdigest())
prints
b82754edf4ae29f1f3b3d98bed396722a62038664956f8b6e828d5eef422e09e
041cd9e2bd09e6ecf6eaeec7423df42b85ce2fd289746b5304da73a34dd70058
A symmetric cipher or symmetric-key cipher is a pair of functions, one to encrypt and one to decrypt, each taking two inputs: a message of arbitrary length and a limited-size secret key. They should have the following properties:
Getting the first two properties is easy; getting the third is hard, contributing to a history of finding a weakness in one cipher and picking another to replace it.
No symmetric ciphers are available in the Python standard library.
A very insecure example of the idea is
def bad_encode(m,k):
return bytes((b+k)&0xFF for b in m)
def bad_decode(m,k):
return bytes((b-k)&0xFF for b in m)
print(bad_encode(b"this is a test", 3))
print(bad_decode(bad_encode(b"this is a test", 3), 3))
which prints
b'wklv#lv#d#whvw'
b'this is a test'
An asymmetric cipher or public-key cipher1 Note that public-key
is an overloaded term; for example, Diffie-Hellman does not use the type of asymmetric cipher described here, but is asymmetric in a different way and is also sometimes called a public-key
protocol. is a pair of functions that can each both encrypt and decrypt, but only bounded-size inputs. Keys come in pairs. If is an asymmetric cipher and
is a key pair then
The cipher should also follow the security property:
In practice, only a limited set of asymmetric ciphers are known and they do not offer quite as good security as no better than brute force,
but they do have (near) proofs of difficulty to break if
P ≠ NPand
factorization ∉ P)
Unlike symmetric ciphers, of which there are many, only a few asymmetric ciphers have ever been popular. RSA2 RSA is short for Rivest-Shamir-Adleman, named after Ron Rivest, Adi Shamir, and Len Adleman, its inventors. It was previously invented by Clifford Cocks, but Cocks’ document describing it was given a top-secret classification by the British government and not made public until 1997. is the most common by far.
One of the best-known asymmetric ciphers is RSA, which uses modular exponentiation to operate. An insecure implementation of modular exponentiation (susceptible to timing attacks) is part of the Python built-ins in the pow(base, exponent, modulus)
function.
def pub_f(m,pubkey):
return pow(m, 0x10001, pubkey)
def priv_f(m, privkey, pubkey):
return pow(m, privkey, pubkey)
= 0x839468dfd8e096a36c2af0176ff62a98912a0d412eb40c75ce11de714c27689cb
pubkey = 0x35d1e215d19987b43fbdce67dfc97281bfdd2eaa938ae386cf1eb744052a1ec21
privkey = int.from_bytes(b'this is a test', 'little')
plain
= pub_f(plain, pubkey)
cipher_pub = priv_f(cipher_pub, privkey, pubkey)
decipher_priv print(hex(cipher_pub))
print(decipher_priv.to_bytes(14, 'little'))
= priv_f(plain, privkey, pubkey)
cipher_priv = pub_f(cipher_priv, pubkey)
decipher_pub print(hex(cipher_priv))
print(decipher_pub.to_bytes(14, 'little'))
prints
0x600969b36a9593c85ba14cd3b1b5ff4cf105d9cc50ec9eb0b9f336bfe61bca94c
b'this is a test'
0x456cd39d4ef8abc9e2e724e20266647ef2cfd8efcdaa286f63a1bc7a1ab1b3038
b'this is a test'
The Diffie-Hellman key exchange3 There is some difference of opinion about the justice of this name. Whitfield Diffie and Martin Hellman published a paper describing it in 1976, but Hellman later stated the idea in the paper was that of their graduate student Ralph Merkle. is a method of causing two (or more) parties to agree on a single shared random integer without anyone else listening in being able to know what number they agreed upon.
The process requires identifying a cyclic group
– that is, a set of values (integers are preferred) and an operator on elements of that set such that . To be secure, the set should be large and the operation hard to undo (i.e., knowing both and should not make it easy to determine
).
Modular exponentiation provides a cyclic group with the desired properties if a given modulus and base are chosen. An insecure implementation is available with the Python pow
function.
If two people want to create a shared secret, they
person A knows | shared publicly | person B knows |
---|---|---|
, | ||
, | ||
There are enough steps that an example is more distracting than helpful in this page. Wikipedia has a detailed example.
Key exchange can be used to implement many other security primitives, but doing so is generally nontrivial. Its simplest practical use is to establish a key for use in a [symmetric cipher].
Have them encrypt a hash of the document with their private key; the encrypted hash is called their signature.
To check the signature,
With the usual caveat about the insecurity of pow
, an RSA-based signature looks like:
def sign(msg,priv,pub):
assert n >= 1<<256, "Can only sign if the key is larger than a hash"
= int.from_bytes(hashlib.sha256(msg).digest(), 'big')
h return (msg, pow(h,priv,n))
def check_signature(msg, sig, pub):
assert n >= 1<<256, "Can only sign if the key is larger than a hash"
= int.from_bytes(hashlib.sha256(msg).digest(), 'big')
h = pow(sig,0x10001,pub)
e return h == e
this person’s public key is.
it’s really me: your number was
The internet today uses a few well-known certificate authorities who build entire businesses on validating website identity and signing certificates.
A key insight about digital certificates is that presenting the certificate is just half of the authentication process; it must be followed with a challenge to verify you can use your private key, and that private key is never given away. It’s a bit like (but much more secure than) an ID card with a picture of you on it: you have to have your card (which others could steal) and your face (which you never intentionally give to anyone) to use it.
There are alternative ways of establishing trust without a centralized set of certificate authorities, but the approach described here is dominant at the time of writing4 2024.
Store only a hash of the password.
Note that a simple implementation of this makes possible exploits like rainbow tables where the hashes of a large number of plausible passwords are precomputed to compare against the hashes stored on a computer. One partial solution to this is to salt the password before hashing: we generate a random value, store it with the password, and include it and the password in the hash.
Passwords on the internet are much less secure than this. We hope servers store only hashes (though verifying this is hard), but we still have to get the password to them to hash so they typically are transmitted in their raw form. Thus, if your browser is remembering your passwords, it cannot be remembering only the hashes because it needs to enter the actual passwords into web forms for you. However, if you are good about using a different password for each site, browser storage of them can be more secure than typing them manually as it prevents you accidentally providing one site with another site’s password.
A better approach to web passwords is to use a password manager. This still has to store your passwords, but it can encrypt them using a symmetric key derived from your master
password, which it stores only as a hash, so that compromise of the password manager has limited loss. It can also generate random virtually-unguessable passwords for each site and continue to provide access to your passwords even if you are not on your usual browser.
Since communication has patterns that can be used to bypass the security of public key cryptography, we need to establish a shared key for use with a symmetric cipher. Such a key is generally little more than a random number, so we can use Diffie-Hellman to generate it.
Of course, first we need to agree on which symmetric cipher to use, and possibly share certificates to make sure we are who we say we are; and Diffie-Hellman involves several communication steps itself, so establishing this communication requires a multi-step back-and-forth protocol. One popular protocol for this is called TLS.
Create an indelible ledge of past events, a linked list you can add to but not remove from. A node in such a list is a simple pair: (new entry, hash of previous node). Any computer can propose a new node to add to the list.
To create agreement, add a protocol for deciding which of a set of competing nodes (i.e. nodes claiming the same node as their previous
node) wins. There are many variants of this step, but commonly they include
the longest chain wins.
whoever invested the most electricity winsor
whoever has the biggest account on the chain wins.
identity provider) handle authenticating users and another system (a
service provider) use that authentication without repeating it.
There are a surprising variety of competing protocols here, but the common components are
act on behalf of this userkey the service provide can use to talk directly to the identity provider.
The server uses a certificate to establish its identity and provide its public key ().
The server and client use key exchange to create a temporary key () for a symmetric cipher.
The client sends a different key () encrypted by the server’s public key () over the symmetric cipher channel. Only the other machine with the temporary key can read the message and only the server can understand it, meaning the client now has confidence it is talking to the server and only the server when it uses .
The server asks the client to authenticate by sharing some secret, such as a password. Because the client knows it is talking to the server, no one else, it is safe to share that password with it. Because only the client knows the password, the server can use it to identify the client.
Having a secure protocol and entropy source is not sufficient to have a secure system. Humans are often a weak point, picking bad passwords and sharing them too freely. However, side channels are also a major possible weakness.
A side channel is something that can carry information, but was not the communication channel the designer of the system had in mind. There are many side channels and may side channel attacks, but one example will suffice to show the complexity involved in anticipating and preventing them.
Suppose I want to learn the randomly-generated private key you used in Diffie-Hellman. I look through your code and consult the specs of your processor and notice that it takes your code 2 microseconds to process each 0-bit and 3 microseconds to process each 1-bit. There are a few other aspects that are hard for me to time perfectly, like how long it takes me to see your messages, but after watching a few hundred DH key exchanges I get a reasonable estimate on those too.
Now suppose for a particular 32-bit DH session I time your code and determine you have 13 1-bits in your key. To brute-force guess your key, I only need to check the 347,373,600 32-bit numbers that have 13 1-bits, not all 4,294,967,296; a savings of approximately 12×.
If I can add another side channel I might be able to reduce this space further. For example, maybe I learn that the power your CPU draws when processing a 1 bit is higher if the bit was preceded by a 0 than if it was preceded by a 15 This may sound strange, but modern processors try to guess what’s going to happen next to increase CPU performance, which can have this kind of side-effect.. Now I know not just how many 1 bits there are, but how clustered they are as well, further reducing the search space.