Thread Safety

1 Things to Avoid

A function or data structure is thread safe if it behaves as expected when accessed concurrently by multiple threads. Generally, this means avoiding each of the problems listed below.

1.1 Implicit State

The strtok function’s official description includes this:

On the first call to strtok(), the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str must be NULL.

This suggests that strtok stores the argument from its first call in some kind of hidden internal state and uses that internal state to handle subsequent calls.

What happens if two different threads each call strtok with different initial strings, then call it repeatedly with NULL to parse the rest of those strings? They end up sharing that same hidden state: once the second thread calls it, the first string is lost and the tokens of the second string are split between the two threads in whatever order the strtok calls happen to arrive.

A function is called reentrant (or, less precisely, thread safe) if it has no such hidden state. For example, strtok_r is a reentrant version of strtok that requires the caller to pass in a pointer to a state variable instead of using a hidden global variable to keep track of which string is being tokenized and how far that tokenization has proceeded.
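The same flaw and the same fix can be sketched in Python rather than C (the function names tok and tok_r, the use of str.partition, and the one-element-list state holder are all invented for this illustration; they are not the library functions themselves):

# A hidden-state tokenizer in the style of strtok: the string being parsed
# lives in a module-level variable that every thread shares.
_remaining = None

def tok(text, sep=' '):
    global _remaining
    if text is not None:
        _remaining = text                  # first call: stash the string in hidden state
    if not _remaining:
        return None                        # nothing left to parse
    piece, _, _remaining = _remaining.partition(sep)
    return piece

# A reentrant tokenizer in the style of strtok_r: the caller passes in and
# owns the state, so concurrent tokenizations cannot interfere.
def tok_r(text, state, sep=' '):
    if text is not None:
        state[0] = text                    # state is a one-element list owned by the caller
    if not state[0]:
        return None
    piece, _, state[0] = state[0].partition(sep)
    return piece

state = ['']
print(tok_r('to be or not', state))        # prints: to
print(tok_r(None, state))                  # prints: be

Two threads that each call tok with different strings and then with None corrupt each other’s progress exactly as described above; two threads using tok_r with separate state lists cannot interfere.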

1.2 Data Race

In some cases, having one piece of state shared by several threads is the intent. Perhaps threads are communicating by reading and writing values to shared variables, or are working together to count how many of something they find, or are all accessing and updating a shared dictionary or mapping.

A data race (the most common type of race condition, and sometimes simply called a race condition) occurs when one thread updates a variable while another thread is using it.

Suppose you are copying my name from the board into your notes. I wrote it as Dr. Tychonievich and you’ve copied over the first eight characters, Dr. Tych, when I change it to Prof. Luther; you then finish copying the remaining letters, getting Dr. Tychther due to a data race between my changing the board and your copying it letter-by-letter into your notes.
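The same hazard appears whenever an update takes more than one step. In the sketch below, count += 1 is really three steps (read the value, add one, write it back), so two threads can both read the same value and one increment is lost. How often you actually observe lost updates depends on the interpreter and its scheduling (recent CPython versions switch threads in ways that can hide this particular race), but the code is racy regardless:

from threading import Thread

count = 0

def bump():
    global count
    for _ in range(100_000):
        count += 1    # not atomic: read count, add one, write count back

ts = [Thread(target=bump), Thread(target=bump)]
for t in ts: t.start()
for t in ts: t.join()
print('expected 200000, got', count)   # may print less than 200000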

1.3 Deadlock

We avoid data races by synchronizing access to shared data, meaning only one thread accesses it at a time, forcing other threads to wait their turn. If we design this poorly we can get deadlock where each thread is waiting for something that can’t happen until a different thread stops waiting.

Let’s consider deadlock created by the Lock synchronization primitive. Imagine drawing a graph with a directed edge from each lock a thread holds to the lock that thread is waiting to acquire. Deadlock has occurred if and only if that graph contains a cycle.

For fancier synchronization primitives the definition of the graph is a bit more involved, but the same idea holds: deadlock occurs when there’s a cycle in the graph of the waiting-for relation. We call this condition circular wait, and it depends on three other, simpler conditions:

  • mutual exclusion: only one thread can hold a given resource at a time
  • hold and wait: a thread holds one resource while waiting for another
  • no preemption: no thread can force another to give up a resource it holds

Together these four conditions are known as the Coffman Conditions after Edward Coffman Jr., who published them in 1971.

High-value exchange

Many action movies have a scene where person M has big-case-of-money, person W has super-valuable-widget, and both of them want to exchange the two but both think the other will try to steal both. Thus both run a process like

  1. acquire other person’s thing
  2. release my thing to other person

… causing a deadlock: each has one resource and is waiting on the other resource before proceeding.

Movies having an imperative to keep moving, these scenes generally resolve in a few seconds with something that moves the plot forward, but without such plot pressure they could conceivably stay blocked forever.

Code implementing High Value Exchange
"""
An example of deadlock created by a buyer and seller each insisting that they get what they want
(the item or money, respectively) before giving up what they have (the money or item, respectively).
Uses a Barrier as a thread-setup step (subclassing Thread and using __init__ for setup would also work).

Like all deadlock, this meets all four Coffman conditions:

1. mutual exclusion, because only one can have each asset at a time

2. hold and wait, because they hold their asset while waiting to get the other asset

3. no preemption, because neither can force the other to give up what they have

4. circular wait, because the hold and wait graph looks like this:

    [buyer] — waiting on → [item]
       ↑                      |
    held by                held by
       |                      ↓
    [money] ← waiting on — [seller]

   or, if we use locks as nodes and threads holding one while waiting for another as edges,

           buyer
           ————→
    [money]     [item]
           ←————
           seller
"""

from threading import Thread, Lock, Barrier

b = Barrier(2)
item, money = Lock(), Lock()

def buyer(item, money, b):
    money.acquire() # setup step: the buyer starts with money
    print('Buyer has money, ready for exchange')
    b.wait() # don't proceed until everyone is set up
    print('Buyer: "Give me the item first and I promise to pay you"')
    print('(buyer waiting for item)')
    item.acquire()
    print('(buyer has item)')
    print('Buyer: "Thanks for the item; here is the money"')
    money.release()
    print('(buyer no longer has money)')

def seller(item, money, b):
    item.acquire() # setup step: the seller starts with the item
    print('Seller has item, ready for exchange')
    b.wait() # don't proceed until everyone is set up
    print('Seller: "Pay me first and I promise to give you the item"')
    print('(seller waiting for money)')
    money.acquire()
    print('(seller has been paid)')
    print('Seller: "Thanks for the money; here is the item"')
    item.release()
    print('(seller no longer has item)')

ts = [Thread(target=buyer, args=(item,money,b)), Thread(target=seller, args=(item,money,b))]
for t in ts: t.start()
for t in ts: t.join()
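# because each thread blocks forever inside its second acquire call, the
# joins above never return and the line below is never reached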
print('Exchange finished')

Dining Philosophers

By far the most famous example of deadlock is the dining philosophers problem, which involves

  • A circular table with n seats
  • n chopsticks, one between each pair of adjacent seats
  • n diners, each running the following process in a loop:
    1. pick up the chopstick on one side
    2. pick up the chopstick on the other side
    3. eat a little
    4. put the chopsticks back down
    5. think a little
Code implementing Dining Philosophers
from threading import Thread, Lock
from time import sleep
from random import uniform

class Chopstick:
    """Wraps a Lock with a name for easier display"""
    def __init__(self, name):
        self.lock = Lock()
        self.name = name
    def __enter__(self): return self.lock.__enter__()
    def __exit__(self, *args): return self.lock.__exit__(*args)

class Philosopher(Thread):
    def __init__(self, name, left, right):
        super().__init__()
        self.left = left
        self.right = right
        self.name = name
        self.round = 0
    def run(self):
        while True:
            with self.left:
                print(self.name,'has', self.left.name)
                with self.right:
                    print(self.name,'has', self.right.name)
                    sleep(uniform(0,0.001)) # eat
                    self.round += 1
            print(self.name,f'put down the chopsticks ({self.round})')
            sleep(uniform(0,0.001)) # think

count = 6 # number of philosophers
utensil = [Chopstick(f'chopstick {i+1}') for i in range(count)]
seat = [Philosopher(f'Philosopher {i+1}', utensil[i-1], utensil[i]) for i in range(count)]
for p in seat: p.start()

The deadlock occurs if every diner has executed step 1, having one chopstick in hand, and is waiting for a neighboring diner to execute step 4 so they can continue with step 2.

Sometimes the resources that are being waited on are not as obvious as a lock or condition variable. Functions like read and recv will wait for the file descriptor they are reading from to either provide data or declare that no data will arrive. Threads and processes sometimes wait for one another to terminate.

Web browsers typically

  1. open a socket to a server
  2. send a request
  3. wait for and parse a response
  4. if the response suggests more parts are needed to display the page (images, for example), return to step 2 for each such part
  5. close the socket

Web servers typically

  1. accept a socket from a client
  2. read a request
  3. send a response
  4. if the socket hasn’t closed yet, return to step 2

If a client sends a malformed, too-short request (such as sending Content-Length: 10 but only sending 9 bytes in the message body) or the server is misconfigured to read more than it should (such as trying to read more bytes after reaching the end of the request) then the server could get stuck waiting for input in its step 2 while the client is stuck waiting for a response in its step 3.

In this case, the resource being held is the ability to send network packets and the resource being waited on is the arrival of more network packets. It’s still deadlock, even though there are no traditional synchronization primitives involved.

Because neither web servers nor web browsers want the other to be able to freeze them permanently, they generally add some kind of time-out to the waiting, eventually breaking the Coffman conditions and the deadlock by removing the waiting.
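Here is a sketch of that scenario and its time-out fix (the port number, message bytes, and two-second time-out are invented for this illustration, and the exchange is a stand-in for HTTP, not a real implementation). The client promises a ten-byte body but sends only nine; the server blocks in recv and the client blocks waiting for a response. Remove the settimeout calls and both sides block forever:

import socket
from threading import Thread
from time import sleep

def server():
    listener = socket.socket()
    listener.bind(('localhost', 8400))   # arbitrary port for this sketch
    listener.listen(1)
    conn, _ = listener.accept()
    conn.settimeout(2.0)                 # the time-out that breaks the deadlock
    body = b''
    try:
        while len(body) < 10:            # trusts the claimed 10-byte body length
            chunk = conn.recv(10 - len(body))
            if not chunk:
                break
            body += chunk
        conn.sendall(b'response')
    except socket.timeout:
        print('server: gave up waiting for the rest of the request')
    conn.close()
    listener.close()

def client():
    s = socket.create_connection(('localhost', 8400))
    s.settimeout(2.0)
    s.sendall(b'123456789')              # only 9 of the promised 10 bytes
    try:
        print('client received:', s.recv(1024))
    except socket.timeout:
        print('client: gave up waiting for a response')
    s.close()

t = Thread(target=server)
t.start()
sleep(0.2)                               # crude wait for the listener to be ready
client()
t.join()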

1.4 Livelock

Livelock most often occurs when trying to remove deadlock by having threads never wait, instead trying something else when the first attempt fails. It is characterized by a group of threads that are all doing something, yet making no progress, because their actions are not aligned with what the other threads are doing.

Livelock is quite rare compared to deadlock and primarily of concern when trying to overcome a design that experiences deadlock.

You’ve likely experienced brief livelock when you go to pass someone on one side but they step that way too, then you both step the other way, then back again.

Because you are humans and not identical machines, eventually one of you makes a choice different from the other’s; but identical machines could conceivably continue forever: you try the north side but so do they, so you try the south side but so do they, and so on.
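The hallway dance translates directly into code. In this sketch (the lock names and five-round limit are invented for the illustration), each thread politely releases its own lock and retries whenever it can’t get the other’s; a Barrier keeps the two threads artificially in lockstep, playing the role of the identical machines, so neither ever wins:

from threading import Thread, Lock, Barrier

left, right = Lock(), Lock()
sync = Barrier(2)   # keeps the two threads perfectly in step, forcing the livelock

def polite(me, mine, theirs, tries=5):
    for attempt in range(tries):
        mine.acquire()
        sync.wait()                      # both grab their own lock at the same moment
        if theirs.acquire(blocking=False):
            print(me, 'got both locks')  # never happens while the barrier keeps them in step
            theirs.release()
        else:
            print(me, 'backing off, attempt', attempt + 1)
        mine.release()
        sync.wait()                      # both back off at the same moment, and repeat

ts = [Thread(target=polite, args=('A', left, right)),
      Thread(target=polite, args=('B', right, left))]
for t in ts: t.start()
for t in ts: t.join()
print('both gave up: five rounds of activity, zero progress')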

1.5 Starvation

Starvation occurs when the synchronization design lets one or more threads never make progress even though others do. Starvation is of particular interest to the synchronization designer, who needs to make sure that every thread waiting for a lock eventually gets a turn instead of the lock always being awarded to the same few threads. User-level starvation most commonly occurs with asymmetric workloads: threads asking for only a few resources get them as soon as they become available, so a thread asking for many resources may never be able to get all of the ones it needs.

Both deadlock and livelock can create starvation, but starvation can also occur without either one.

A person who wishes to buy a house, requiring a large down payment, but also wishes to buy many cheaper items experiences starvation, never being able to buy the house, if their financial model is “when you see something you want and can afford, buy it.” The cheaper items will consume all available money and they’ll never accrue enough to buy the house.

2 Examples

2.1 Lock ordering

Deadlock requires a cycle in the waiting-for relation. Such a cycle can never occur if

  1. we impose a total order on locks; and
  2. every thread that needs to hold multiple locks simultaneously acquires them from least to greatest according to that total order.

If every dining philosopher always takes the chopstick to their left before the chopstick to their right, deadlock can occur; that’s not a total order.

If the chopsticks are numbered 1 to n and every dining philosopher takes the neighboring chopstick with the lower number first, deadlock cannot occur; that is a total order.
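Here is a sketch of that fix applied to the philosophers (bounded to a fixed number of rounds so that, unlike the looping version above, the demonstration terminates):

from threading import Thread, Lock

count = 6
chopstick = [Lock() for _ in range(count)]

def philosopher(i):
    # this philosopher's neighbors are chopsticks (i-1) mod count and i;
    # sorting their numbers means the lower-numbered one is always taken first
    a, b = sorted(((i - 1) % count, i))
    for _ in range(100):
        with chopstick[a]:      # least-numbered lock first: no cycle can form
            with chopstick[b]:
                pass            # eat a little
        pass                    # think a little

ts = [Thread(target=philosopher, args=(i,)) for i in range(count)]
for t in ts: t.start()
for t in ts: t.join()
print('all philosophers finished; no deadlock')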

2.2 Lock the locks

Sometimes deadlock can be bypassed by adding more locks. A simple example of this adds a lock-acquiring lock, and requires threads to operate as follows:

  1. acquire the lock-acquiring lock
  2. acquire any other locks you need
  3. release the lock-acquiring lock
  4. do work
  5. release all acquired locks

Because only one thread can be acquiring locks at a time, no cycles can occur. This isn’t particularly efficient in general, often requiring threads to hold more locks than they actually need for longer than necessary, but it does remove deadlock.
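Applied to the buyer/seller exchange above, the lock-acquiring lock looks like this sketch (the trader function, playing both roles, is invented for the illustration):

from threading import Thread, Lock

meta = Lock()                    # the lock-acquiring lock
item, money = Lock(), Lock()

def trader(name, first, second):
    with meta:                   # 1. acquire the lock-acquiring lock
        first.acquire()          # 2. acquire any other locks you need
        second.acquire()
    # 3. leaving the `with` block released the lock-acquiring lock
    print(name, 'holds both assets; exchange complete')   # 4. do work
    first.release()              # 5. release all acquired locks
    second.release()

ts = [Thread(target=trader, args=('Buyer', money, item)),
      Thread(target=trader, args=('Seller', item, money))]
for t in ts: t.start()
for t in ts: t.join()
print('Exchange finished without deadlock')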

Other added-lock approaches can also be effective in specific situations, though the lock-acquiring lock is the only one I’ve seen that is general enough to discuss outside of specific applications.

2.3 Message queues

One way to avoid data races, and thus avoid the need for locks and the possibility of deadlock, is to not allow threads to access the same data in the first place. Instead of treating threads like threads, treat them more like processes, giving each its own thread-local data that is not shared between threads. When threads need to communicate, have them do so through explicitly passing messages back and forth, similar to how different processes and different computers communicate through sockets.

When threads choose message-passing synchronization, they often do so using a special thread-safe queue structure, with the understanding that sending a message means putting it in the queue and receiving a message means taking it off the queue. Thread-safe queues are relatively easy to implement as singly-linked lists updated with atomic operations, avoiding the need for any locks.

In general, code using message queues is designed with messages in mind; it is rarely straightforward to take lock-based code and refactor it to use queues instead. Queues are particularly useful in a worker-thread model where tasks are added to the queue and idle threads remove a task from the queue and work on it.
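Python’s standard library provides such a queue (queue.Queue, which uses locks internally but hides them from the user). A minimal worker-thread sketch follows; the squaring task and the None quit message are invented for this illustration:

from threading import Thread
from queue import Queue

tasks = Queue()                      # a thread-safe message queue

def worker(wid):
    while True:
        n = tasks.get()              # blocks until a message (a task) arrives
        if n is None:                # a None message tells this worker to quit
            break
        print(f'worker {wid}: {n} squared is {n * n}')

workers = [Thread(target=worker, args=(i,)) for i in range(3)]
for w in workers: w.start()
for n in range(10):
    tasks.put(n)                     # send work to whichever worker is idle
for _ in workers:
    tasks.put(None)                  # one quit message per worker
for w in workers: w.join()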

Fancier message structures also exist, including publish-subscribe models where a single message reaches all interested threads simultaneously, execution-interrupting events, and bidirectional socket-like channels.

2.4 Optimistic transactions

A transaction is a set of operations that should happen together, atomically and without interruption.

Locks and condition variables allow transactions to ensure they happen together by blocking other threads from working while the transaction is in flight, but those approaches are somewhat pessimistic, preventing concurrent operations even when different concurrent acts would not conflict in practice. They also admit deadlock if not carefully used.

Optimistic transactions take a different approach: instead of locking other threads out and then doing their work uninterrupted they check the starting state, do their work, then check again to ensure no other thread changed anything before finalizing the result. If another thread did change something, they roll back their changes and start over from the changed data state. This approach cannot deadlock, but some implementations can livelock and all can suffer resource starvation.
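Here is a sketch of the idea on a shared counter (the brief commit lock stands in for the hardware compare-and-swap that real implementations use; everything else about the example is invented for the illustration). Each thread reads the value, computes its update outside any lock, and commits only if nothing changed in the meantime, otherwise rolling back and retrying:

from threading import Thread, Lock

value = 0
commit = Lock()   # held only for the instant of check-and-commit, never while working

def add_one_optimistically():
    global value
    while True:
        seen = value                 # check the starting state
        new = seen + 1               # do the work without blocking anyone
        with commit:                 # finalize: did another thread change anything?
            if value == seen:
                value = new          # no conflict: commit
                return
        # conflict: roll back (discard `new`) and start over from the changed state

def thousand():
    for _ in range(1000):
        add_one_optimistically()

ts = [Thread(target=thousand) for _ in range(4)]
for t in ts: t.start()
for t in ts: t.join()
print('final value:', value)         # always 4000: no update is ever lost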

Optimistic transactions are challenging to design, much more involved than simply wrapping a critical section in a lock. However, when conflicting updates are not the common case, optimistic transactions can be much faster than lock-based thread safety.