Thread Safety

1 Things to Avoid

A function or data structure is thread safe if it behaves as expected when accessed concurrently by multiple threads. Generally, this means avoiding each of the non-thread-safe problems listed below.

1.1 Implicit State

The strtok function’s official description includes this:

On the first call to strtok(), the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str must be NULL.

This suggests that strtok stores the argument from its first call in some kind of hidden internal state and uses that internal state to handle subsequent calls.

What happens if two different threads each call strtok with different initial strings, then call it repeatedly with NULL to parse the rest of those strings? They end up sharing that same hidden state: once the second thread calls it, the first string is lost and the tokens of the second string are split between the two threads in whatever order the strtok calls happen to arrive.

A function is called reentrant (or, less precisely, thread safe) if it does not rely on such hidden state. For example, strtok_r is a reentrant version of strtok that requires the caller to pass in a pointer to a state variable instead of using a hidden global variable to keep track of which string is being tokenized and how far that tokenization has proceeded.

1.2 Data Race

In some cases, having one piece of state shared by several threads is the intent. Perhaps threads are communicating by reading and writing values to shared variables, or are working together to count how many of something they find, or are all accessing and updating a shared dictionary or mapping.

A data race occurs when one thread updates a variable while another thread is using it. (A data race is the most common type of race condition, and is sometimes just called a race condition.)

Suppose you are copying my name from the board into your notes. I wrote it as Dr. Tychonievich and you’ve copied over the first seven characters, Dr. Tych, when I change it to Prof. Luther; you then finish copying the remaining letters, getting Dr. Tychther due to a data race between my changing the board and your copying it letter-by-letter into your notes.

1.3 Deadlock

We avoid data races by synchronizing access to shared data, meaning only one thread accesses it at a time, forcing other threads to wait their turn. If we design this poorly we can get deadlock where each thread is waiting for something that can’t happen until a different thread stops waiting.

Let’s consider deadlock created by the Lock synchronization primitive. Imagine drawing a graph with a directed edge from each lock a thread holds to the lock that thread is waiting to acquire. Deadlock has occurred if and only if that graph contains a cycle.

For fancier synchronization primitives the definition of the graph is a bit more involved, but the same idea holds: deadlock occurs when there’s a cycle in the graph of the waiting-for relation. We call this condition circular wait, and it depends on three other, simpler conditions:

  • mutual exclusion: each resource can be held by at most one thread at a time
  • hold and wait: a thread can hold some resources while waiting to acquire others
  • no preemption: a resource cannot be forcibly taken away from the thread holding it

Together these four conditions are known as the Coffman Conditions after Edward Coffman Jr., who published them in 1971.

High-value exchange

Many action movies have a scene where person M has big-case-of-money, person W has super-valuable-widget, and both of them want to exchange the two but both think the other will try to steal both. Thus both run a process like

  1. acquire other person’s thing
  2. release my thing to other person

… causing a deadlock: each has one resource and is waiting on the other resource before proceeding.

Movies having an imperative to keep moving, these scenes generally resolve in a few seconds with something that moves the plot forward, but without such plot pressure they could conceivably stay blocked forever.

Dining Philosophers

By far the most famous example of deadlock is the dining philosophers problem, which involves

  • A circular table with n seats
  • n chopsticks, one between each seat
  • n diners, each running the following process in a loop:
    1. pick up the chopstick on one side
    2. pick up the chopstick on the other side
    3. eat a little
    4. put the chopsticks back down
    5. think a little

The deadlock occurs if every diner has executed step 1, having one chopstick in hand, and is waiting for a neighboring diner to execute step 4 so they can continue with step 2.

1.4 Livelock

Livelock most often occurs when trying to remove deadlock by having threads not wait, instead trying something else when the first thing doesn’t work. It is characterized by a group of threads all doing something, but not making progress because those things are not aligned with what the other threads are doing.

Livelock is quite rare compared to deadlock and primarily of concern when trying to overcome a design that experiences deadlock.

You’ve likely experienced brief livelock when you go to pass someone on one side but they step that way too, then you both step the other way, then back again.

Because you are humans and not identical, eventually one of you makes a choice the other doesn’t, but if you had been identical machines this could have continued forever: you try the north side but so do they, so then you try the south side but so do they, and so on.

1.5 Starvation

Starvation occurs when the synchronization design means that one or more threads never make progress even though others do. Starvation is particularly of interest to the synchronization designer, who must make sure that every thread waiting for a lock eventually gets a turn instead of always awarding it to the same few threads. User-level starvation most commonly occurs with asymmetric workloads: threads asking for only a few resources get them as soon as they become available, so a thread asking for many resources can never get all of those it needs at once.

Both deadlock and livelock can create starvation, but starvation can also occur without either one.

A person who wishes to buy a house, which requires a large down payment, but also wishes to buy many cheaper items experiences starvation if their financial model is “when you see something you want and can afford, buy it”: the cheaper items consume all available money, so they never accrue enough to buy the house.

2 Examples

2.1 Lock ordering

Deadlock requires a cycle in the waiting-for relation. Such a cycle can never occur if

  1. we impose a total order on locks; and
  2. every thread that needs to hold multiple locks simultaneously acquires them from least to greatest according to that total order.

If every dining philosopher always takes the chopstick to their left before the chopstick to their right, deadlock can occur; that’s not a total order.

If the chopsticks are numbered 1 to n and every dining philosopher takes the neighboring chopstick with the lower number first, deadlock cannot occur; that is a total order.

2.2 Lock the locks

Sometimes deadlock can be bypassed by adding more locks. A simple example of this adds a lock-acquiring lock, and requires threads to operate as follows:

  1. acquire the lock-acquiring lock
  2. acquire any other locks you need
  3. release the lock-acquiring lock
  4. do work
  5. release all acquired locks

Because only one thread can be acquiring locks at a time, no cycles can occur. This isn’t particularly efficient in general, often requiring threads to hold more locks than they actually need for longer than necessary, but it does remove deadlock.

Other added-lock approaches can also be effective in specific situations, though the lock-acquiring lock is the only one I’ve seen that is general enough to discuss outside of specific applications.

2.3 Message queues

One way to avoid data races, and thus avoid the need for locks and the possibility of deadlock, is to not allow threads to access the same data in the first place. Instead of treating threads like threads, treat them more like processes, giving each its own thread-local data that is not shared between threads. When threads need to communicate, have them do so through explicitly passing messages back and forth, similar to how different processes and different computers communicate through sockets.

When threads choose message-passing synchronization, they often do so using a special thread-safe queue structure, with the understanding that sending a message means putting it in the queue and receiving a message means reading it from the queue. Thread-safe queues are relatively easy to implement as singly-linked lists updated with atomic operations, avoiding the need for any locks.

In general, code using message queues is designed with messages in mind; it is rarely straightforward to take lock-based code and refactor it to use queues instead. Queues are particularly useful in a worker-thread model where tasks are added to the queue and idle threads remove a task from the queue and work on it.

Fancier message structures also exist, including publish-subscribe models where a single message reaches all interested threads simultaneously, execution-interrupting events, and bidirectional socket-like channels.

2.4 Optimistic transactions

A transaction is a set of operations that should happen together, atomically and without interruption.

Locks and condition variables allow transactions to ensure they happen together by blocking other threads from working while the transaction is in flight, but those approaches are somewhat pessimistic, preventing concurrent operations even when different concurrent acts would not conflict in practice. They also admit deadlock if not carefully used.

Optimistic transactions take a different approach: instead of locking other threads out and then doing their work uninterrupted they check the starting state, do their work, then check again to ensure no other thread changed anything before finalizing the result. If another thread did change something, they roll back their changes and start over from the changed data state. This approach cannot deadlock, but some implementations can livelock and all can suffer resource starvation.

Optimistic transactions are challenging to design, much more involved than simply wrapping a critical section in a lock. However, when conflicting updates are not the common case, optimistic transactions can be much faster than lock-based thread safety.