CS 440/ECE 448
Margaret Fleck

Recap

Pieces of an MDP

• states s in S
• actions a in A
• transition probabilities P(s' | s,a)
• reward function R(s)
• policy $$\pi(s)$$ returns action

When we're in state s, we command an action $$\pi(s)$$. However, our buggy controller may put us into a variety of choices for the next state s', with probabilities given by the transition function P.

Bellman equation for optimal policy

$$U(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' | s,a)U(s')$$

Recap: Value iteration

Recall how we solve the Bellman equation using value iteration. Let $$U_t$$ be the utility values at iteration step t.

• Initialize $$U_0(s) = 0$$, for all states s
• For i=0 until values converge, update U using the equation
$$U_{i+1}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' | s,a) U_i(s')$$

Then extract the corresponding policy:

$$\pi(s) = \text{argmax}_a \sum_{s'} P(s' | s,a) U(s')$$

Value iteration eventually converges to the solution. Notice that the optimal utility values are uniquely determined, but there may be more than one policy consistent with them.

Policy Iteration

Suppose that we have picked some policy $$\pi$$ telling us what move to command in each state. Then the Bellman equation for this fixed policy is simpler because we know exactly what action we'll command:

Bellman equation for a fixed policy: $$\ \ U(s) = R(s) + \gamma \sum_{s'\in S} P(s' | s,\pi(s))U(s')$$

Because the optimal policy is tightly coupled to the correct utility values, we can rephrase our optimization problem as finding the best policy. This is "policy iteration". It produces the same solution as value iteration, but faster.

Specifically, the policy iteration algorithm looks like this:

• Start with an initial guess for policy $$\pi$$.
• Alternate two steps:
• Policy evaluation: use policy $$\pi$$ to estimate utility values U
• Policy improvement: use utility values U to calculate a new policy $$\pi$$

Policy iteration makes the emerging policy values explicit, so they can help guide the process of refining the utility values.

The policy improvement step is easy. Just use this equation:

$$\pi(s) = \text{argmax}_a \sum_{s'} P(s' | s,a) U(s')$$

We still need to understand how to do the policy evaluation step.

Policy evaluation

Since we have a draft policy $$\pi(s)$$ when doing policy evaluation, we have a simplified Bellman equation (below).

$$U(s) = R(s) + \gamma \sum_{s'} P(s' | s, \pi(s)) U(s')$$

We have one of these equations for each state s. The equations are still recursive (like the original Bellman equation) but they are now linear. So have two options for adjusting our utility function:

• linear algebra
• a few iterations of value iteration

The value estimation approach is usually faster. We don't need an exact (fully converged) solution, because we'll be repeating this calculation each time we refine our policy $$\pi$$.

Asynchronous dynamic programming

One useful weak to solving Markov Decision Process is "asynchronous dynamic programming." In each iteration, it's not necessary to update all states. We can select only certain states for updating. E.g.

• states frequently seen in some application (e.g. a game)
• states for which the Bellman equation has a large error (i.e. compare values for left and right sides of the equation)

The details can be spelled out in a wide variety of ways.