UI logo
CS 440/ECE 448
Margaret Fleck

Reinforcement Learning 3


Pieces of an MDP

When we're in state s, we command an action \( \pi(s) \). However, our buggy controller may put us into a variety of choices for the next state s', with probabilities given by the transition function P.

Bellman equation for optimal policy

\( U(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' | s,a)U(s') \)

Recap: Value iteration

Recall how we solve the Bellman equation using value iteration. Let \(U_t\) be the utility values at iteration step t.

Then extract the corresponding policy:

\( \pi(s) = \text{argmax}_a \sum_{s'} P(s' | s,a) U(s') \)

Value iteration eventually converges to the solution. Notice that the optimal utility values are uniquely determined, but there may be more than one policy consistent with them.

Policy Iteration

Suppose that we have picked some policy \( \pi \) telling us what move to command in each state. Then the Bellman equation for this fixed policy is simpler because we know exactly what action we'll command:

Bellman equation for a fixed policy: \(\ \ U(s) = R(s) + \gamma \sum_{s'\in S} P(s' | s,\pi(s))U(s') \)

Because the optimal policy is tightly coupled to the correct utility values, we can rephrase our optimization problem as finding the best policy. This is "policy iteration". It produces the same solution as value iteration, but faster.

Specifically, the policy iteration algorithm looks like this:

Policy iteration makes the emerging policy values explicit, so they can help guide the process of refining the utility values.

The policy improvement step is easy. Just use this equation:

\( \pi(s) = \text{argmax}_a \sum_{s'} P(s' | s,a) U(s') \)

We still need to understand how to do the policy evaluation step.

Policy evaluation

Since we have a draft policy \( \pi(s) \) when doing policy evaluation, we have a simplified Bellman equation (below).

\( U(s) = R(s) + \gamma \sum_{s'} P(s' | s, \pi(s)) U(s') \)

We have one of these equations for each state s. The equations are still recursive (like the original Bellman equation) but they are now linear. So have two options for adjusting our utility function:

The value estimation approach is usually faster. We don't need an exact (fully converged) solution, because we'll be repeating this calculation each time we refine our policy \( \pi \).

Asynchronous dynamic programming

One useful weak to solving Markov Decision Process is "asynchronous dynamic programming." In each iteration, it's not necessary to update all states. We can select only certain states for updating. E.g.

The details can be spelled out in a wide variety of ways.