Pieces of an MDP
When we're in state s, we command an action π(s). However, our buggy controller may put us into any of several possible next states s', with probabilities given by the transition function P.
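To make this concrete, here is one simple way these pieces might be represented in Python. The states, actions, and numbers below are made up for illustration; they are not the gridworld from the figures.

```python
# A tiny, made-up MDP represented with plain Python dicts.

gamma = 0.9                      # discount factor

states = ["A", "B", "C"]
actions = ["left", "right"]

# R[s]: reward received for being in state s
R = {"A": 0.0, "B": 0.0, "C": 1.0}

# P[(s, a)]: dict mapping next state s' to P(s' | s, a).
# The "buggy controller": each action usually works, but sometimes
# leaves us where we were.
P = {
    ("A", "right"): {"B": 0.8, "A": 0.2},
    ("A", "left"):  {"A": 1.0},
    ("B", "right"): {"C": 0.8, "B": 0.2},
    ("B", "left"):  {"A": 0.8, "B": 0.2},
    ("C", "right"): {"C": 1.0},
    ("C", "left"):  {"B": 0.8, "C": 0.2},
}
```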
Bellman equation for best policy
U(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' | s,a) U(s')
Recall how we solve the Bellman equation using value iteration. Let U_i be the utility values at iteration step i.
U_{i+1}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' | s,a) U_i(s')
This can take the gridworld problem (left below) and produce the utility values (right below).
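Here is a minimal value-iteration sketch, assuming the dict-based MDP representation from the earlier snippet (the function and variable names are ours, chosen for illustration).

```python
def value_iteration(states, actions, P, R, gamma, n_iters=100):
    """Repeatedly apply the Bellman update
       U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_i(s')."""
    U = {s: 0.0 for s in states}          # U_0: start everything at zero
    for _ in range(n_iters):
        U_new = {}
        for s in states:
            best = max(
                sum(p * U[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            U_new[s] = R[s] + gamma * best
        U = U_new
    return U

# Example usage with the toy MDP defined above:
# U = value_iteration(states, actions, P, R, gamma)
```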
From the final converged utility values, we can read off a final policy (bottom right) by essentially moving towards the neighbor with the highest utility. Taking into account the probabilistic nature of our actions, we get the equation
\pi(s) = \text{argmax}_a \sum_{s'} P(s' | s,a) U(s')
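A matching sketch of this policy read-off, using the same hypothetical dict representation:

```python
def extract_policy(states, actions, P, U):
    """pi(s) = argmax_a sum_{s'} P(s'|s,a) U(s')."""
    return {
        s: max(actions,
               key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }
```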
Value iteration eventually converges to the solution. Notice that the optimal utility values are uniquely determined, but there may be more than one policy consistent with them.
Policy iteration produces the same solution, but faster. It operates as follows: start with an initial guess for the policy π, then alternate two steps: policy evaluation, which computes utility values U assuming the agent follows the current policy π, and policy improvement, which updates π based on the new utility values.
The values for U and \pi are closely coupled. Policy iteration makes the emerging policy values explicit, so they can help guide the process of refining the utility values.
We saw above how to convert a set of utility values into a policy. We use this for the policy improvement step. We still need to understand how to do the first (policy evaluation) step.
Our Bellman equation (below) finds the value corresponding to the best action we might take in state s.
U(s) = R(s) + \gamma \max_a \sum_{s'} P(s' | s,a)U(s')
However, in policy iteration, we already have a draft policy. So we do a similar computation but assuming that we will command action \pi(s) .
U(s) = R(s) + \gamma \sum_{s'} P(s' | s, \pi(s)) U(s')
We have one of these equations for each state s. The simplification means that each equation is linear. So we have two options for finding a solution: solve the system of linear equations exactly (e.g. with standard linear-algebra methods), or estimate the values by running a few iterations of this simplified Bellman update.
The value estimation approach is usually faster. We don't need an exact (fully converged) solution, because we'll be repeating this calculation each time we refine our policy \pi .
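A minimal policy-iteration sketch under the same assumed representation, using the iterative (value-estimation) option for policy evaluation and reusing the extract_policy helper from the earlier sketch; the iteration counts are arbitrary.

```python
def policy_evaluation(states, P, R, gamma, pi, U, n_sweeps=20):
    """Approximate U for a fixed policy pi by iterating
       U(s) <- R(s) + gamma * sum_{s'} P(s'|s,pi(s)) U(s')."""
    for _ in range(n_sweeps):
        U = {
            s: R[s] + gamma * sum(p * U[s2]
                                  for s2, p in P[(s, pi[s])].items())
            for s in states
        }
    return U

def policy_iteration(states, actions, P, R, gamma, n_rounds=50):
    pi = {s: actions[0] for s in states}    # arbitrary initial policy
    U = {s: 0.0 for s in states}
    for _ in range(n_rounds):
        U = policy_evaluation(states, P, R, gamma, pi, U)   # evaluation
        pi = extract_policy(states, actions, P, U)          # improvement
    return U, pi
```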
One useful tweak to solving a Markov Decision Process is "asynchronous dynamic programming." In each iteration, it's not necessary to update all states; we can select only certain states for updating, e.g. states that are visited frequently under the current policy, or states whose utility values changed significantly in recent iterations.
The details can be spelled out in a wide variety of ways.
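One rough sketch of what asynchronous updates might look like, again with our made-up representation; the state-selection heuristic here is just a placeholder.

```python
import random

def async_value_iteration(states, actions, P, R, gamma,
                          n_updates=1000, select=None):
    """Bellman updates applied to one selected state at a time,
       rather than sweeping over all states each iteration."""
    U = {s: 0.0 for s in states}
    for _ in range(n_updates):
        # 'select' is whatever heuristic we like; default: a random state.
        s = select(U) if select else random.choice(states)
        U[s] = R[s] + gamma * max(
            sum(p * U[s2] for s2, p in P[(s, a)].items())
            for a in actions
        )
    return U
```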
So far, we've been solving an MDP under the assumption that we started off knowing all details of P (transition probability) and R (reward function). So we were simply finding a closed form for the recursive Bellman equation. Reinforcement learning (RL) involves the same basic model of states and actions, but our learning agent starts off knowing nothing about P and R. It must find out about P and R (as well as U and \pi) by taking actions and seeing what happens.
Obviously, the reinforcement learner has terrible performance right at the beginning. The goal is to make it eventually improve to a reasonable level of performance. A reinforcement learning agent must start taking actions with incomplete (initially almost no) information, hoping to gradually learn how to perform well. We typically imagine that it does a very long sequence of actions, returning to the same state multiple times. Its performance starts out very bad but gradually improves to a reasonable level.
A reinforcement learner may be online, taking actions in the real world as it learns, or offline, learning from recorded data or simulations.
The latter would be safer for situations where real-world hazards could cause real damage to the robot or the environment.
An important hybrid is "experience replay." In this case, we have an online learner. But instead of forgetting its old experiences, it will store them and periodically re-use them as if they were fresh training data.
This is a way to make the most of training experiences that could be limited by practical considerations, e.g. because gathering them is slow, costly, or dangerous.
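A minimal replay-buffer sketch; the class and method names are ours, purely for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store past (s, a, reward, s') experiences so an online learner
       can re-use them instead of throwing them away."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop off

    def add(self, s, a, reward, s_next):
        self.buffer.append((s, a, reward, s_next))

    def sample(self, batch_size):
        # Re-play a random batch of old experiences as if they were new.
        return random.sample(list(self.buffer),
                             min(batch_size, len(self.buffer)))
```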
A reinforcement learner repeats the following sequence of steps: observe the current state, choose an action and execute it, observe the resulting reward and new state, and then update its internal representation (sketched below).
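A generic version of this loop might look like the following; the env and agent interfaces (reset, step, choose_action, update) are hypothetical, not any particular library's API.

```python
def run_agent(env, agent, n_steps=10000):
    """Generic reinforcement-learning loop: act, observe, update."""
    s = env.reset()                           # observe the starting state
    for _ in range(n_steps):
        a = agent.choose_action(s)            # pick an action (policy + exploration)
        s_next, reward = env.step(a)          # execute it, observe the outcome
        agent.update(s, a, reward, s_next)    # update internal representation
        s = s_next
```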
There are two basic choices for the "internal representation": model-based, in which the agent learns explicit estimates of P and R and derives U and π from them, and model-free, in which the agent learns utility (or action-value) estimates directly, without modeling P and R.
We'll look at model-free learning next lecture.
A model-based learner operates much like a naive Bayes learning algorithm, except that it has to start making decisions as it trains. Initially it would use some default method for picking actions (e.g. random). As it moves around taking actions, it tracks counts for what rewards it got in different states and what state transitions resulted from its commanded actions. Periodically it uses these counts to re-estimate P and R (by dividing counts, much as in naive Bayes), recompute U and π (e.g. with value or policy iteration), and gradually transition to using our learned policy π.
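Here is a rough sketch of such a model-based learner, assuming the dict-based representation and the value_iteration / extract_policy helpers from the earlier sketches; all names and the replanning schedule are made up for illustration.

```python
import random
from collections import defaultdict

class ModelBasedLearner:
    """Track counts of rewards and state transitions, periodically
       re-estimate P and R from them, and recompute a policy."""

    def __init__(self, states, actions, gamma=0.9, replan_every=100):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.replan_every = replan_every
        self.steps = 0
        self.trans_counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
        self.reward_sum = defaultdict(float)                       # s -> total reward seen
        self.visits = defaultdict(int)                             # s -> number of visits
        self.pi = {}                                               # learned policy (empty at first)

    def choose_action(self, s):
        # Default method before a policy emerges: pick an action at random.
        return self.pi.get(s, random.choice(self.actions))

    def update(self, s, a, reward, s_next):
        self.trans_counts[(s, a)][s_next] += 1
        self.reward_sum[s] += reward
        self.visits[s] += 1
        self.steps += 1
        if self.steps % self.replan_every == 0:
            self.replan()

    def replan(self):
        # Estimate P and R by dividing counts (much as naive Bayes does) ...
        P, R = {}, {}
        for s in self.states:
            R[s] = self.reward_sum[s] / self.visits[s] if self.visits[s] else 0.0
            for a in self.actions:
                counts = self.trans_counts[(s, a)]
                total = sum(counts.values())
                # For an unseen (s, a) pair, pretend the action leaves us in place.
                P[(s, a)] = ({s2: c / total for s2, c in counts.items()}
                             if total else {s: 1.0})
        # ... then recompute U and pi with the earlier sketches.
        U = value_iteration(self.states, self.actions, P, R, self.gamma)
        self.pi = extract_policy(self.states, self.actions, P, U)
```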
This obvious implementation of a model-based learner tends to be risk-averse. Once a clear policy starts to emerge, it has a strong incentive to stick to its familiar strategy rather than exploring the rest of its environment. So it could miss very good possibilities (e.g. large rewards) if it didn't happen to see them early.
To improve performance, we modify our method of selecting actions: with some probability the agent "explores," and otherwise it "exploits" by following its current policy π.
"Explore" could be implemented in various ways, such as
The probability of exploring would typically be set high at the start of learning and gradually decreased, allowing the agent to settle into a specific final policy.
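For example, an epsilon-greedy rule with a decaying exploration probability might be sketched as follows; the schedule and its constants are arbitrary choices, not anything prescribed by the notes.

```python
import random

def epsilon_greedy(pi, s, actions, epsilon):
    """With probability epsilon, explore (random action);
       otherwise exploit the current policy pi."""
    if random.random() < epsilon:
        return random.choice(actions)             # explore
    return pi.get(s, random.choice(actions))      # exploit

def epsilon_schedule(t, start=1.0, end=0.05, decay_steps=10000):
    """Start exploring a lot, then settle down over time (linear decay)."""
    frac = min(t / decay_steps, 1.0)
    return start + frac * (end - start)
```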
The decision about how often to explore should depend on the state s. States that are easy to reach from the starting state(s) end up explored very early, while more distant states may have barely been reached. For each state s, it's important to continue doing significant amounts of exploration until each action has been tried (in s) enough times to give a clear sense of what it does.
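One simple way to make exploration state-dependent is to keep per-state action counts; this is a sketch of one possible count-based rule, with an arbitrary threshold.

```python
import random
from collections import defaultdict

action_counts = defaultdict(int)   # (s, a) -> how often a has been tried in s

def choose_action(pi, s, actions, min_tries=10):
    """Keep exploring in state s until every action has been tried
       there at least min_tries times; then follow the policy."""
    untried = [a for a in actions if action_counts[(s, a)] < min_tries]
    a = random.choice(untried) if untried else pi.get(s, random.choice(actions))
    action_counts[(s, a)] += 1
    return a
```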