So far, we've been solving an MDP under the assumption that we started off knowing all details of P (transition probability) and R (reward function). So we were simply finding a closed form for the recursive Bellman equation. Reinforcement learning (RL) involves the same basic model of states and actions, but our learning agent starts off knowing nothing about P and R. It must find out about P and R (as well as U and \(\pi\)) by taking actions and seeing what happens.
Obviously, the reinforcement learner has terrible performance right at the beginning. The goal is to make it eventually improve to a reasonable level of performance. A reinforcement learning agent must start taking actions with incomplete (initially almost no) information, hoping to gradually learn how to perform well. We typically imagine that it does a very long sequence of actions, returning to the same state multiple types. Its performance starts out very bad but gradually improves to a reasonable level.
A reinforcement learner may be
The latter would be safer for situations where real-world hazards could cause real damage to the robot or the environment.
An important hybrid is "experience replay." In this case, we have an online learner. But instead of forgetting its old experiences, it will
This is a way to make the most use of training experiences that could be limited by practical considerations, e.g. slow, costly, dangerous.
A reinforcement learner repeats the following sequence of steps:
There are two basic choices for the "internal representation":
We'll look at model-free learning next lecture.
A model based learner operates much like a naive Bayes learning algorithm, except that it has to start making decisions as it trains. Initially it would use some default method for picking actions (e.g. random). As it moves around taking actions, it tracks counts for what rewards it got in different states and what state transitions resulted from its commanded actions. Periodically it uses these counts to
Gradually transition to using our learned policy \(\pi\).
This obvious implementation of a model-based learner tends to be risk-averse. Once a clear policy starts to emerge, it has a strong incentive to stick to its familiar strategy rather than exploring the rest of its environment. So it could miss very good possibilities (e.g. large rewards) if it didn't happen to see them early.
To improve performance, we modify our method of selecting actions:
"Explore" could be implemented in various ways, such as
The probability of exploring would typically be set high at the start of learning and gradually decreased, allowing the agent to settle into a specific final policy.
The decision about how often to explore must depend on the state s. States that are easy to reach from the starting state(s) end up explored very early, when more distant states may have barely been reached. For each state s, it's important to continue doing significant amounts of exploration until each action has been tried (in s) enough times to have a clear sense of what it does.