The first thing you need to do is to download this file: mp11.zip. If you want, you can also download mp11_extra.zip, the extra credit assignment. mp11.zip has the following content:

submitted.py: Your homework. Edit, and then submit to Gradescope.

mp11_notebook.ipynb: This is a Jupyter notebook to help you debug. You can completely ignore it if you want, although you might find that it gives you useful instructions.

pong.py: This is a program that plays Pong. If called interactively, it will call the module pong_display.py to create a display, so that you can play. If told to use a Q-learner, it will call your submitted.py to do Q-learning.

grade.py: Once your homework seems to be working, you can test it by typing python grade.py, which will run the tests in tests/test_visible.py.

tests/test_visible.py: This file contains about half of the unit tests that Gradescope will run in order to grade your homework. If you can get a perfect score on these tests, then you should also get a perfect score on the additional hidden tests that Gradescope uses.

requirements.txt: This tells you which python packages you need to have installed in order to run grade.py. You can install all of those packages by typing pip install -r requirements.txt or pip3 install -r requirements.txt.

This file (mp11_notebook.ipynb) will walk you through the whole MP, giving you instructions and debugging tips as you go.
Pong was the first video game produced by Atari. It is a simple game, based on table tennis. Here is a two-person version of the game: https://commons.wikimedia.org/wiki/File:Pong_Game_Test2.gif
We will be playing a one-person version of the game:
The game is pretty simple, but in order to get a better feeling for it, you may want to try playing it yourself. Use the up arrow to move the paddle upward, and the down arrow to move the paddle downward. See how high you can make your score:
!python pong.py
pygame 2.3.0 (SDL 2.24.2, Python 3.10.9) Hello from the pygame community. https://www.pygame.org/contribute.html Completed 0 games, 3 rewards, 1012 frames, score 2, max score 2 Completed 1 games, 4 rewards, 1220 frames, score 0, max score 2
Once you figure out how to use the arrow keys to control your paddle, we hope you will find that the game is not too hard for a human to play. For a computer, however, it is difficult to know where the paddle should be moved at each time step. In order to see how difficult it is for a computer to play, let's ask the "random" player to play the game.
WARNING: The following line will open a pygame window. The pygame window will be hidden behind this window; in order to see it, you will need to minimize this window. The pygame window will consume a lot of CPU time just waiting for the processor, so to kill it, come back to this window, click on the block below, and then click the Jupyter "stop" button (the square button at the top of this window) to stop processing.
!python pong.py --player random
pygame 2.3.0 (SDL 2.24.2, Python 3.10.9) Hello from the pygame community. https://www.pygame.org/contribute.html Completed 0 games, 1 rewards, 182 frames, score 0, max score 0 Completed 1 games, 2 rewards, 390 frames, score 0, max score 0 Completed 2 games, 3 rewards, 571 frames, score 0, max score 0 Completed 3 games, 4 rewards, 752 frames, score 0, max score 0 ^C Traceback (most recent call last): File "/Users/jhasegaw/Dropbox/mark/teaching/ece448/ece448labs/spring23/mp11/src/pong.py", line 270, in <module> application.run() File "/Users/jhasegaw/Dropbox/mark/teaching/ece448/ece448labs/spring23/mp11/src/pong.py", line 170, in run self.display.update_display() File "/Users/jhasegaw/Dropbox/mark/teaching/ece448/ece448labs/spring23/mp11/src/pong_display.py", line 65, in update_display self.fps.tick(60) KeyboardInterrupt
The first thing you will do is to create a q_learner
object that can store your learned Q table and your N table (table of exploration counts).
Like other object-oriented languages, Python permits you to create new classes in order to store data that your program will need later. If you are not already very, very familiar with Python classes, you might want to study the Python class tutorial: https://docs.python.org/3/tutorial/classes.html
Like any other object in Python, a q_learner object is created by calling the class as a function, e.g., my_q_learner = submitted.q_learner(). Doing so calls the method submitted.q_learner.__init__(). Let's look at the docstring to see what it should do.
import submitted, importlib
importlib.reload(submitted)
help(submitted.q_learner.__init__)
Help on function __init__ in module submitted: __init__(self, alpha, epsilon, gamma, nfirst, state_cardinality) Create a new q_learner object. Your q_learner object should store the provided values of alpha, epsilon, gamma, and nfirst. It should also create a Q table and an N table. Q[...state..., ...action...] = expected utility of state/action pair. N[...state..., ...action...] = # times state/action has been explored. Both are initialized to all zeros. Up to you: how will you encode the state and action in order to define these two lookup tables? The state will be a list of 5 integers, such that 0 <= state[i] < state_cardinality[i] for 0 <= i < 5. The action will be either -1, 0, or 1. It is up to you to decide how to convert an input state and action into indices that you can use to access your stored Q and N tables. @params: alpha (scalar) - learning rate of the Q-learner epsilon (scalar) - probability of taking a random action gamma (scalar) - discount factor nfirst (scalar) - exploring each state/action pair nfirst times before exploiting state_cardinality (list) - cardinality of each of the quantized state variables @return: None
Write your __init__
function to meet the requirements specified in the docstring. Once you have completed it, the following code should run without errors:
importlib.reload(submitted)
q_learner = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
print(q_learner)
<submitted.q_learner object at 0x7fa5a427bd60>
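For reference, here is a minimal sketch of one way to satisfy the __init__ docstring. It assumes (this is just one possible choice; the docstring leaves the encoding up to you) that Q and N are numpy arrays whose first five axes are indexed by the quantized state variables and whose last axis, of size 3, is indexed directly by the action, so that Python's negative indexing puts action -1 in the last slot.

import numpy as np

class q_learner():
    def __init__(self, alpha, epsilon, gamma, nfirst, state_cardinality):
        # Store the hyperparameters for later use.
        self.alpha = alpha
        self.epsilon = epsilon
        self.gamma = gamma
        self.nfirst = nfirst
        self.state_cardinality = state_cardinality
        # One entry per (state, action) pair.  The last axis has size 3,
        # holding actions 0, +1, and -1 (the -1 entry is reached by
        # negative indexing).  Both tables start out all zeros.
        self.Q = np.zeros(list(state_cardinality) + [3])
        self.N = np.zeros(list(state_cardinality) + [3])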
In order to manage the exploration/exploitation tradeoff, we will be using both the "epsilon-first" and the "epsilon-greedy" strategies (https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies).
The epsilon-first strategy explores every state/action pair at least nfirst times before it ever starts to exploit. Your q_learner should therefore keep a table of how many times each state/action pair has been explored. How you store that table is up to you; in order to have a standardized API, however, you need to write a method called report_exploration_counts that returns a list of the three exploration counts for a given state.
importlib.reload(submitted)
help(submitted.q_learner.report_exploration_counts)
Help on function report_exploration_counts in module submitted: report_exploration_counts(self, state) Check to see how many times each action has been explored in this state. @params: state (list of 5 ints): ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle, all quantized. 0 <= state[i] < state_cardinality[i], for all i in [0,4]. @return: explored_count (array of 3 ints): number of times that each action has been explored from this state. The mapping from actions to integers is up to you, but there must be three of them.
Write report_exploration_counts
so that it returns a list or array for any given state. Test your code with the following:
importlib.reload(submitted)
q_learner = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
print('This is how many times state [0,0,0,0,0] has been explored so far:')
print(q_learner.report_exploration_counts([0,0,0,0,0]))
print('This is how many times state [9,9,1,1,9] has been explored so far:')
print(q_learner.report_exploration_counts([9,9,1,1,9]))
This is how many times state [0,0,0,0,0] has been explored so far: [0. 0. 0.] This is how many times state [9,9,1,1,9] has been explored so far: [0. 0. 0.]
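A minimal sketch of report_exploration_counts, under the array layout assumed in the earlier sketch; indexing N by the five quantized state variables leaves the length-3 vector of per-action counts:

# Method of the q_learner class; shown standalone for readability.
def report_exploration_counts(self, state):
    # Indexing N by the five quantized state variables leaves the
    # length-3 array of counts, one per action.
    return self.N[tuple(state)]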
When your learner first starts learning, it will call the function choose_unexplored_action to choose an unexplored action. This function should choose an action uniformly at random from the set of unexplored actions in the given state, if there are any:
importlib.reload(submitted)
help(submitted.q_learner.choose_unexplored_action)
Help on function choose_unexplored_action in module submitted: choose_unexplored_action(self, state) Choose an action that has been explored less than nfirst times. If many actions are underexplored, you should choose uniformly from among those actions; don't just choose the first one all the time. @params: state (list of 5 ints): ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle, all quantized. 0 <= state[i] < state_cardinality[i], for all i in [0,4]. @return: action (scalar): either -1, or 0, or 1, or None If all actions have been explored at least n_explore times, return None. Otherwise, choose one uniformly at random from those w/count less than n_explore. When you choose an action, you should increment its count in your counter table.
If this has been written correctly, the following block should generate a random sequence of actions. If it produces the same action five times in a row, something is wrong, and your code will not pass the autograder.
importlib.reload(submitted)
q_learner = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
Next action: -1 Next action: 0 Next action: 0 Next action: 1 Next action: -1
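A sketch of choose_unexplored_action under the same assumptions; numpy's random number generator is one easy way to pick uniformly among the under-explored actions:

import numpy as np

# Method of the q_learner class; shown standalone for readability.
def choose_unexplored_action(self, state):
    counts = self.N[tuple(state)]
    # counts[a] is the count for action a; negative indexing handles a = -1.
    unexplored = [a for a in (-1, 0, 1) if counts[a] < self.nfirst]
    if len(unexplored) == 0:
        return None
    # Choose uniformly among the under-explored actions, and count it.
    action = int(np.random.choice(unexplored))
    self.N[tuple(state)][action] += 1
    return action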
After all three actions have been explored nfirst times, the function choose_unexplored_action should return None, as shown here:
importlib.reload(submitted)
q_learner = submitted.q_learner(0.05,0.05,0.99,1,[10,10,2,2,10])
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
print('Next action:',q_learner.choose_unexplored_action([9,9,1,1,9]))
Next action: -1 Next action: 1 Next action: 0 Next action: None
The reinforcement learning algorithm we are implementing is called Q-learning (https://en.wikipedia.org/wiki/Q-learning).
Q-learning keeps a table $Q[s,a]$ that specifies the expected utility of action $a$ in state $s$. The organization of this table is up to you. In order to have a standard API, the first thing you should implement is a function report_q with the following docstring:
importlib.reload(submitted)
help(submitted.q_learner.report_q)
Help on function report_q in module submitted: report_q(self, state) Report the current Q values for the given state. @params: state (list of 5 ints): ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle, all quantized. 0 <= state[i] < state_cardinality[i], for all i in [0,4]. @return: Q (array of 3 floats): reward plus expected future utility of each of the three actions. The mapping from actions to integers is up to you, but there must be three of them.
When your q_learner is first initialized, the value of $Q[state,action]$ should be zero for all state/action pairs, so the report_q function should return lists of zeros:
importlib.reload(submitted)
q_learner=submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
print('Q[0,0,0,0,0] is now:',q_learner.report_q([0,0,0,0,0]))
print('Q[9,9,1,1,9] is now:',q_learner.report_q([9,9,1,1,9]))
Q[0,0,0,0,0] is now: [0. 0. 0.] Q[9,9,1,1,9] is now: [0. 0. 0.]
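Under the same storage assumption, report_q can mirror report_exploration_counts:

# Method of the q_learner class; shown standalone for readability.
def report_q(self, state):
    # The length-3 array of Q values for this state, one per action.
    return self.Q[tuple(state)]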
There are actually many different Q-learning algorithms available, but when people refer to Q-learning with no modifier, they usually mean the temporal-difference (TD) algorithm. For example, this is the algorithm described on the Wikipedia page (https://en.wikipedia.org/wiki/Q-learning), and it is the algorithm you will implement for this MP.
In supervised machine learning, the learner tries to imitate a reference label. In reinforcement learning, there is no reference label. Q-learning replaces the reference label with a "local Q" value, which is the utility that was obtained by performing action $a$ in state $s$ one time. It is usually calculated like this:
$$Q_{local}(s_t,a_t) = r_t + \gamma\max_{a_{t+1}}Q(s_{t+1},a_{t+1})$$
where $r_t$ is the reward that was achieved by performing action $a_t$ in state $s_t$, $s_{t+1}$ is the state into which the game transitioned, and $a_{t+1}$ is one of the actions that could be performed in that state. $Q_{local}$ is computed by your q_local function, which has this docstring:
importlib.reload(submitted)
help(submitted.q_learner.q_local)
Help on function q_local in module submitted: q_local(self, reward, newstate) The update to Q estimated from a single step of game play: reward plus gamma times the max of Q[newstate, ...]. @param: reward (scalar float): the reward achieved from the current step of game play. newstate (list of 5 ints): ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle, all quantized. 0 <= state[i] < state_cardinality[i], for all i in [0,4]. @return: Q_local (scalar float): the local value of Q
Initially, q_local should just return the given reward, because all Q values start out at 0:
importlib.reload(submitted)
q_learner = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
print('Q_local(6.25,[9,9,1,1,9]) is currently:',q_learner.q_local(6.25,[9,9,1,1,9]))
Q_local(6.25,[9,9,1,1,9]) is currently: 6.25
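A sketch of q_local, assuming the same table layout: the local target is the observed reward plus gamma times the best Q value reachable from the new state.

import numpy as np

# Method of the q_learner class; shown standalone for readability.
def q_local(self, reward, newstate):
    # Reward plus the discounted value of the best action in the new state.
    return reward + self.gamma * np.max(self.Q[tuple(newstate)])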
Now you can use q_learner.q_local as the target for q_learner.learn. The basic algorithm is the standard TD update, which moves $Q(s_t,a_t)$ a fraction $\alpha$ of the way toward the local target:
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left(Q_{local}(s_t,a_t) - Q(s_t,a_t)\right)$$
Here is the docstring:
importlib.reload(submitted)
help(submitted.q_learner.learn)
Help on function learn in module submitted: learn(self, state, action, reward, newstate) Update the internal Q-table on the basis of an observed state, action, reward, newstate sequence. @params: state: a list of 5 numbers: ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle. action: an integer, one of -1, 0, or +1 reward: a reward; positive for hitting the ball, negative for losing a game newstate: a list of 5 numbers, in the same format as state @return: None
The following block checks a sequence of Q updates:
importlib.reload(submitted)
q_learner = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
q_learner.learn([9,9,1,1,9],-1,6.25,[0,0,0,0,0])
print('Q[9,9,1,1,9] is now',q_learner.report_q([9,9,1,1,9]))
q_learner.learn([9,9,1,1,8],1,3.1,[9,9,1,1,9])
print('Q[9,9,1,1,8] is now',q_learner.report_q([9,9,1,1,8]))
Q[9,9,1,1,9] is now [0. 0. 0.3125] Q[9,9,1,1,8] is now [0. 0.17046875 0. ]
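A sketch of learn, assuming the table layout from the earlier sketches. The numbers printed above follow directly from the TD update: with alpha=0.05 and all Q values initially zero, 0.05 * 6.25 = 0.3125, and 0.05 * (3.1 + 0.99 * 0.3125) = 0.17046875.

# Method of the q_learner class; shown standalone for readability.
def learn(self, state, action, reward, newstate):
    # Move Q[state, action] a fraction alpha of the way toward the
    # one-step target computed by q_local.
    target = self.q_local(reward, newstate)
    self.Q[tuple(state)][action] += self.alpha * (target - self.Q[tuple(state)][action])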
After you've spent a long time training your q_learner, you will want to save your Q and N tables so that you can reload them later. The format of Q and N is up to you; therefore, it's also up to you to write the save and load functions. Here are the docstrings:
importlib.reload(submitted)
help(submitted.q_learner.save)
Help on function save in module submitted: save(self, filename) Save your Q and N tables to a file. This can save in any format you like, as long as your "load" function uses the same file format. We recommend numpy.savez, but you can use something else if you prefer. @params: filename (str) - filename to which it should be saved @return: None
importlib.reload(submitted)
help(submitted.q_learner.load)
Help on function load in module submitted: load(self, filename) Load the Q and N tables from a file. This should load from whatever file format your save function used. We recommend numpy.load, but you can use something else if you prefer. @params: filename (str) - filename from which it should be loaded @return: None
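A minimal sketch of save and load using numpy.savez and numpy.load, as the docstrings suggest; any file format that round-trips your tables is fine.

import numpy as np

# Methods of the q_learner class; shown standalone for readability.
def save(self, filename):
    # Store both tables in a single .npz archive.
    np.savez(filename, Q=self.Q, N=self.N)

def load(self, filename):
    # Restore the tables written by save().
    loaded = np.load(filename)
    self.Q = loaded['Q']
    self.N = loaded['N']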
These functions can be tested by doing one step of training with one q_learner, then saving its results, then loading them into another q_learner:
importlib.reload(submitted)
q_learner1 = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
print('Next action:',q_learner1.choose_unexplored_action([9,9,1,1,9]))
q_learner1.learn([9,9,1,1,9],-1,6.25,[0,0,0,0,0])
print('N1[9,9,1,1,9] is now',q_learner1.report_exploration_counts([9,9,1,1,9]))
print('Q1[9,9,1,1,9] is now',q_learner1.report_q([9,9,1,1,9]))
q_learner1.save('test.npz')
q_learner2 = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
print('N2[9,9,1,1,9] starts out as',q_learner2.report_exploration_counts([9,9,1,1,9]))
print('Q2[9,9,1,1,9] starts out as',q_learner2.report_q([9,9,1,1,9]))
q_learner2.load('test.npz')
print('N2[9,9,1,1,9] is now',q_learner2.report_exploration_counts([9,9,1,1,9]))
print('Q2[9,9,1,1,9] is now',q_learner2.report_q([9,9,1,1,9]))
Next action: 0 N1[9,9,1,1,9] is now [1. 0. 0.] Q1[9,9,1,1,9] is now [0. 0. 0.3125] N2[9,9,1,1,9] starts out as [0. 0. 0.] Q2[9,9,1,1,9] starts out as [0. 0. 0.] N2[9,9,1,1,9] is now [1. 0. 0.] Q2[9,9,1,1,9] is now [0. 0. 0.3125]
A reinforcement learner always has to trade off between exploration (choosing an action at random) and exploitation (choosing the action with the maximum expected utility). Before we worry about that tradeoff, though, let's first make sure that exploitation works.
importlib.reload(submitted)
help(submitted.q_learner.exploit)
Help on function exploit in module submitted: exploit(self, state) Return the action that has the highest Q-value for the current state, and its Q-value. @params: state (list of 5 ints): ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle, all quantized. 0 <= state[i] < state_cardinality[i], for all i in [0,4]. @return: action (scalar int): either -1, or 0, or 1. The action that has the highest Q-value. Ties can be broken any way you want. Q (scalar float): The Q-value of the selected action
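A sketch of exploit under the same assumptions; ties are broken by whichever action the builtin max happens to return first, which the docstring allows.

# Method of the q_learner class; shown standalone for readability.
def exploit(self, state):
    q = self.Q[tuple(state)]
    # Pick the action with the largest Q value; q[a] uses negative
    # indexing for a = -1.  Ties can be broken any way you want.
    action = max((-1, 0, 1), key=lambda a: q[a])
    return action, q[action]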
importlib.reload(submitted)
q_learner1 = submitted.q_learner(0.05,0.05,0.99,5,[10,10,2,2,10])
q_learner1.learn([9,9,1,1,9],-1,6.25,[0,0,0,0,0])
print('Q1[9,9,1,1,9] is now',q_learner1.report_q([9,9,1,1,9]))
print('The best action and Q from state [9,9,1,1,9] are',q_learner1.exploit([9,9,1,1,9]))
Q1[9,9,1,1,9] is now [0. 0. 0.3125] The best action and Q from state [9,9,1,1,9] are (-1, 0.3125)
When your learner decides which action to perform, it should trade off exploration vs. exploitation using both the epsilon-first and the epsilon-greedy strategies: if any action has been explored fewer than nfirst times in the current state, choose one of those under-explored actions uniformly at random. Otherwise, with probability epsilon, choose an action at random. Otherwise, exploit: choose the action with the best Q value.
importlib.reload(submitted)
help(submitted.q_learner.act)
Help on function act in module submitted: act(self, state) Decide what action to take in the current state. If any action has been taken less than nfirst times, then choose one of those actions, uniformly at random. Otherwise, with probability epsilon, choose an action uniformly at random. Otherwise, choose the action with the best Q(state,action). @params: state: a list of 5 integers: ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle, all quantized. 0 <= state[i] < state_cardinality[i], for all i in [0,4]. @return: -1 if the paddle should move upward 0 if the paddle should be stationary 1 if the paddle should move downward
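A sketch of act that combines the three behaviors in the order the docstring specifies, assuming the helper methods sketched above:

import numpy as np

# Method of the q_learner class; shown standalone for readability.
def act(self, state):
    # Epsilon-first: keep exploring until every action has been tried
    # nfirst times in this state.
    action = self.choose_unexplored_action(state)
    if action is not None:
        return action
    # Epsilon-greedy: with probability epsilon, explore at random ...
    if np.random.uniform() < self.epsilon:
        return int(np.random.choice([-1, 0, 1]))
    # ... otherwise exploit the action with the best Q value.
    action, q = self.exploit(state)
    return action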
In order to test all three types of action (epsilon-first exploration, epsilon-greedy exploration, and exploitation), let's create a learner with nfirst=1 and epsilon=0.25, and set it so that the best action from state [9,9,1,1,9] is -1. With these settings, a sequence of calls to q_learner.act should behave as follows: the first three calls should explore each of the three actions once (epsilon-first), in random order. After that, about 3/4 of the actions should exploit the best action, -1. The remaining 1/4 should be randomly chosen.
importlib.reload(submitted)
q_learner=submitted.q_learner(0.05,0.25,0.99,1,[10,10,2,2,10])
q_learner.learn([9,9,1,1,9],-1,6.25,[0,0,0,0,0])
print('An epsilon-first action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-first action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-first action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
print('An epsilon-greedy explore/exploit action:',q_learner.act([9,9,1,1,9]))
An epsilon-first action: 1 An epsilon-first action: -1 An epsilon-first action: 0 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: 1 An epsilon-greedy explore/exploit action: 1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1 An epsilon-greedy explore/exploit action: -1
Now that all of your components work, you can try training your algorithm. Do this by giving your q_learner as the learner of a new pong.PongGame object. Set visible=False so that the PongGame doesn't create a new window.
import pong, importlib, submitted
importlib.reload(pong)
help(pong.PongGame.__init__)
Help on function __init__ in module pong: __init__(self, ball_speed=4, paddle_speed=8, learner=None, visible=True, state_quantization=[10, 10, 2, 2, 10]) Create a new pong game, with a specified player. @params: ball_speed (scalar int) - average ball speed in pixels/frame paddle_speed (scalar int) - paddle moves 0, +paddle_speed, or -paddle_speed learner - can be None if the player is human. If not None, should be an object of type random_learner, submitted.q_learner, or submitted.deep_q. visible (bool) - should this game have an attached pygame window? state_quantization (list) - if not None, state variables are quantized into integers of these cardinalities before being passed to the learner.
As you can see, we should set visible=False so that the PongGame doesn't create a new window. We should also make sure that the PongGame uses the same state quantization as the learner.
importlib.reload(pong)
importlib.reload(submitted)
state_quantization = [10,10,2,2,10]
q_learner=submitted.q_learner(0.05,0.05,0.99,5,state_quantization)
pong_game = pong.PongGame(learner=q_learner, visible=False, state_quantization=state_quantization)
print(pong_game)
<pong.PongGame object at 0x7fa5a427bca0>
In order to train our learner, we want it to play the game many times. To do that we use the PongGame.run function:
help(pong_game.run)
Help on method run in module pong: run(m_rewards=inf, m_games=inf, m_frames=inf, states=[]) method of pong.PongGame instance Run the game. @param m_frames (scalar int): maximum number of frames to be played m_rewards (scalar int): maximum number of rewards earned (+ or -) m_games (scalar int): maximum number of games states (list): list of states whose Q-values should be returned each state is a list of 5 ints: ball_x, ball_y, ball_vx, ball_vy, paddle_y. These are the (x,y) position of the ball, the (vx,vy) velocity of the ball, and the y-position of the paddle, all quantized. 0 <= state[i] < state_cardinality[i], for all i in [0,4]. @return scores (list): list of scores of all completed games The following will be returned only if the player is q_learning or deep_q. New elements will be added to these lists once/frame if m_frames is specified, else once/reward if m_rewards is specified, else once/game: q_achieved (list): list of the q-values of the moves that were taken q_states (list): list of the q-values of requested states
In order to make sure our learner is learning, let's tell pong_game.run to output all 3 Q-values of all 4000 states.
To make sure that's not an outrageous amount of data, let's tell it to output the Q values only once per reward, and ask it to collect only 500 rewards:
states = [[x,y,vx,vy,py] for x in range(10) for y in range(10) for vx in range(2) for vy in range(2) for py in range(10) ]
scores, q_achieved, q_states = pong_game.run(m_rewards=500, states=states)
print('The number of games played was',len(scores))
print('The number of rewards was',len(q_states))
print('The size of each returned Q-matrix was',q_states[0].shape)
Completed 0 games, 1 rewards, 209 frames, score 0, max score 0 Completed 1 games, 2 rewards, 417 frames, score 0, max score 0 Completed 2 games, 4 rewards, 758 frames, score 1, max score 1 Completed 3 games, 7 rewards, 1898 frames, score 2, max score 2 Completed 4 games, 8 rewards, 2152 frames, score 0, max score 2 Completed 5 games, 9 rewards, 2406 frames, score 0, max score 2 Completed 6 games, 10 rewards, 2587 frames, score 0, max score 2 Completed 7 games, 12 rewards, 2992 frames, score 1, max score 2 Completed 8 games, 13 rewards, 3173 frames, score 0, max score 2 Completed 9 games, 14 rewards, 3427 frames, score 0, max score 2 Completed 10 games, 17 rewards, 4312 frames, score 2, max score 2 Completed 11 games, 18 rewards, 4520 frames, score 0, max score 2 Completed 12 games, 19 rewards, 4774 frames, score 0, max score 2 Completed 13 games, 20 rewards, 4955 frames, score 0, max score 2 Completed 14 games, 22 rewards, 5338 frames, score 1, max score 2 Completed 15 games, 23 rewards, 5546 frames, score 0, max score 2 Completed 16 games, 24 rewards, 5800 frames, score 0, max score 2 Completed 17 games, 25 rewards, 6008 frames, score 0, max score 2 Completed 18 games, 26 rewards, 6262 frames, score 0, max score 2 Completed 19 games, 27 rewards, 6470 frames, score 0, max score 2 Completed 20 games, 28 rewards, 6678 frames, score 0, max score 2 Completed 21 games, 29 rewards, 6932 frames, score 0, max score 2 Completed 22 games, 30 rewards, 7140 frames, score 0, max score 2 Completed 23 games, 33 rewards, 7631 frames, score 2, max score 2 Completed 24 games, 34 rewards, 7812 frames, score 0, max score 2 Completed 25 games, 36 rewards, 8346 frames, score 1, max score 2 Completed 26 games, 39 rewards, 9495 frames, score 2, max score 2 Completed 27 games, 40 rewards, 9749 frames, score 0, max score 2 Completed 28 games, 41 rewards, 9957 frames, score 0, max score 2 Completed 29 games, 42 rewards, 10138 frames, score 0, max score 2 Completed 30 games, 43 rewards, 10319 frames, score 0, max score 2 Completed 31 games, 45 rewards, 10782 frames, score 1, max score 2 Completed 32 games, 46 rewards, 11036 frames, score 0, max score 2 Completed 33 games, 51 rewards, 11656 frames, score 4, max score 4 Completed 34 games, 53 rewards, 12153 frames, score 1, max score 4 Completed 35 games, 55 rewards, 12743 frames, score 1, max score 4 Completed 36 games, 56 rewards, 12997 frames, score 0, max score 4 Completed 37 games, 57 rewards, 13205 frames, score 0, max score 4 Completed 38 games, 60 rewards, 13891 frames, score 2, max score 4 Completed 39 games, 61 rewards, 14099 frames, score 0, max score 4 Completed 40 games, 62 rewards, 14353 frames, score 0, max score 4 Completed 41 games, 66 rewards, 15087 frames, score 3, max score 4 Completed 42 games, 67 rewards, 15268 frames, score 0, max score 4 Completed 43 games, 68 rewards, 15449 frames, score 0, max score 4 Completed 44 games, 69 rewards, 15630 frames, score 0, max score 4 Completed 45 games, 71 rewards, 16127 frames, score 1, max score 4 Completed 46 games, 73 rewards, 17093 frames, score 1, max score 4 Completed 47 games, 76 rewards, 17276 frames, score 2, max score 4 Completed 48 games, 77 rewards, 17530 frames, score 0, max score 4 Completed 49 games, 78 rewards, 17738 frames, score 0, max score 4 Completed 50 games, 79 rewards, 17919 frames, score 0, max score 4 Completed 51 games, 80 rewards, 18127 frames, score 0, max score 4 Completed 52 games, 81 rewards, 18381 frames, score 0, max score 4 Completed 53 games, 82 rewards, 18562 frames, score 
0, max score 4 Completed 54 games, 83 rewards, 18770 frames, score 0, max score 4 Completed 55 games, 84 rewards, 18978 frames, score 0, max score 4 Completed 56 games, 86 rewards, 19415 frames, score 1, max score 4 Completed 57 games, 91 rewards, 20351 frames, score 4, max score 4 Completed 58 games, 93 rewards, 20885 frames, score 1, max score 4 Completed 59 games, 95 rewards, 21256 frames, score 1, max score 4 Completed 60 games, 96 rewards, 21510 frames, score 0, max score 4 Completed 61 games, 98 rewards, 22007 frames, score 1, max score 4 Completed 62 games, 99 rewards, 22188 frames, score 0, max score 4 Completed 63 games, 100 rewards, 22369 frames, score 0, max score 4 Completed 64 games, 101 rewards, 22577 frames, score 0, max score 4 Completed 65 games, 103 rewards, 22982 frames, score 1, max score 4 Completed 66 games, 104 rewards, 23236 frames, score 0, max score 4 Completed 67 games, 106 rewards, 23724 frames, score 1, max score 4 Completed 68 games, 107 rewards, 23932 frames, score 0, max score 4 Completed 69 games, 108 rewards, 24113 frames, score 0, max score 4 Completed 70 games, 110 rewards, 24591 frames, score 1, max score 4 Completed 71 games, 111 rewards, 24799 frames, score 0, max score 4 Completed 72 games, 113 rewards, 25227 frames, score 1, max score 4 Completed 73 games, 117 rewards, 26236 frames, score 3, max score 4 Completed 74 games, 119 rewards, 26838 frames, score 1, max score 4 Completed 75 games, 120 rewards, 27046 frames, score 0, max score 4 Completed 76 games, 121 rewards, 27227 frames, score 0, max score 4 Completed 77 games, 122 rewards, 27481 frames, score 0, max score 4 Completed 78 games, 124 rewards, 28057 frames, score 1, max score 4 Completed 79 games, 125 rewards, 28311 frames, score 0, max score 4 Completed 80 games, 126 rewards, 28565 frames, score 0, max score 4 Completed 81 games, 127 rewards, 28819 frames, score 0, max score 4 Completed 82 games, 129 rewards, 29395 frames, score 1, max score 4 Completed 83 games, 130 rewards, 29576 frames, score 0, max score 4 Completed 84 games, 132 rewards, 30110 frames, score 1, max score 4 Completed 85 games, 133 rewards, 30364 frames, score 0, max score 4 Completed 86 games, 134 rewards, 30545 frames, score 0, max score 4 Completed 87 games, 136 rewards, 30956 frames, score 1, max score 4 Completed 88 games, 137 rewards, 31137 frames, score 0, max score 4 Completed 89 games, 140 rewards, 32113 frames, score 2, max score 4 Completed 90 games, 141 rewards, 32367 frames, score 0, max score 4 Completed 91 games, 142 rewards, 32575 frames, score 0, max score 4 Completed 92 games, 143 rewards, 32783 frames, score 0, max score 4 Completed 93 games, 144 rewards, 33037 frames, score 0, max score 4 Completed 94 games, 145 rewards, 33291 frames, score 0, max score 4 Completed 95 games, 148 rewards, 34829 frames, score 2, max score 4 Completed 96 games, 149 rewards, 35083 frames, score 0, max score 4 Completed 97 games, 150 rewards, 35291 frames, score 0, max score 4 Completed 98 games, 153 rewards, 35781 frames, score 2, max score 4 Completed 99 games, 154 rewards, 35989 frames, score 0, max score 4 Completed 100 games, 155 rewards, 36170 frames, score 0, max score 4 Completed 101 games, 156 rewards, 36351 frames, score 0, max score 4 Completed 102 games, 158 rewards, 36756 frames, score 1, max score 4 Completed 103 games, 159 rewards, 36964 frames, score 0, max score 4 Completed 104 games, 161 rewards, 37392 frames, score 1, max score 4 Completed 105 games, 163 rewards, 38126 frames, score 1, max score 4 
Completed 106 games, 166 rewards, 38806 frames, score 2, max score 4 Completed 107 games, 167 rewards, 39014 frames, score 0, max score 4 Completed 108 games, 168 rewards, 39195 frames, score 0, max score 4 Completed 109 games, 171 rewards, 39943 frames, score 2, max score 4 Completed 110 games, 172 rewards, 40151 frames, score 0, max score 4 Completed 111 games, 173 rewards, 40332 frames, score 0, max score 4 Completed 112 games, 175 rewards, 41066 frames, score 1, max score 4 Completed 113 games, 176 rewards, 41320 frames, score 0, max score 4 Completed 114 games, 178 rewards, 42286 frames, score 1, max score 4 Completed 115 games, 179 rewards, 42494 frames, score 0, max score 4 Completed 116 games, 181 rewards, 43096 frames, score 1, max score 4 Completed 117 games, 185 rewards, 44183 frames, score 3, max score 4 Completed 118 games, 186 rewards, 44437 frames, score 0, max score 4 Completed 119 games, 187 rewards, 44691 frames, score 0, max score 4 Completed 120 games, 188 rewards, 44872 frames, score 0, max score 4 Completed 121 games, 189 rewards, 45053 frames, score 0, max score 4 Completed 122 games, 191 rewards, 45669 frames, score 1, max score 4 Completed 123 games, 192 rewards, 45877 frames, score 0, max score 4 Completed 124 games, 194 rewards, 46282 frames, score 1, max score 4 Completed 125 games, 197 rewards, 46666 frames, score 2, max score 4 Completed 126 games, 198 rewards, 46847 frames, score 0, max score 4 Completed 127 games, 199 rewards, 47101 frames, score 0, max score 4 Completed 128 games, 201 rewards, 47535 frames, score 1, max score 4 Completed 129 games, 204 rewards, 48573 frames, score 2, max score 4 Completed 130 games, 206 rewards, 49001 frames, score 1, max score 4 Completed 131 games, 207 rewards, 49182 frames, score 0, max score 4 Completed 132 games, 208 rewards, 49436 frames, score 0, max score 4 Completed 133 games, 210 rewards, 49970 frames, score 1, max score 4 Completed 134 games, 213 rewards, 50669 frames, score 2, max score 4 Completed 135 games, 214 rewards, 50850 frames, score 0, max score 4 Completed 136 games, 216 rewards, 51338 frames, score 1, max score 4 Completed 137 games, 217 rewards, 51519 frames, score 0, max score 4 Completed 138 games, 220 rewards, 52349 frames, score 2, max score 4 Completed 139 games, 223 rewards, 52896 frames, score 2, max score 4 Completed 140 games, 224 rewards, 53150 frames, score 0, max score 4 Completed 141 games, 226 rewards, 53658 frames, score 1, max score 4 Completed 142 games, 227 rewards, 53912 frames, score 0, max score 4 Completed 143 games, 228 rewards, 54166 frames, score 0, max score 4 Completed 144 games, 230 rewards, 54594 frames, score 1, max score 4 Completed 145 games, 231 rewards, 54848 frames, score 0, max score 4 Completed 146 games, 232 rewards, 55029 frames, score 0, max score 4 Completed 147 games, 233 rewards, 55283 frames, score 0, max score 4 Completed 148 games, 235 rewards, 55885 frames, score 1, max score 4 Completed 149 games, 236 rewards, 56066 frames, score 0, max score 4 Completed 150 games, 237 rewards, 56320 frames, score 0, max score 4 Completed 151 games, 238 rewards, 56528 frames, score 0, max score 4 Completed 152 games, 239 rewards, 56709 frames, score 0, max score 4 Completed 153 games, 240 rewards, 56917 frames, score 0, max score 4 Completed 154 games, 241 rewards, 57125 frames, score 0, max score 4 Completed 155 games, 243 rewards, 58091 frames, score 1, max score 4 Completed 156 games, 244 rewards, 58299 frames, score 0, max score 4 Completed 157 games, 245 rewards, 
58553 frames, score 0, max score 4 Completed 158 games, 246 rewards, 58807 frames, score 0, max score 4 Completed 159 games, 247 rewards, 59015 frames, score 0, max score 4 Completed 160 games, 248 rewards, 59269 frames, score 0, max score 4 Completed 161 games, 250 rewards, 59931 frames, score 1, max score 4 Completed 162 games, 252 rewards, 60414 frames, score 1, max score 4 Completed 163 games, 253 rewards, 60622 frames, score 0, max score 4 Completed 164 games, 254 rewards, 60876 frames, score 0, max score 4 Completed 165 games, 256 rewards, 61252 frames, score 1, max score 4 Completed 166 games, 257 rewards, 61460 frames, score 0, max score 4 Completed 167 games, 258 rewards, 61668 frames, score 0, max score 4 Completed 168 games, 261 rewards, 62355 frames, score 2, max score 4 Completed 169 games, 263 rewards, 62720 frames, score 1, max score 4 Completed 170 games, 264 rewards, 62928 frames, score 0, max score 4 Completed 171 games, 265 rewards, 63136 frames, score 0, max score 4 Completed 172 games, 266 rewards, 63344 frames, score 0, max score 4 Completed 173 games, 267 rewards, 63552 frames, score 0, max score 4 Completed 174 games, 268 rewards, 63760 frames, score 0, max score 4 Completed 175 games, 269 rewards, 64014 frames, score 0, max score 4 Completed 176 games, 270 rewards, 64222 frames, score 0, max score 4 Completed 177 games, 271 rewards, 64476 frames, score 0, max score 4 Completed 178 games, 272 rewards, 64730 frames, score 0, max score 4 Completed 179 games, 273 rewards, 64984 frames, score 0, max score 4 Completed 180 games, 274 rewards, 65238 frames, score 0, max score 4 Completed 181 games, 275 rewards, 65446 frames, score 0, max score 4 Completed 182 games, 279 rewards, 66099 frames, score 3, max score 4 Completed 183 games, 280 rewards, 66280 frames, score 0, max score 4 Completed 184 games, 284 rewards, 67394 frames, score 3, max score 4 Completed 185 games, 287 rewards, 67985 frames, score 2, max score 4 Completed 186 games, 288 rewards, 68193 frames, score 0, max score 4 Completed 187 games, 290 rewards, 68783 frames, score 1, max score 4 Completed 188 games, 291 rewards, 69037 frames, score 0, max score 4 Completed 189 games, 292 rewards, 69291 frames, score 0, max score 4 Completed 190 games, 293 rewards, 69472 frames, score 0, max score 4 Completed 191 games, 294 rewards, 69653 frames, score 0, max score 4 Completed 192 games, 296 rewards, 70874 frames, score 1, max score 4 Completed 193 games, 297 rewards, 71055 frames, score 0, max score 4 Completed 194 games, 299 rewards, 71518 frames, score 1, max score 4 Completed 195 games, 301 rewards, 72052 frames, score 1, max score 4 Completed 196 games, 303 rewards, 72393 frames, score 1, max score 4 Completed 197 games, 304 rewards, 72574 frames, score 0, max score 4 Completed 198 games, 306 rewards, 73071 frames, score 1, max score 4 Completed 199 games, 307 rewards, 73252 frames, score 0, max score 4 Completed 200 games, 309 rewards, 73712 frames, score 1, max score 4 Completed 201 games, 310 rewards, 73893 frames, score 0, max score 4 Completed 202 games, 311 rewards, 74074 frames, score 0, max score 4 Completed 203 games, 313 rewards, 74557 frames, score 1, max score 4 Completed 204 games, 314 rewards, 74738 frames, score 0, max score 4 Completed 205 games, 316 rewards, 75136 frames, score 1, max score 4 Completed 206 games, 318 rewards, 75519 frames, score 1, max score 4 Completed 207 games, 320 rewards, 76016 frames, score 1, max score 4 Completed 208 games, 323 rewards, 76698 frames, score 2, max score 4 
Completed 209 games, 324 rewards, 76906 frames, score 0, max score 4 Completed 210 games, 325 rewards, 77087 frames, score 0, max score 4 Completed 211 games, 326 rewards, 77295 frames, score 0, max score 4 Completed 212 games, 329 rewards, 77905 frames, score 2, max score 4 Completed 213 games, 330 rewards, 78086 frames, score 0, max score 4 Completed 214 games, 331 rewards, 78340 frames, score 0, max score 4 Completed 215 games, 335 rewards, 79707 frames, score 3, max score 4 Completed 216 games, 337 rewards, 80204 frames, score 1, max score 4 Completed 217 games, 338 rewards, 80412 frames, score 0, max score 4 Completed 218 games, 339 rewards, 80620 frames, score 0, max score 4 Completed 219 games, 340 rewards, 80874 frames, score 0, max score 4 Completed 220 games, 341 rewards, 81082 frames, score 0, max score 4 Completed 221 games, 342 rewards, 81290 frames, score 0, max score 4 Completed 222 games, 343 rewards, 81544 frames, score 0, max score 4 Completed 223 games, 344 rewards, 81752 frames, score 0, max score 4 Completed 224 games, 348 rewards, 82636 frames, score 3, max score 4 Completed 225 games, 349 rewards, 82817 frames, score 0, max score 4 Completed 226 games, 356 rewards, 84231 frames, score 6, max score 6 Completed 227 games, 358 rewards, 84694 frames, score 1, max score 6 Completed 228 games, 360 rewards, 85154 frames, score 1, max score 6 Completed 229 games, 362 rewards, 85688 frames, score 1, max score 6 Completed 230 games, 363 rewards, 85869 frames, score 0, max score 6 Completed 231 games, 364 rewards, 86123 frames, score 0, max score 6 Completed 232 games, 365 rewards, 86377 frames, score 0, max score 6 Completed 233 games, 366 rewards, 86558 frames, score 0, max score 6 Completed 234 games, 367 rewards, 86739 frames, score 0, max score 6 Completed 235 games, 368 rewards, 86947 frames, score 0, max score 6 Completed 236 games, 371 rewards, 87522 frames, score 2, max score 6 Completed 237 games, 373 rewards, 87927 frames, score 1, max score 6 Completed 238 games, 375 rewards, 88481 frames, score 1, max score 6 Completed 239 games, 377 rewards, 88969 frames, score 1, max score 6 Completed 240 games, 378 rewards, 89177 frames, score 0, max score 6 Completed 241 games, 379 rewards, 89385 frames, score 0, max score 6 Completed 242 games, 381 rewards, 89783 frames, score 1, max score 6 Completed 243 games, 382 rewards, 89991 frames, score 0, max score 6 Completed 244 games, 383 rewards, 90172 frames, score 0, max score 6 Completed 245 games, 384 rewards, 90380 frames, score 0, max score 6 Completed 246 games, 385 rewards, 90634 frames, score 0, max score 6 Completed 247 games, 388 rewards, 91360 frames, score 2, max score 6 Completed 248 games, 389 rewards, 91614 frames, score 0, max score 6 Completed 249 games, 393 rewards, 92588 frames, score 3, max score 6 Completed 250 games, 395 rewards, 92988 frames, score 1, max score 6 Completed 251 games, 397 rewards, 93425 frames, score 1, max score 6 Completed 252 games, 398 rewards, 93679 frames, score 0, max score 6 Completed 253 games, 400 rewards, 94341 frames, score 1, max score 6 Completed 254 games, 402 rewards, 94875 frames, score 1, max score 6 Completed 255 games, 403 rewards, 95056 frames, score 0, max score 6 Completed 256 games, 404 rewards, 95310 frames, score 0, max score 6 Completed 257 games, 405 rewards, 95564 frames, score 0, max score 6 Completed 258 games, 406 rewards, 95772 frames, score 0, max score 6 Completed 259 games, 407 rewards, 96026 frames, score 0, max score 6 Completed 260 games, 409 rewards, 
96760 frames, score 1, max score 6 Completed 261 games, 410 rewards, 96941 frames, score 0, max score 6 Completed 262 games, 411 rewards, 97122 frames, score 0, max score 6 Completed 263 games, 412 rewards, 97330 frames, score 0, max score 6 Completed 264 games, 414 rewards, 97758 frames, score 1, max score 6 Completed 265 games, 415 rewards, 97939 frames, score 0, max score 6 Completed 266 games, 416 rewards, 98120 frames, score 0, max score 6 Completed 267 games, 417 rewards, 98328 frames, score 0, max score 6 Completed 268 games, 419 rewards, 98695 frames, score 1, max score 6 Completed 269 games, 423 rewards, 99465 frames, score 3, max score 6 Completed 270 games, 424 rewards, 99646 frames, score 0, max score 6 Completed 271 games, 425 rewards, 99854 frames, score 0, max score 6 Completed 272 games, 428 rewards, 100469 frames, score 2, max score 6 Completed 273 games, 429 rewards, 100650 frames, score 0, max score 6 Completed 274 games, 431 rewards, 101147 frames, score 1, max score 6 Completed 275 games, 432 rewards, 101328 frames, score 0, max score 6 Completed 276 games, 437 rewards, 102076 frames, score 4, max score 6 Completed 277 games, 438 rewards, 102284 frames, score 0, max score 6 Completed 278 games, 439 rewards, 102492 frames, score 0, max score 6 Completed 279 games, 440 rewards, 102673 frames, score 0, max score 6 Completed 280 games, 441 rewards, 102927 frames, score 0, max score 6 Completed 281 games, 442 rewards, 103181 frames, score 0, max score 6 Completed 282 games, 443 rewards, 103362 frames, score 0, max score 6 Completed 283 games, 444 rewards, 103543 frames, score 0, max score 6 Completed 284 games, 445 rewards, 103724 frames, score 0, max score 6 Completed 285 games, 446 rewards, 103932 frames, score 0, max score 6 Completed 286 games, 447 rewards, 104113 frames, score 0, max score 6 Completed 287 games, 448 rewards, 104321 frames, score 0, max score 6 Completed 288 games, 449 rewards, 104502 frames, score 0, max score 6 Completed 289 games, 451 rewards, 105104 frames, score 1, max score 6 Completed 290 games, 452 rewards, 105285 frames, score 0, max score 6 Completed 291 games, 453 rewards, 105466 frames, score 0, max score 6 Completed 292 games, 455 rewards, 105849 frames, score 1, max score 6 Completed 293 games, 457 rewards, 106254 frames, score 1, max score 6 Completed 294 games, 458 rewards, 106435 frames, score 0, max score 6 Completed 295 games, 460 rewards, 106841 frames, score 1, max score 6 Completed 296 games, 461 rewards, 107022 frames, score 0, max score 6 Completed 297 games, 462 rewards, 107276 frames, score 0, max score 6 Completed 298 games, 463 rewards, 107457 frames, score 0, max score 6 Completed 299 games, 465 rewards, 108119 frames, score 1, max score 6 Completed 300 games, 467 rewards, 108530 frames, score 1, max score 6 Completed 301 games, 468 rewards, 108784 frames, score 0, max score 6 Completed 302 games, 469 rewards, 108965 frames, score 0, max score 6 Completed 303 games, 470 rewards, 109173 frames, score 0, max score 6 Completed 304 games, 471 rewards, 109381 frames, score 0, max score 6 Completed 305 games, 472 rewards, 109562 frames, score 0, max score 6 Completed 306 games, 473 rewards, 109816 frames, score 0, max score 6 Completed 307 games, 475 rewards, 110184 frames, score 1, max score 6 Completed 308 games, 476 rewards, 110438 frames, score 0, max score 6 Completed 309 games, 477 rewards, 110692 frames, score 0, max score 6 Completed 310 games, 478 rewards, 110873 frames, score 0, max score 6 Completed 311 games, 479 
rewards, 111054 frames, score 0, max score 6 Completed 312 games, 480 rewards, 111262 frames, score 0, max score 6 Completed 313 games, 481 rewards, 111470 frames, score 0, max score 6 Completed 314 games, 482 rewards, 111651 frames, score 0, max score 6 Completed 315 games, 483 rewards, 111859 frames, score 0, max score 6 Completed 316 games, 484 rewards, 112040 frames, score 0, max score 6 Completed 317 games, 485 rewards, 112294 frames, score 0, max score 6 Completed 318 games, 487 rewards, 112677 frames, score 1, max score 6 Completed 319 games, 488 rewards, 112858 frames, score 0, max score 6 Completed 320 games, 489 rewards, 113112 frames, score 0, max score 6 Completed 321 games, 490 rewards, 113293 frames, score 0, max score 6 Completed 322 games, 491 rewards, 113474 frames, score 0, max score 6 Completed 323 games, 492 rewards, 113682 frames, score 0, max score 6 Completed 324 games, 493 rewards, 113863 frames, score 0, max score 6 Completed 325 games, 494 rewards, 114044 frames, score 0, max score 6 Completed 326 games, 495 rewards, 114298 frames, score 0, max score 6 Completed 327 games, 496 rewards, 114552 frames, score 0, max score 6 Completed 328 games, 500 rewards, 115532 frames, score 3, max score 6 The number of games played was 330 The number of rewards was 500 The size of each returned Q-matrix was (4000, 3)
The returned value of q_states is a list of 4000x3 numpy arrays (4000 states, 3 actions). The list contains m_rewards of these arrays. We want to convert it into something that matplotlib can plot.
import numpy as np
Q = np.array([np.reshape(q,-1) for q in q_states])
print('Q is now of shape',Q.shape)
print('the max absolute value of Q is ',np.amax(abs(Q)))
Q is now of shape (500, 12000) the max absolute value of Q is 1.42625
%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(14,6),layout='tight')
ax = [ fig.add_subplot(2,1,x) for x in range(1,3) ]
ax[0].plot(np.arange(0,len(q_states)),Q)
ax[0].set_title('Q values of all states')
ax[1].plot(np.arange(0,len(q_states)),q_achieved)
ax[1].set_title('Q values of state achieved at each time')
ax[1].set_xlabel('Reward number')
Text(0.5, 0, 'Reward number')