Introduction
Reinforcement Learning (RL) is the task of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. The two interact continually, the agent selecting actions and the environment responding to those actions by presenting rewards and new states.
In the first part of the project, we will learn the optimal policy of an agent navigating in a 2D environment. We will implement the Value iteration algorithm to learn the optimal policy.
Inverse Reinforcement Learning (IRL) is the task of extracting an expert’s reward function by observing the optimal policy of the expert. In the second part of the project, we will explore the application of IRL in the context of apprenticeship learning.

Reinforcement learning (RL)
The two main objects in Reinforcement learning are:
- Agent
- Environment
In this project, we will learn the optimal policy of a single agent navigating in a 2D environment.
2.1 Environment
In this project, we assume that the environment of the agent is modeled by a Markov Decision Process (MDP). In an MDP, agents occupy a state of the environment and perform actions to change the state they are in. After taking an action, they are given some representation of the new state and a reward value associated with the new state.
Formally, an MDP is a tuple (S, A, P^a_{ss'}, R^a_{ss'}, γ) where:
- S is a set of states
- A is a set of actions
- P^a_{ss'} is a set of transition probabilities, where P^a_{ss'} is the probability of transitioning from state s ∈ S to state s' ∈ S after taking action a ∈ A:
  P^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a)
- R^a_{ss'} is the expected value of the next reward, given any current state and action, s and a, together with any next state s':
  R^a_{ss'} = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s')
- γ ∈ [0, 1) is the discount factor, used to compute the present value of future rewards:
  - If γ is close to 1, then future rewards are discounted less
  - If γ is close to 0, then future rewards are discounted more
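As a quick numerical illustration of the discount factor (a standalone sketch, not a project deliverable): with γ = 0.8, a constant reward of 1 per step has present value Σ_{t≥0} γ^t = 1/(1 − γ) = 5.

```python
gamma = 0.8  # discount factor

# Present value of receiving a reward of 1 at every future step:
# sum of gamma^t over t = 0, 1, 2, ... (truncated at 200 terms,
# which is more than enough for convergence at gamma = 0.8)
present_value = sum(gamma ** t for t in range(200))
print(round(present_value, 6))  # -> 5.0, i.e. 1 / (1 - gamma)
```

A smaller γ shrinks this sum toward 1, which is the precise sense in which future rewards are "discounted more".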
In the next few subsections, we will discuss the parameters that will be used to generate the environment for the project.
2.1.1 State space
In this project, we consider the state space to be a 2D square grid with 100 states. The 2D square grid, along with the numbering of the states, is shown in figure 1.
Figure 1: 2D square grid with state numbering
2.1.2 Action set
In this project, we consider the action set A to contain the 4 following actions:
- Move right
- Move left
- Move up
- Move down
The 4 types of actions are displayed in figure 2.
Figure 2: 4 types of action
From the above figure, we can see that the agent can take 4 actions from the state marked with a dot.
2.1.3 Transition probabilities
In this project, we define the transition probabilities in the following manner:
1. If states s' and s are not neighboring states in the 2D grid, then
   P(s_{t+1} = s' | s_t = s, a_t = a) = 0
   States s' and s are neighbors in the 2D grid if you can move to s' from s by taking an action a from the action set A. We will consider a state s to be a neighbor of itself. For example, from figure 1 we can observe that states 1 and 11 are neighbors (we can transition from 1 to 11 by moving right) but states 1 and 12 are not neighbors.
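The neighbor relation just described can be expressed as a short helper. This is an illustrative sketch (the function name and layout are my own, not prescribed by the project), assuming the state numbering of figure 1, where moving right adds 10 to the state index and moving down adds 1:

```python
N = 10  # the grid is N x N, numbered down the columns as in figure 1

def neighbors(s):
    """States reachable from s in one action; a state is its own neighbor."""
    row, col = s % N, s // N
    nbrs = {s}                      # a state is considered its own neighbor
    if row > 0:
        nbrs.add(s - 1)             # move up
    if row < N - 1:
        nbrs.add(s + 1)             # move down
    if col > 0:
        nbrs.add(s - N)             # move left
    if col < N - 1:
        nbrs.add(s + N)             # move right
    return nbrs
```

With this layout, neighbors(1) contains 0, 2, and 11 (plus 1 itself) but not 12, matching the example in the text.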

2. Each action corresponds to a movement in the intended direction with probability 1 − w, but with probability w the agent moves in a random direction instead, due to wind. To illustrate this, let's consider the states shown in figure 3.
Figure 3: Inner grid states (Nonboundary states)
The transition probabilities for the nonboundary states shown in figure 3 are given below:

P(s_{t+1} = 43 | s_t = 44, a_t = ↑) = 1 − w + w/4
P(s_{t+1} = 34 | s_t = 44, a_t = ↑) = w/4
P(s_{t+1} = 54 | s_t = 44, a_t = ↑) = w/4
P(s_{t+1} = 45 | s_t = 44, a_t = ↑) = w/4

From the above calculation it can be observed that if the agent is at a nonboundary state then it has 4 neighbors excluding itself, and the probability w is uniformly distributed over the 4 neighbors. Also, if the agent is at a nonboundary state then it always transitions to a new state after taking an action (P(s_{t+1} = 44 | s_t = 44, a_t = ↑) = 0).

3. If the agent is at one of the four corner states (0, 9, 90, 99), the agent stays at the current state if it takes an action to move off the grid or is blown off the grid by wind. The actions can be divided into two categories:
- Actions that move off the grid
- Actions that stay in the grid
To illustrate this, let's consider the states shown in figure 4.
Figure 4: Corner states
The transition probabilities for taking an action to move off the grid are given below:

P(s_{t+1} = 10 | s_t = 0, a_t = ↑) = w/4
P(s_{t+1} = 1 | s_t = 0, a_t = ↑) = w/4
P(s_{t+1} = 0 | s_t = 0, a_t = ↑) = 1 − w + w/4 + w/4
The transition probabilities for taking an action to stay in the grid are given below:

P(s_{t+1} = 10 | s_t = 0, a_t = →) = 1 − w + w/4
P(s_{t+1} = 1 | s_t = 0, a_t = →) = w/4
P(s_{t+1} = 0 | s_t = 0, a_t = →) = w/4 + w/4

At a corner state, you can be blown off the grid in two directions. As a result, we have P(s_{t+1} = 0 | s_t = 0, a_t = →) = w/4 + w/4, since we can be blown off the grid in two directions and in both cases we stay at the current state.

4. If the agent is at one of the edge states, the agent stays at the current state if it takes an action to move off the grid or is blown off the grid by wind. The actions can be divided into two categories:
- Actions that move off the grid
- Actions that stay in the grid
To illustrate this, let's consider the states shown in figure 5.
Figure 5: Edge states
The transition probabilities for taking an action to move off the grid are given below:

P(s_{t+1} = 0 | s_t = 1, a_t = ←) = w/4
P(s_{t+1} = 11 | s_t = 1, a_t = ←) = w/4
P(s_{t+1} = 2 | s_t = 1, a_t = ←) = w/4
P(s_{t+1} = 1 | s_t = 1, a_t = ←) = 1 − w + w/4
The transition probabilities for taking an action to stay in the grid are given below:

P(s_{t+1} = 0 | s_t = 1, a_t = ↑) = 1 − w + w/4
P(s_{t+1} = 11 | s_t = 1, a_t = ↑) = w/4
P(s_{t+1} = 2 | s_t = 1, a_t = ↑) = w/4
P(s_{t+1} = 1 | s_t = 1, a_t = ↑) = w/4

At an edge state, you can be blown off the grid in one direction. As a result, we have P(s_{t+1} = 1 | s_t = 1, a_t = ↑) = w/4, since we can be blown off the grid in one direction and in that case we stay at the current state.
The main difference between a corner state and an edge state is that a corner state has 2 neighbors (excluding itself) and an edge state has 3 neighbors (excluding itself).
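All four transition rules above (zero probability for non-neighbors, the intended move with probability 1 − w, wind blowing uniformly in the 4 directions with total probability w, and staying put whenever a move would leave the grid) can be collected into a single transition array. This is an illustrative sketch rather than required code; the array layout and helper names are my own, and the state numbering is assumed to follow figure 1 (moving right adds 10, moving down adds 1):

```python
import numpy as np

N, W = 10, 0.1        # 10 x 10 grid, wind probability w = 0.1
S = N * N             # 100 states
# Row/column offsets for the four actions of figure 2.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ACTIONS = list(MOVES)

def step(s, move):
    """Deterministic effect of a move; stay put if it would leave the grid."""
    row, col = s % N, s // N
    dr, dc = MOVES[move]
    row2, col2 = row + dr, col + dc
    if 0 <= row2 < N and 0 <= col2 < N:
        return col2 * N + row2
    return s

# P[a, s, s'] = probability of landing in s' from s under action a.
P = np.zeros((len(ACTIONS), S, S))
for ai, action in enumerate(ACTIONS):
    for s in range(S):
        P[ai, s, step(s, action)] += 1 - W       # intended direction
        for wind in MOVES:                       # wind: w/4 in each direction
            P[ai, s, step(s, wind)] += W / 4
```

The corner and edge cases fall out automatically, since every off-grid outcome collapses onto the current state; for example, this reproduces P(s_{t+1} = 0 | s_t = 0, a_t = ↑) = 1 − w + w/4 + w/4.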
2.1.4 Reward function
To simplify the project, we will assume that the reward function is independent of the current state (s) and the action that you take at the current state (a). To be specific, the reward function only depends on the state that you transition to (s'). With this simplification, we have

R^a_{ss'} = R(s')
In this project, we will learn the optimal policy of an agent for two different reward functions:
- Reward function 1
- Reward function 2
The two reward functions are displayed in figures 6 and 7, respectively.
Figure 6: Reward function 1
Figure 7: Reward function 2
Question 1: (10 points) For visualization purposes, generate heat maps of Reward function 1 and Reward function 2. For the heat maps, make sure you display the coloring scale. You will have 2 plots for this question.
For solving question 1, you might find the following function useful:
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pcolor.html
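As a minimal sketch of how pcolor can produce such a heat map, here is an example using a hypothetical reward grid (zero everywhere except one state) purely for illustration; the actual inputs are Reward functions 1 and 2 from figures 6 and 7:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical reward grid for illustration only: 0 everywhere, +1 at one state.
reward = np.zeros((10, 10))
reward[9, 9] = 1.0

fig, ax = plt.subplots()
mesh = ax.pcolor(reward)
fig.colorbar(mesh, ax=ax)  # the coloring scale the question asks for
ax.invert_yaxis()          # flip so row 0 is drawn at the top, if that matches your numbering
fig.savefig("reward_heatmap.png")
```

The same call works for both reward functions once they are stored as 10 x 10 arrays.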

Optimal policy learning using RL algorithms
In this part of the project, we will use a reinforcement learning (RL) algorithm to find the optimal policy. The main steps in the RL algorithm are:
- Find the optimal state-value or action-value function
- Use the optimal state-value or action-value function to determine the deterministic optimal policy
There are a couple of RL algorithms, but we will use the Value iteration algorithm since it was discussed in detail in the lecture. We will skip the derivation of the algorithm here because it was covered in the lecture (for the derivation details, please refer to the lecture slides on Reinforcement learning). We reproduce the algorithm below for ease of implementation:

1:  procedure ValueIteration(P^a_{ss'}, R^a_{ss'}, S, A, ε)
2:      for all s ∈ S do                              ▷ Initialization
3:          V(s) ← 0
4:      end for
5:      Δ ← ∞
6:      while Δ > ε do                                ▷ Estimation
7:          Δ ← 0
8:          for all s ∈ S do
9:              v ← V(s)
10:             V(s) ← max_{a ∈ A} Σ_{s' ∈ S} P^a_{ss'} [R^a_{ss'} + γV(s')]
11:             Δ ← max(Δ, |v − V(s)|)
12:         end for
13:     end while
14:     for all s ∈ S do                              ▷ Computation
15:         π(s) ← argmax_{a ∈ A} Σ_{s' ∈ S} P^a_{ss'} [R^a_{ss'} + γV(s')]
16:     end for
17:     return π
18: end procedure
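A compact Python sketch of the algorithm above, vectorized with NumPy (an illustrative implementation under the section-2.1.4 assumption that the reward depends only on the next state; the names and array shapes are my own choices, not prescribed by the project):

```python
import numpy as np

def value_iteration(P, R, gamma, eps):
    """Value iteration as in the pseudocode above.

    P: array of shape (num_actions, num_states, num_states), P[a, s, s'].
    R: array of shape (num_states,), reward R(s') for entering state s'.
    Returns the optimal state values V and a deterministic policy pi.
    """
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)          # Initialization (lines 2-4)
    delta = np.inf
    while delta > eps:                # Estimation (lines 5-13)
        Q = P @ (R + gamma * V)       # Q[a, s] = sum_s' P[a,s,s'] (R(s') + gamma V(s'))
        V_new = Q.max(axis=0)
        delta = np.abs(V_new - V).max()
        V = V_new
    pi = (P @ (R + gamma * V)).argmax(axis=0)   # Computation (lines 14-17)
    return V, pi
```

One design note: the pseudocode updates V(s) in place as it sweeps the states (Gauss-Seidel style), while this sketch updates all states at once (Jacobi style); both converge to the same fixed point.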
Question 2: (40 points) Create the environment of the agent using the information provided in section 2. To be specific, create the MDP by setting up the state space, action set, transition probabilities, discount factor, and reward function. For creating the environment, use the following set of parameters:
- Number of states = 100 (the state space is a 10 by 10 square grid as displayed in figure 1)
- Number of actions = 4 (the set of possible actions is displayed in figure 2)
- w = 0.1
- Discount factor γ = 0.8
- Reward function 1
After you have created the environment, write an optimal state-value function that takes as input the environment of the agent and outputs the optimal value of each state in the grid. For the optimal state-value function, you have to implement the Initialization (lines 2-4) and Estimation (lines 5-13) steps of the Value Iteration algorithm. For the estimation step, use ε = 0.01. For visualization purposes, you should generate a figure similar to figure 1 but with the number of the state replaced by the optimal value of that state. In this question, you should have 1 plot.
Question 3: (5 points) Generate a heat map of the optimal state values across the 2D grid. For generating the heat map, you can use the same function provided in the hint earlier (see the hint after question 1).
Question 4: (15 points) Explain the distribution of the optimal state values across the 2D grid. (Hint: Use the figure generated in question 3 to explain.)
Question 5: (30 points) Implement the computation step of the value iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2D state space. For visualization purposes, you should generate a figure similar to figure 1 but with the number of the state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. Is it possible for the agent to compute the optimal action to take at each state by observing the optimal values of its neighboring states? In this question, you should have 1 plot.
Question 6: (10 points) Modify the environment of the agent by replacing Reward function 1 with Reward function 2. Use the optimal state-value function implemented in question 2 to compute the optimal value of each state in the grid. For visualization purposes, you should generate a figure similar to figure 1 but with the number of the state replaced by the optimal value of that state. In this question, you should have 1 plot.
Question 7: (10 points) Generate a heat map of the optimal state values (found in question 6) across the 2D grid. For generating the heat map, you can use the same function provided in the hint earlier.
Question 8: (20 points) Explain the distribution of the optimal state values across the 2D grid. (Hint: Use the figure generated in question 7 to explain.)
Question 9: (20 points) Implement the computation step of the value iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2D state space. For visualization purposes, you should generate a figure similar to figure 1 but with the number of the state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. In this question, you should have 1 plot.

Inverse Reinforcement learning (IRL)
Inverse Reinforcement learning (IRL) is the task of learning an expert’s reward function by observing the optimal behavior of the expert. The motivation for IRL comes from apprenticeship learning. In apprenticeship learning, the goal of the agent is to learn a policy by observing the behavior of an expert. This task can be accomplished in two ways:

Learn the policy directly from expert behavior

Learn the expert’s reward function and use it to generate the optimal policy
The second way is preferred because the reward function provides a much more parsimonious description of behavior. The reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task. Therefore, extracting the reward function of an expert would help design more robust agents.
In this part of the project, we will use an IRL algorithm to extract the reward function. We will use the optimal policy computed in the previous section as the expert behavior and use the algorithm to extract the reward function of the expert. Then, we will use the extracted reward function to compute the optimal policy of the agent. We will compare the optimal policy of the agent to the optimal policy of the expert and use a similarity metric between the two to measure the performance of the IRL algorithm.
4.1 IRL algorithm
For finite state spaces, there are a couple of IRL algorithms for extracting the reward function:
- Linear Programming (LP) formulation
- Maximum Entropy formulation
Since we covered the LP formulation in the lecture and it is the simplest IRL algorithm, we will use the LP formulation in this project. We will skip the derivation of the algorithm here (for details on the derivation, please refer to the lecture slides). The LP formulation of IRL is given by equation 1:

maximize_{R, t_i, u_i}   Σ_{i=1}^{|S|} (t_i − λ u_i)

subject to   [(P_{a_1}(i) − P_a(i)) (I − γ P_{a_1})^{−1} R] ≥ t_i,   ∀a ∈ A \ a_1, ∀i        (1)
             (P_{a_1} − P_a) (I − γ P_{a_1})^{−1} R ⪰ 0,             ∀a ∈ A \ a_1
             −u ⪯ R ⪯ u
             |R_i| ≤ R_max,   i = 1, 2, …, |S|

In the LP given by equation 1, R is the reward vector (R(i) = R(s_i)), P_a is the transition probability matrix for action a, λ is the adjustable penalty coefficient, and the t_i's and u_i's are extra optimization variables (please note that u(i) = u_i). Use the maximum absolute value of the ground truth reward as R_max. For ease of implementation, we can recast the LP in equation 1 into an equivalent form, given by equation 2, using block matrices.

maximize_x   c^T x

subject to   Dx ⪯ 0,   ∀a ∈ A \ a_1        (2)
Question 10: (10 points) Express c, x, and D in terms of R, P_a, P_{a_1}, t_i, u, λ, and R_max.
4.2 Performance measure
In this project, we use a very simple measure to evaluate the performance of the IRL algorithm. Before we state the performance measure, let's introduce some notation:
- O_A(s): optimal action of the agent at state s
- O_E(s): optimal action of the expert at state s

m(s) = 1 if O_A(s) = O_E(s), and m(s) = 0 otherwise

Then, with the above notation, accuracy is given by equation 3:

Accuracy = (Σ_{s ∈ S} m(s)) / |S|        (3)
Since we are using the optimal policy found in the previous section as the expert behavior, we will use that policy to fill in the O_E(s) values. Please note that these values will be different depending on whether we used Reward function 1 or Reward function 2 to create the environment.
To compute O_A(s), we will solve the linear program given by equation 2 to extract the reward function of the expert. For solving the linear program, you can use the LP solver in Python (from cvxopt import solvers and then use solvers.lp). Then, we will use the extracted reward function to compute the optimal policy of the agent using the value iteration algorithm you implemented in the previous section. The optimal policy of the agent found in this manner will be used to fill in the O_A(s) values. Please note that these values will depend on the adjustable penalty coefficient λ. We will tune λ to maximize the accuracy.
Question 11: (30 points) Sweep λ from 0 to 5 to get 500 evenly spaced values of λ. For each value of λ, compute O_A(s) by following the process described above. For this problem, use the optimal policy of the agent found in question 5 to fill in the O_E(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of λ. You need to repeat the above process for all 500 values of λ to get 500 data points. Plot λ (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.
Question 12: (5 points) Use the plot in question 11 to compute the value of λ for which the accuracy is maximum. For future reference, we will denote this value as λ^(1)_max. Please report λ^(1)_max.
Question 13: (15 points) For λ^(1)_max, generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is Reward function 1 and the extracted reward is computed by solving the linear program given by equation 2 with the penalty parameter set to λ^(1)_max. In this question, you should have 2 plots.
Question 14: (10 points) Use the extracted reward function computed in question 13 to compute the optimal values of the states in the 2D grid. For computing the optimal values, you need to use the optimal state-value function that you wrote in question 2. For visualization purposes, generate a heat map of the optimal state values across the 2D grid (similar to the figure generated in question 3). In this question, you should have 1 plot.
Question 15: (10 points) Compare the heat maps of Question 3 and Question 14 and provide a brief explanation of their similarities and differences.
Question 16: (10 points) Use the extracted reward function found in question 13 to compute the optimal policy of the agent. For computing the optimal policy of the agent, you need to use the function that you wrote in question 5. For visualization purposes, you should generate a figure similar to figure 1 but with the number of the state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 17: (10 points) Compare the figures of Question 5 and Question 16 and provide a brief explanation of their similarities and differences.
Question 18: (30 points) Sweep λ from 0 to 5 to get 500 evenly spaced values of λ. For each value of λ, compute O_A(s) by following the process described above. For this problem, use the optimal policy of the agent found in question 9 to fill in the O_E(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of λ. You need to repeat the above process for all 500 values of λ to get 500 data points. Plot λ (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.
Question 19: (5 points) Use the plot in question 18 to compute the value of λ for which the accuracy is maximum. For future reference, we will denote this value as λ^(2)_max. Please report λ^(2)_max.
Question 20: (15 points) For λ^(2)_max, generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is Reward function 2 and the extracted reward is computed by solving the linear program given by equation 2 with the penalty parameter set to λ^(2)_max. In this question, you should have 2 plots.
Question 21: (10 points) Use the extracted reward function computed in question 20 to compute the optimal values of the states in the 2D grid. For computing the optimal values, you need to use the optimal state-value function that you wrote in question 2. For visualization purposes, generate a heat map of the optimal state values across the 2D grid (similar to the figure generated in question 7). In this question, you should have 1 plot.
Question 22: (10 points) Compare the heat maps of Question 7 and Question 21 and provide a brief explanation of their similarities and differences.
Question 23: (10 points) Use the extracted reward function found in question 20 to compute the optimal policy of the agent. For computing the optimal policy of the agent, you need to use the function that you wrote in question 9. For visualization purposes, you should generate a figure similar to figure 1 but with the number of the state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 24: (10 points) Compare the figures of Question 9 and Question 23 and provide a brief explanation of their similarities and differences.
Question 25: (50 points) From the figure in question 23, you should observe that the optimal policy of the agent has two major discrepancies. Please identify and provide the causes of these two discrepancies. One of the discrepancies can be fixed easily by a slight modification to the value iteration algorithm. Perform this modification and rerun the modified value iteration algorithm to compute the optimal policy of the agent. Also, recompute the maximum accuracy after this modification. Is there a change in the maximum accuracy? The second discrepancy is harder to fix and is a limitation of the simple IRL algorithm. If you can provide a solution to the second discrepancy, then we will give you a bonus of 50 points.

Submission
Please submit a zip file containing your code and report to CCLE. The zip file should be named "Project2 UID1 … UIDn.zip", where UIDx are the student ID numbers of the team members.