\$35.00

## Description

• Introduction

Reinforcement Learning (RL) is the task of learning from interaction to achieve a goal. The learner and the decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the en-vironment. These interact continually, the agent selecting actions and the environment responding to those actions by presenting rewards and new states.

In the rst part of the project, we will learn the optimal policy of an agent navigating in a 2-D environment. We will implement the Value iteration algo-rithm to learn the optimal policy.

Inverse Reinforcement Learning (IRL) is the task of extracting an expert’s re-ward function by observing the optimal policy of the expert. In the second part of the project, we will explore the application of IRL in the context of apprenticeship learning.

• Reinforcement learning (RL)

The two main objects in Reinforcement learning are:

Agent

Environment

In this project, we will learn the optimal policy of a single agent navigating in a 2-D environment.

2.1 Environment

In this project, we assume that the environment of the agent is modeled by a Markov Decision Process (MDP). In a MDP, agents occupy a state of the environment and perform actions to change the state they are in. After taking an action, they are given some representation of the new state and some reward value associated with the new state.

An MDP formally is a tuple (S; A; Pssa0 ; Rass0 ; ) where:

S is a set of states

A is a set of actions

Pssa0 is a set of transition probabilities, where Pssa0 is the probability of transitioning from state s 2 S to state s0 2 S after taking action a 2 A

{ Pssa0 = P(st+1 = s0jst = s; at = a)

Given any current state and action, s and a, together with any next state, s0, the expected value of the next reward is Rass0

{ Rass0 = E(rt+1jst = s; at = a; st+1 = s0)

2 [0; 1) is the discount factor, and it is used to compute the present value of future reward

{ If is close to 1 then the future rewards are discounted less { If is close to 0 then the future rewards are discounted more

In the next few subsections, we will discuss the parameters that will be used to generate the environment for the project.

2.1.1 State space

In this project, we consider the state space to be a 2-D square grid with 100 states. The 2-D square grid along with the numbering of the states is shown in gure 1

Figure 1: 2-D square grid with state numbering

2.1.2 Action set

In this project, we consider the action set(A) to contain the 4 following actions:

Move Right Move Left

Move Up Move Down

The 4 types of actions are displayed in gure 2

Figure 2: 4 types of action

From the above gure, we can see that the agent can take 4 actions from the state marked with a dot.

2.1.3 Transition probabilities

In this project, we de ne the transition probabilities in the following manner:

1. If state s0 and s are not neighboring states in the 2-D grid, then

P(st+1 = s0jst = s; at = a) = 0

s0 and s are neighbors in the 2-D grid if you can move to s0 from s by taking an action a from the action set A. We will consider a state s to be a neighbor of itself. For example, from gure 1 we can observe that states 1 and 11 are neighbors (we can transition from 1 to 11 by moving right) but states 1 and 12 are not neighbors.

1. Each action corresponds to a movement in the intended direction with

probability 1 w, but has a probability of w of moving in a random direction instead due to wind. To illustrate this, let’s consider the states shown in gure 3

Figure 3: Inner grid states (Non-boundary states)

The transition probabilities for the non-boundary states shown in gure 3 are given below:

w

P(st+1 = 43jst = 44; at =”) = 1 w + 4

w

P(st+1 = 34jst = 44; at =”) = 4

w

P(st+1 = 54jst = 44; at =”) = 4

w

P(st+1 = 45jst = 44; at =”) = 4

From the above calculation it can be observed that if the agent is at a non-boundary state then it has 4 neighbors excluding itself and the probability w is uniformly distributed over the 4 neighbors. Also, if the agent is at a non-boundary state then it transitions to a new state after taking an action (P(st+1 = 44jst = 44; at =”) = 0)

1. If the agent is at one of the four corner states (0,9,90,99), the agent stays at the current state if it takes an action to move o the grid or is blown o the grid by wind. The actions can be divided into two categories:

Action to move o the grid Action to stay in the grid

To illustrate this, let’s consider the states shown in gure 4

Figure 4: Corner states

The transition probabilities for taking an action to move o the grid are given below:

w

P(st+1 = 10jst = 0; at =”) = 4

w

P(st+1 = 1jst = 0; at =”) = 4

w w

P(st+1 = 0jst = 0; at =”) = 1 w + +

4 4

The transition probabilities for taking an action to stay in the grid are given below:

w

P(st+1 = 10jst = 0; at =!) = 1 w + 4

w

P(st+1 = 1jst = 0; at =!) = 4

w w

P(st+1 = 0jst = 0; at =!) = +

4 4

At a corner state, you can be blown o the grid in two directions. As a result, we have P(st+1 = 0jst = 0; at =!) = w4 + w4 since we can be blown o the grid in two directions and in both the cases we stay at the current state.

1. If the agent is at one of the edge states, the agent stays at the current state if it takes an action to move o the grid or is blown o the grid by wind. The actions can be divided into two categories:

Action to move o the grid Action to stay in the grid

To illustrate this, let’s consider the states shown in gure 5

Figure 5: Edge states

The transition probabilities for taking an action to move o the grid are given below:

 P(st+1 = 0jst = 1; at = ) = w 4 P(st+1 = 11jst = 1; at = ) = w 4 P(st+1 = 2jst = 1; at = ) = w 4 w P(st+1 = 1jst = 1; at = ) = 1 w + 4

The transition probabilities for taking an action to stay in the grid are given below:

w

P(st+1 = 0jst = 1; at =”) = 1 w + 4

w

P(st+1 = 11jst = 1; at =”) = 4

w

P(st+1 = 2jst = 1; at =”) = 4

w

P(st+1 = 1jst = 1; at =”) = 4

At an edge state, you can be blown o the grid in one direction. As a result, we have P(st+1 = 1jst = 1; at =”) = w4 since we can be blown o the grid in one direction and in that case we stay at the current state.

The main di erence between a corner state and an edge state is that a corner state has 2 neighbors and an edge state has 3 neighbors.

2.1.4 Reward function

To simplify the project, we will assume that the reward function is independent of the current state (s) and the action that you take at the current state (a). To be speci c, reward function only depends on the state that you transition to (s0). With this simpli cation, we have

Rass0 = R(s0)

In this project, we will learn the optimal policy of an agent for two di erent reward functions:

Reward function 1 Reward function 2

The two di erent reward functions are displayed in gures 6 and 7 respectively

Figure 6: Reward function 1

Figure 7: Reward function 2

Question 1: (10 points) For visualization purpose, generate heat maps of Reward function 1 and Reward function 2. For the heat maps, make sure you display the coloring scale. You will have 2 plots for this question

For solving question 1, you might nd the following function useful:

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pcolor.html

• Optimal policy learning using RL algorithms

In this part of the project, we will use reinforcement learning (RL) algorithm to nd the optimal policy. The main steps in RL algorithm are:

Find optimal state-value or action-value

Use the optimal state-value or action-value to determine the deterministic optimal policy

There are a couple of RL algorithms, but we will use the Value iteration algo-rithm since it was discussed in detail in the lecture. We will skip the derivation of the algorithm here because it was covered in the lecture (for the derivation details please refer to the lecture slides on Reinforcement learning). We will just reproduce the algorithm below for the ease of implementation:

1. procedure Value Iteration(Pssa0 , Rass0 , S, A, ):

2: for all s 2 S do . Initialization

1. V (s) 0

1. end for

1. 1

 6: while > do . Estimation 7: 0 8: for all s 2 S do 9: v V (s); P 10: V (s) max a [ a + V (s )]; a2A s0 2S Pss0 Rss0 0
1. max( ; jv V (s)j);

1. end for

1. end while

 14: for all s 2 S do P a a . Computation (s) arg a2A s0 2S Pss0 [ Rss0 0 15: max + V (s )];
1. end for

1. end procedure return

Question 2: (40 points) Create the environment of the agent using the in-formation provided in section 2. To be speci c, create the MDP by setting up the state-space, action set, transition probabilities, discount factor, and reward function. For creating the environment, use the following set of parameters:

Number of states = 100 (state space is a 10 by 10 square grid as displayed in gure 1)

Number of actions = 4 (set of possible actions is displayed in gure 2) w = 0.1

Discount factor = 0.8 Reward function 1

After you have created the environment, then write an optimal state-value func-tion that takes as input the environment of the agent and outputs the optimal value of each state in the grid. For the optimal state-value function, you have to implement the Initialization (lines 2-4) and Estimation (lines 5-13) steps of the Value Iteration algorithm. For the estimation step, use = 0:01. For visu-alization purpose, you should generate a gure similar to that of gure 1 but with the number of state replaced by the optimal value of that state. In this question, you should have 1 plot.

Question 3: (5 points) Generate a heat map of the optimal state values across the 2-D grid. For generating the heat map, you can use the same function pro-vided in the hint earlier (see the hint after question 1).

Question 4: (15 points) Explain the distribution of the optimal state values

across the 2-D grid. (Hint: Use the gure generated in question 3 to explain)

Question 5: (30 points) Implement the computation step of the value iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2-D state-space. For visualization purpose, you should generate a gure similar to that of gure 1 but with the number of state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. Is it possible for the agent to compute the optimal action to take at each state by observing the optimal values of it’s neighboring states? In this question, you should have 1 plot.

Question 6: (10 points) Modify the environment of the agent by replacing Re-ward function 1 with Reward function 2. Use the optimal state-value function implemented in question 2 to compute the optimal value of each state in the grid. For visualization purpose, you should generate a gure similar to that of gure 1 but with the number of state replaced by the optimal value of that state. In this question, you should have 1 plot.

Question 7: (10 points) Generate a heat map of the optimal state values (found in question 6) across the 2-D grid. For generating the heat map, you can use the same function provided in the hint earlier.

Question 8: (20 points) Explain the distribution of the optimal state values

across the 2-D grid. (Hint: Use the gure generated in question 7 to explain)

Question 9: (20 points) Implement the computation step of the value iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2-D state-space. For visualization purpose, you should generate a gure similar to that of gure 1 but with the number of state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. In this question, you should have 1 plot.

• Inverse Reinforcement learning (IRL)

Inverse Reinforcement learning (IRL) is the task of learning an expert’s reward function by observing the optimal behavior of the expert. The motivation for IRL comes from apprenticeship learning. In apprenticeship learning, the goal of the agent is to learn a policy by observing the behavior of an expert. This task can be accomplished in two ways:

1. Learn the policy directly from expert behavior

1. Learn the expert’s reward function and use it to generate the optimal policy

The second way is preferred because the reward function provides a much more parsimonious description of behavior. Reward function, rather than the policy, is the most succinct, robust, and transferable de nition of the task. There-fore, extracting the reward function of an expert would help design more robust agents.

In this part of the project, we will use IRL algorithm to extract the reward function. We will use the optimal policy computed in the previous section as the expert behavior and use the algorithm to extract the reward function of the expert. Then, we will use the extracted reward function to compute the optimal policy of the agent. We will compare the optimal policy of the agent to the optimal policy of the expert and use some similarity metric between the two to measure the performance of the IRL algorithm.

4.1 IRL algorithm

For nite state spaces, there are a couple of IRL algorithms for extracting the reward function:

Linear Programming (LP) formulation Maximum Entropy formulation

Since we covered LP formulation in the lecture and it is the simplest IRL al-gorithm, so we will use the LP formulation in this project. We will skip the derivation of the algorithm (for details on the derivation please refer to the lecture slides) here. The LP formulation of the IRL is given by equation 1

 subject to [(Pa1 (i) Pa(i))(I P Pa1 ) 1R]ti; a a1; i maximize jSj R;ti;ui i=1(tiui) 8 2 A n 8 (1) (Pa1 Pa)(I Pa1 ) 1R 0; 8a 2 A n a1 u R u jRij Rmax; i = 1; 2; ; jSj

In the LP given by equation 1, R is the reward vector (R(i) = R(si)), Pa is the transition probability matrix, is the adjustable penalty coe cient, and ti‘s and ui‘s are the extra optimization variables (please note that u(i) = ui). Use the maximum absolute value of the ground truth reward as Rmax. For the

ease of implementation, we can recast the LP in equation 1 into an equivalent form given by equation 2 using block matrices.

 maximize cT x x (2) subject to Dx 0; 8a 2 A n a1

Question 10: (10 points) Express c; x; D in terms of R; Pa; Pa1 ; ti; u; and

Rmax

4.2 Performance measure

In this project, we use a very simple measure to evaluate the performance of the IRL algorithm. Before we state the performance measure, let’s introduce some notation:

OA(s): Optimal action of the agent at state s

OE(s): Optimal action of the expert at state s

(

1; OA(s) = OE(s)

m(s) =

Then with the above notation, accuracy is given by equation 3

 Accuracy = Ps2S m(s) (3) jSj

Since we are using the optimal policy found in the previous section as the expert behavior, so we will use the optimal policy found in the previous section to ll the OE(s) values. Please note that these values will be di erent depending on whether we used Reward Function 1 or Reward Function 2 to create the envi-ronment.

To compute OA(s), we will solve the linear program given by equation 2 to extract the reward function of the expert. For solving the linear program you can use the LP solver in python (from cvxopt import solvers and then use solvers.lp). Then, we will use the extracted reward function to compute the op-timal policy of the agent using the value iteration algorithm you implemented in the previous section. The optimal policy of the agent found in this manner will be used to ll the OA(s) values. Please note that these values will depend on the adjustable penalty coe cient . We will tune to maximize the accuracy.

Question 11: (30 points) Sweep from 0 to 5 to get 500 evenly spaced val-ues for . For each value of compute OA(s) by following the process described above. For this problem, use the optimal policy of the agent found in question 5 to ll in the OE(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of . You need to repeat the above process for all 500 values of to get 500 data points. Plot (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.

Question 12: (5 points) Use the plot in question 11 to compute the value of

for which accuracy is maximum. For future reference we will denote this value as (1)max. Please report (1)max

Question 13: (15 points) For (1)max, generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is the Reward function 1 and the extracted reward is computed by solving the linear program given by equation 2 with the parameter set to (1)max. In this question, you should have 2 plots.

Question 14: (10 points) Use the extracted reward function computed in ques-tion 13, to compute the optimal values of the states in the 2-D grid. For com-puting the optimal values you need to use the optimal state-value function that you wrote in question 2. For visualization purpose, generate a heat map of the optimal state values across the 2-D grid (similar to the gure generated in question 3). In this question, you should have 1 plot.

Question 15: (10 points) Compare the heat maps of Question 3 and Ques-tion 14 and provide a brief explanation on their similarities and di erences.

Question 16: (10 points) Use the extracted reward function found in question 13 to compute the optimal policy of the agent. For computing the optimal policy of the agent you need to use the function that you wrote in question 5. For vi-sualization purpose, you should generate a gure similar to that of gure 1 but with the number of state replaced by the optimal action at that state. The ac-tions should be displayed using arrows. In this question, you should have 1 plot.

Question 17: (10 points) Compare the gures of Question 5 and Question 16 and provide a brief explanation on their similarities and di erences.

Question 18: (30 points) Sweep from 0 to 5 to get 500 evenly spaced val-ues for . For each value of compute OA(s) by following the process described above. For this problem, use the optimal policy of the agent found in question 9 to ll in the OE(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of . You need to repeat the above process for all 500 values of to get 500 data points. Plot (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.

Question 19: (5 points) Use the plot in question 18 to compute the value of

for which accuracy is maximum. For future reference we will denote this value as (2)max. Please report (2)max

Question 20: (15 points) For (2)max, generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is the Reward function 2 and the extracted reward is computed by solving the linear program given by equation 2 with the parameter set to (2)max. In this question, you should have 2 plots.

Question 21: (10 points) Use the extracted reward function computed in ques-

tion 20, to compute the optimal values of the states in the 2-D grid. For com-puting the optimal values you need to use the optimal state-value function that you wrote in question 2. For visualization purpose, generate a heat map of the optimal state values across the 2-D grid (similar to the gure generated in question 7). In this question, you should have 1 plot.

Question 22: (10 points) Compare the heat maps of Question 7 and Ques-tion 21 and provide a brief explanation on their similarities and di erences.

Question 23: (10 points) Use the extracted reward function found in question 20 to compute the optimal policy of the agent. For computing the optimal policy of the agent you need to use the function that you wrote in question 9. For vi-sualization purpose, you should generate a gure similar to that of gure 1 but with the number of state replaced by the optimal action at that state. The ac-tions should be displayed using arrows. In this question, you should have 1 plot.

Question 24: (10 points) Compare the gures of Question 9 and Question 23 and provide a brief explanation on their similarities and di erences.

Question 25: (50 points) From the gure in question 23, you should observe that the optimal policy of the agent has two major discrepancies. Please iden-tify and provide the causes for these two discrepancies. One of the discrepancy can be xed easily by a slight modi cation to the value iteration algorithm. Perform this modi cation and re-run the modi ed value iteration algorithm to compute the optimal policy of the agent. Also, recompute the maximum ac-curacy after this modi cation. Is there a change in maximum accuracy? The second discrepancy is harder to x and is a limitation of the simple IRL algo-rithm. If you can provide a solution to the second discrepancy then we will give you a bonus of 50 points.

• Submission

Please submit a zip le containing your codes and report to CCLE. The zip le should be named as “Project2 UID1 … UIDn.zip” where UIDx are student ID numbers of team members.

13