Description

1. [13 points] Q-Learning

 

(a) State the Bellman optimality principle as a function of the optimal Q-function Q*(s, a), the expected reward function R(s, a, s') and the transition probability P(s' | s, a), where s is the current state, s' is the next state and a is the action taken in state s.
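For reference, one standard way of writing this relation with the quantities named above (not a required form for the answer) is:

Q*(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]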

Your answer:

(b) In case the transition probability P(s' | s, a) and the expected reward R(s, a, s') are unknown, a stochastic approach is used to approximate the optimal Q-function. After observing a transition of the form (s, a, r, s'), write down the update of the Q-function at the observed state-action pair (s, a) as a function of the learning rate α, the discount factor γ, Q(s, a) and Q(s', a').
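As an illustration only (a minimal sketch; the dictionary-based Q-table and the function name q_update are assumptions made for this example, not part of the assignment), the tabular update is commonly implemented like this:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.5):
    # One tabular Q-learning step for the observed transition (s, a, r, s_next):
    #   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s_next, a') - Q(s, a))
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```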

 

Your answer:

(c) What is the advantage of an epsilon-greedy strategy?
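For context, a minimal sketch of epsilon-greedy action selection (the names Q, actions and epsilon are assumptions made for this example, not part of the required answer):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Explore with probability epsilon (uniform random action),
    # otherwise exploit the current estimate with argmax_a Q(s, a).
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])
```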

Your answer:

(d) What is the advantage of using a replay memory?
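As a rough illustration (a minimal sketch assuming uniform sampling; not part of the required answer), a replay memory is essentially a bounded buffer of past transitions from which minibatches are drawn at random:

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        # Bounded FIFO buffer: once full, the oldest transitions are discarded.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive transitions
        # and lets each stored experience be reused for several updates.
        return random.sample(self.buffer, batch_size)
```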

Your answer:

 

(e) Consider a system with two states S1 and S2 and two actions a1 and a2. You perform actions and observe the rewards and transitions listed below. Each step lists the current state, reward, action and resulting transition as: Si; R = r; ak : Si → Sj. Perform Q-learning using a learning rate of α = 0.5 and a discount factor of γ = 0.5 for each step by applying the formula from part (b). The Q-table entries are initialized to zero. Fill in the tables below corresponding to the following four transitions (a small sketch for checking the arithmetic follows the tables). What is the optimal policy after having observed the four transitions?

i.   S1; R = −10;  a1 : S1 → S1
ii.  S1; R = −10;  a2 : S1 → S2
iii. S2; R = 18.5; a1 : S2 → S1
iv.  S1; R = −10;  a2 : S1 → S2

 

After step i:
Q    S1   S2
a1   .    .
a2   .    .

After step ii:
Q    S1   S2
a1   .    .
a2   .    .

After step iii:
Q    S1   S2
a1   .    .
a2   .    .

After step iv:
Q    S1   S2
a1   .    .
a2   .    .
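For checking the arithmetic by hand, here is a small self-contained sketch (state and action names encoded as strings, an assumption made for this example) that replays the four observed transitions with α = γ = 0.5 starting from a zero-initialized table:

```python
alpha, gamma = 0.5, 0.5
states, actions = ["S1", "S2"], ["a1", "a2"]
Q = {(s, a): 0.0 for s in states for a in actions}  # Q-table initialized to zero

# The four observed transitions (s, a, r, s_next) from steps i-iv.
transitions = [
    ("S1", "a1", -10.0, "S1"),   # i
    ("S1", "a2", -10.0, "S2"),   # ii
    ("S2", "a1", 18.5, "S1"),    # iii
    ("S1", "a2", -10.0, "S2"),   # iv
]

for step, (s, a, r, s_next) in enumerate(transitions, start=1):
    # Update from part (b): Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    print(f"after step {step}: {Q}")

# The greedy policy reads off argmax_a Q(s, a) in each state.
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in states})
```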

 

Your answer: