[13 points] Q-Learning
(a) State the Bellman optimality principle as a function of the optimal Q-function Q∗(s, a), the expected reward function R(s, a, s′) and the transition probability P(s′ | s, a), where s is the current state, s′ is the next state and a is the action taken in state s.
Your answer:
(b) In case the transition probability P(s′ | s, a) and the expected reward R(s, a, s′) are unknown, a stochastic approach is used to approximate the optimal Q-function. After observing a transition of the form (s, a, r, s′), write down the update of the Q-function at the observed state-action pair (s, a) as a function of the learning rate α, the discount factor γ, the observed reward r, Q(s, a) and Q(s′, a′).
Your answer:
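For reference when checking an answer, the tabular update asked for in part (b) can be sketched in code. This is a minimal illustration, assuming a dict-of-dicts Q-table; the function and variable names are my own, not from the exam.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.5):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Q is a dict mapping state -> {action: value}."""
    td_target = r + gamma * max(Q[s_next].values())  # bootstrap from best next action
    Q[s][a] = Q[s][a] + alpha * (td_target - Q[s][a])
    return Q[s][a]
```

For example, starting from an all-zero table, observing (s=S1, a=a1, r=1, s′=S2) with α = γ = 0.5 yields Q(S1, a1) = 0.5 · (1 + 0.5 · 0 − 0) = 0.5.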
(c) What is the advantage of an epsilon-greedy strategy?
Your answer:
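To make the trade-off in part (c) concrete, here is an illustrative epsilon-greedy action selection: with probability ε a random action is explored, otherwise the currently greedy action is exploited. The names are assumptions for the sketch, not part of the question.

```python
import random

def epsilon_greedy(Q_row, epsilon, rng=random):
    """Pick an action given Q_row, a dict {action: value} for the current state."""
    if rng.random() < epsilon:
        return rng.choice(list(Q_row))       # explore: uniformly random action
    return max(Q_row, key=Q_row.get)         # exploit: greedy action
```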
(d) What is the advantage of using a replay memory?
Your answer:
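As context for part (d), a replay memory can be sketched as a bounded buffer of past transitions from which minibatches are sampled at random; the class below is a minimal illustration under that assumption, not the exam's required answer.

```python
import random
from collections import deque

class ReplayMemory:
    """Store (s, a, r, s') transitions and sample them i.i.d. for updates."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation of consecutive steps.
        return random.sample(self.buffer, batch_size)
```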
(e) Consider a system with two states S1 and S2 and two actions a1 and a2. You perform actions and observe the rewards and transitions listed below. Each step lists the current state, reward, action and resulting transition as: Si; R = r; ak: Si → Sj. Perform Q-learning with a learning rate of α = 0.5 and a discount factor of γ = 0.5 for each step by applying the formula from part (b). The Q-table entries are initialized to zero. Fill in the tables below corresponding to the following four transitions. What is the optimal policy after having observed the four transitions?
After transition 1:        After transition 2:
 Q  | S1 | S2               Q  | S1 | S2
 a1 |    |                  a1 |    |
 a2 |    |                  a2 |    |

After transition 3:        After transition 4:
 Q  | S1 | S2               Q  | S1 | S2
 a1 |    |                  a1 |    |
 a2 |    |                  a2 |    |
Your answer:
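The mechanics of part (e) can be sketched by applying the part (b) update repeatedly and then reading off the greedy policy. The four transitions of the original problem are not reproduced in this text, so the episode below is a hypothetical placeholder used only to show the procedure with α = γ = 0.5.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.5):
    # Tabular Q-learning update from part (b).
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])

# Q-table initialized to zero, as in the problem statement.
Q = {s: {a: 0.0 for a in ("a1", "a2")} for s in ("S1", "S2")}

# Hypothetical transitions (NOT the exam's), in the format Si; R = r; ak: Si -> Sj.
episode = [
    ("S1", "a1", 1.0, "S2"),  # hypothetical: S1; R = 1; a1: S1 -> S2
    ("S2", "a2", 0.0, "S1"),  # hypothetical: S2; R = 0; a2: S2 -> S1
]

for s, a, r, s_next in episode:
    q_update(Q, s, a, r, s_next)

# Greedy policy: in each state, pick the action with the largest Q-value.
policy = {s: max(row, key=row.get) for s, row in Q.items()}
```

With these placeholder transitions, the first update gives Q(S1, a1) = 0.5 · (1 + 0.5 · 0 − 0) = 0.5 and the second gives Q(S2, a2) = 0.5 · (0 + 0.5 · 0.5 − 0) = 0.125; the same bookkeeping applies to the exam's actual four transitions.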