[13 points] Q-Learning
(a) State the Bellman optimality principle as a function of the optimal Q-function Q∗(s, a), the expected reward function R(s, a, s′) and the transition probability P(s′ | s, a), where s is the current state, s′ is the next state and a is the action taken in state s.
Your answer:
(b) In case the transition probability P(s′ | s, a) and the expected reward R(s, a, s′) are unknown, a stochastic approach is used to approximate the optimal Q-function. After observing a transition of the form (s, a, r, s′), write down the update of the Q-function at the observed state-action pair (s, a) as a function of the learning rate α, the discount factor γ, the observed reward r, Q(s, a) and Q(s′, a′).
Your answer:
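For reference when checking an answer, the tabular update asked for in part (b) can be sketched in code. This is a minimal illustration, assuming a dict-of-dicts Q-table; the function and variable names are my own, not from the exam.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.5):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Q is a dict mapping state -> {action: value}."""
    td_target = r + gamma * max(Q[s_next].values())  # bootstrap from best next action
    Q[s][a] = Q[s][a] + alpha * (td_target - Q[s][a])
    return Q[s][a]
```

For example, starting from an all-zero table, observing (s=S1, a=a1, r=1, s′=S2) with α = γ = 0.5 yields Q(S1, a1) = 0.5 · (1 + 0.5 · 0 − 0) = 0.5.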
(c) What is the advantage of an epsilon-greedy strategy?
Your answer:
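To make the trade-off in part (c) concrete, here is an illustrative epsilon-greedy action selection: with probability ε a random action is explored, otherwise the currently greedy action is exploited. The names are assumptions for the sketch, not part of the question.

```python
import random

def epsilon_greedy(Q_row, epsilon, rng=random):
    """Pick an action given Q_row, a dict {action: value} for the current state."""
    if rng.random() < epsilon:
        return rng.choice(list(Q_row))       # explore: uniformly random action
    return max(Q_row, key=Q_row.get)         # exploit: greedy action
```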
(d) What is the advantage of using a replay memory?
Your answer:
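As context for part (d), a replay memory can be sketched as a bounded buffer of past transitions from which minibatches are sampled at random; the class below is a minimal illustration under that assumption, not the exam's required answer.

```python
import random
from collections import deque

class ReplayMemory:
    """Store (s, a, r, s') transitions and sample them i.i.d. for updates."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation of consecutive steps.
        return random.sample(self.buffer, batch_size)
```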
(e) Consider a system with two states S1 and S2 and two actions a1 and a2. You perform actions and observe the rewards and transitions listed below. Each step lists the current state, reward, action and resulting transition as: Si; R = r; ak: Si → Sj. Perform Q-learning with a learning rate of α = 0.5 and a discount factor of γ = 0.5 for each step by applying the formula from part (b). The Q-table entries are initialized to zero. Fill in the tables below corresponding to the following four transitions. What is the optimal policy after having observed the four transitions?
After transition 1:        After transition 2:
 Q  | S1 | S2               Q  | S1 | S2
 a1 |    |                  a1 |    |
 a2 |    |                  a2 |    |

After transition 3:        After transition 4:
 Q  | S1 | S2               Q  | S1 | S2
 a1 |    |                  a1 |    |
 a2 |    |                  a2 |    |
Your answer:
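The mechanics of part (e) can be sketched by applying the part (b) update repeatedly and then reading off the greedy policy. The four transitions of the original problem are not reproduced in this text, so the episode below is a hypothetical placeholder used only to show the procedure with α = γ = 0.5.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.5):
    # Tabular Q-learning update from part (b).
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])

# Q-table initialized to zero, as in the problem statement.
Q = {s: {a: 0.0 for a in ("a1", "a2")} for s in ("S1", "S2")}

# Hypothetical transitions (NOT the exam's), in the format Si; R = r; ak: Si -> Sj.
episode = [
    ("S1", "a1", 1.0, "S2"),  # hypothetical: S1; R = 1; a1: S1 -> S2
    ("S2", "a2", 0.0, "S1"),  # hypothetical: S2; R = 0; a2: S2 -> S1
]

for s, a, r, s_next in episode:
    q_update(Q, s, a, r, s_next)

# Greedy policy: in each state, pick the action with the largest Q-value.
policy = {s: max(row, key=row.get) for s, row in Q.items()}
```

With these placeholder transitions, the first update gives Q(S1, a1) = 0.5 · (1 + 0.5 · 0 − 0) = 0.5 and the second gives Q(S2, a2) = 0.5 · (0 + 0.5 · 0.5 − 0) = 0.125; the same bookkeeping applies to the exam's actual four transitions.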