Name: Solved--LSTM Gradient --- Homework 4-- Solution
SKU: 17373
Price: 30.00 USD
Availability: InStock

Description


Submission: You must submit your solutions as a PDF through MarkUs. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner) as long as it is readable.

Late Submission: MarkUs will remain open until 3 days after the deadline, after which no late submissions will be accepted. The late penalty is 10% per day, rounded up.

Weekly homeworks are individual work. See the Course Information handout¹ for detailed policies.

LSTM Gradient [4pts] Here, you’ll derive the Backprop Through Time (BPTT) equations for the univariate version of the Long Short-Term Memory (LSTM) architecture.

For reference, here are the computations it performs:

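$$i^{(t)} = \sigma(w_{ix} x^{(t)} + w_{ih} h^{(t-1)})$$

$$f^{(t)} = \sigma(w_{fx} x^{(t)} + w_{fh} h^{(t-1)})$$

$$o^{(t)} = \sigma(w_{ox} x^{(t)} + w_{oh} h^{(t-1)})$$

$$g^{(t)} = \tanh(w_{gx} x^{(t)} + w_{gh} h^{(t-1)})$$

$$c^{(t)} = f^{(t)} c^{(t-1)} + i^{(t)} g^{(t)}$$

$$h^{(t)} = o^{(t)} \tanh(c^{(t)})$$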
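The computations above can be sketched as a short scalar forward pass. This is only an illustration: it assumes the gate nonlinearity is the logistic sigmoid, that the initial hidden and cell states are zero, and the function name `lstm_forward` and the dictionary keys for the weights are hypothetical choices, not part of the assignment.

```python
import math


def sigmoid(z):
    """Logistic sigmoid (assumed to be the gate nonlinearity)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_forward(xs, w, h0=0.0, c0=0.0):
    """Run a univariate LSTM over the scalar inputs xs.

    w maps the (hypothetical) keys 'ix', 'ih', 'fx', 'fh', 'ox', 'oh',
    'gx', 'gh' to scalar weights. Returns the hidden states and cell
    states, one entry per time step.
    """
    h, c = h0, c0
    hs, cs = [], []
    for x in xs:
        i = sigmoid(w["ix"] * x + w["ih"] * h)    # input gate
        f = sigmoid(w["fx"] * x + w["fh"] * h)    # forget gate
        o = sigmoid(w["ox"] * x + w["oh"] * h)    # output gate
        g = math.tanh(w["gx"] * x + w["gh"] * h)  # candidate value
        c = f * c + i * g                         # cell state update
        h = o * math.tanh(c)                      # hidden activation
        hs.append(h)
        cs.append(c)
    return hs, cs
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every hidden activation this sketch returns has magnitude below 1.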

(a) [3pts] Derive the Backprop Through Time equations for the activations and the gates:

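$$\overline{h^{(t)}} =$$

$$\overline{c^{(t)}} =$$

$$\overline{g^{(t)}} =$$

$$\overline{o^{(t)}} =$$

$$\overline{f^{(t)}} =$$

$$\overline{i^{(t)}} =$$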

You don’t need to vectorize anything or factor out any repeated subexpressions.
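One way to fill in these blanks is sketched below; it is not an official solution. It writes a bar over a quantity for the derivative of the loss with respect to it, assumes the gate nonlinearity is the logistic sigmoid (so its derivative is the gate value times one minus the gate value), and assumes the loss reaches the hidden activation at step t only through the gates at step t+1. If the loss also reads the hidden activation directly, that direct derivative is added to the first equation.

```latex
\begin{align*}
\overline{h^{(t)}} &= \overline{i^{(t+1)}}\, i^{(t+1)} (1 - i^{(t+1)})\, w_{ih}
                    + \overline{f^{(t+1)}}\, f^{(t+1)} (1 - f^{(t+1)})\, w_{fh} \\
                   &\quad + \overline{o^{(t+1)}}\, o^{(t+1)} (1 - o^{(t+1)})\, w_{oh}
                    + \overline{g^{(t+1)}}\, \bigl(1 - (g^{(t+1)})^2\bigr)\, w_{gh} \\
\overline{c^{(t)}} &= \overline{h^{(t)}}\, o^{(t)} \bigl(1 - \tanh^2(c^{(t)})\bigr)
                    + \overline{c^{(t+1)}}\, f^{(t+1)} \\
\overline{g^{(t)}} &= \overline{c^{(t)}}\, i^{(t)} \\
\overline{o^{(t)}} &= \overline{h^{(t)}}\, \tanh(c^{(t)}) \\
\overline{f^{(t)}} &= \overline{c^{(t)}}\, c^{(t-1)} \\
\overline{i^{(t)}} &= \overline{c^{(t)}}\, g^{(t)}
\end{align*}
```

Each line follows from applying the chain rule to the single forward equation in which the quantity appears; only the hidden activation and the cell state feed into the next time step, which is why they alone carry terms indexed by t+1.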

(b) [1pt] Derive the BPTT equation for the weight w_ix:

$$\overline{w_{ix}} =$$

(The other weight matrices are basically the same, so we won’t make you write those out.)
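A derived BPTT equation can be sanity-checked numerically. The sketch below is one possible check, not the assignment's reference solution: it assumes a logistic sigmoid for the gates, zero initial states, and a hypothetical loss equal to the sum of all hidden activations, and it compares a hand-written backward pass for the weight w_ix against a central finite difference. All function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(xs, w):
    """Univariate LSTM forward pass; keeps per-step values for backprop."""
    h, c, steps = 0.0, 0.0, []
    for x in xs:
        i = sigmoid(w["ix"] * x + w["ih"] * h)
        f = sigmoid(w["fx"] * x + w["fh"] * h)
        o = sigmoid(w["ox"] * x + w["oh"] * h)
        g = math.tanh(w["gx"] * x + w["gh"] * h)
        c_prev, c = c, f * c + i * g
        h = o * math.tanh(c)
        steps.append(dict(x=x, i=i, f=f, o=o, g=g, c=c, c_prev=c_prev))
    return steps, h


def loss(xs, w):
    """Hypothetical loss: the sum of the hidden activations over time."""
    steps, _ = forward(xs, w)
    return sum(s["o"] * math.tanh(s["c"]) for s in steps)


def grad_w_ix(xs, w):
    """Backprop through time for dL/dw_ix under the loss above."""
    steps, _ = forward(xs, w)
    h_bar_rec = 0.0     # contribution to h_bar^(t) from the gates at t+1
    c_bar_next_f = 0.0  # c_bar^(t+1) * f^(t+1)
    w_ix_bar = 0.0
    for s in reversed(steps):
        h_bar = 1.0 + h_bar_rec  # 1.0 is dL/dh^(t) for the sum loss
        tanh_c = math.tanh(s["c"])
        c_bar = h_bar * s["o"] * (1.0 - tanh_c ** 2) + c_bar_next_f
        g_bar = c_bar * s["i"]
        o_bar = h_bar * tanh_c
        f_bar = c_bar * s["c_prev"]
        i_bar = c_bar * s["g"]
        # Error signals on the pre-activations of each gate.
        di = i_bar * s["i"] * (1.0 - s["i"])
        df = f_bar * s["f"] * (1.0 - s["f"])
        do = o_bar * s["o"] * (1.0 - s["o"])
        dg = g_bar * (1.0 - s["g"] ** 2)
        w_ix_bar += di * s["x"]
        h_bar_rec = di * w["ih"] + df * w["fh"] + do * w["oh"] + dg * w["gh"]
        c_bar_next_f = c_bar * s["f"]
    return w_ix_bar


def numeric_grad_w_ix(xs, w, eps=1e-6):
    """Central finite-difference estimate of dL/dw_ix."""
    w_plus, w_minus = dict(w), dict(w)
    w_plus["ix"] += eps
    w_minus["ix"] -= eps
    return (loss(xs, w_plus) - loss(xs, w_minus)) / (2.0 * eps)
```

Calling `grad_w_ix` and `numeric_grad_w_ix` on the same inputs and weights should give values that agree to several decimal places; a mismatch usually points to a missing term in the recurrence for the hidden or cell error signal.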

[optional, no points] Based on your answers above, explain why the gradient doesn’t explode if the values of the forget gates are very close to 1 and the values of the input and output gates are very close to 0. (Your answer should involve both h^(t) and c^(t).)

¹ http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/syllabus.pdf

CSC421/2516 Winter 2019 Homework 4

Multidimensional RNN [3pts] One of the predecessors to the PixelRNN architecture was the multidimensional RNN (MDRNN). This is like the RNNs we discussed in lecture, except that instead of a 1-D sequence, we have a 2-D grid structure. Analogously to how ordinary RNNs have an input vector and a hidden vector for every time step, MDRNNs have an input vector and hidden vector for every grid square. Each hidden unit receives bottom-up connections from the corresponding input square, as well as recurrent connections from its north and west neighbors as follows:

The activations are computed as follows:

$$h^{(i,j)} = \phi\left( W_{in}^\top x^{(i,j)} + W_W^\top h^{(i-1,j)} + W_N^\top h^{(i,j-1)} \right).$$

For simplicity, we assume there are no bias parameters. Suppose the grid is G × G, the input dimension is D, and the hidden dimension is H.

1. [1pt] How many weights does this architecture have? How many arithmetic operations are required to compute the hidden activations? (You only need to give big-O, not an exact count.)

2. [1pt] Suppose that in each step, you can compute as many matrix-vector multiplications as you like. How many steps are required to compute the hidden activations? Explain your answer.

3. [1pt] Give one advantage and one disadvantage of an MDRNN compared to a conv net.
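The dependency structure behind the first two questions can be sketched in a few lines. This is an illustration under stated assumptions rather than a model answer: it takes the input weight matrix to be D × H and the two recurrent matrices to be H × H, uses zero-based grid indices, and the names `count_weights` and `wavefront_schedule` are hypothetical.

```python
def count_weights(D, H):
    """Weights in W_in (D x H) plus W_W and W_N (H x H each), no biases."""
    return D * H + 2 * H * H


def wavefront_schedule(G):
    """Group the cells of a G x G grid into steps that can run in parallel.

    Cell (i, j) depends on (i - 1, j) and (i, j - 1), so every cell on the
    anti-diagonal i + j = k is ready once anti-diagonal k - 1 is finished.
    """
    return [
        [(i, k - i) for i in range(max(0, k - G + 1), min(k, G - 1) + 1)]
        for k in range(2 * G - 1)
    ]
```

For a 3 × 3 grid the schedule has five groups, from the single corner cell (0, 0) to the single corner cell (2, 2), which suggests that the number of sequential steps grows linearly in G even though the number of cells grows quadratically.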

Reversibility [3pts] In lecture, we discussed reversible generator architectures, which enable efficient maximum likelihood training. In this question, we consider another (perhaps surprising) example of a reversible operation: gradient descent with momentum. Suppose the parameter vector $\theta$ (and hence also the velocity vector $p$) are both D-dimensional. Recall that the updates are as follows:

$$p^{(k+1)} \leftarrow \beta p^{(k)} - \alpha \nabla J(\theta^{(k)})$$
$$\theta^{(k+1)} \leftarrow \theta^{(k)} + p^{(k+1)}$$

If we denote $s^{(k)} = (\theta^{(k)}, p^{(k)})$, then we can think of the above equations as defining a function

$$s^{(k+1)} = f(s^{(k)}).$$

[1pt] Show how to compute the inverse, $s^{(k)} = f^{-1}(s^{(k+1)})$.
[2pts] Find the determinant of the Jacobian, i.e.

$$\det \frac{\partial s^{(k+1)}}{\partial s^{(k)}}.$$

Hint: first write the Jacobian as a product of two matrices, one for each step of the above algorithm.
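The reversibility claim can be checked with a small round trip. The sketch below is illustrative only: it assumes a momentum coefficient `beta` (nonzero) and a learning rate `alpha` as the two constants in the update, uses a hypothetical two-dimensional quadratic cost so that the gradient is easy to write down, and the function names are not from the assignment.

```python
def grad_J(theta):
    """Gradient of a hypothetical cost J(theta) = 0.5 * sum(a_d * theta_d**2)."""
    a = [1.0, 3.0]  # arbitrary curvatures for a 2-D example
    return [a_d * t_d for a_d, t_d in zip(a, theta)]


def step(theta, p, alpha=0.1, beta=0.9):
    """One momentum update: velocity first, then parameters."""
    p_new = [beta * p_d - alpha * g_d for p_d, g_d in zip(p, grad_J(theta))]
    theta_new = [t_d + p_d for t_d, p_d in zip(theta, p_new)]
    return theta_new, p_new


def inverse_step(theta_new, p_new, alpha=0.1, beta=0.9):
    """Undo one momentum update (requires beta != 0).

    The parameter update is undone first, because the old parameters are
    needed to evaluate the gradient used in the velocity update.
    """
    theta = [t_d - p_d for t_d, p_d in zip(theta_new, p_new)]
    p = [(p_d + alpha * g_d) / beta for p_d, g_d in zip(p_new, grad_J(theta))]
    return theta, p
```

Applying `step` and then `inverse_step` to a state such as `([1.0, -2.0], [0.5, 0.25])` should recover the original parameters and velocity up to floating-point rounding. The inverse never needs to invert the gradient itself, only to re-evaluate it at the recovered parameters.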


Related products

Project One: Top of Pile Solution

Mario Level Generation Solution

[solved]Homework 6: Hero Agents-Solution

Project 4: GPU Programming Solution

Project 3A: Ray Tracing Spheres Solution