Homework 6


Version 2

Instructions.

Homework is due Tuesday, April 30, at 11:59pm; no late homework accepted. Everyone must submit individually at gradescope under hw6 and hw6code.

The "written" submission at hw6 must be typed, and submitted in any format gradescope accepts (to be safe, submit a PDF). You may use LaTeX, markdown, google docs, MS word, whatever you like; but it must be typed!

When submitting at hw6, gradescope will ask you to mark out boxes around each of your answers; please do this precisely!

Please make sure your NetID is clear and large on the first page of the homework.

Your solution must be written in your own words. Please see the course webpage for full academic integrity information. Briefly, you may have high-level discussions with at most 3 classmates, whose NetIDs you should place on the first page of your solutions, and you should cite any external reference you use; despite all this, your solution must be written in your own words.

We reserve the right to reduce the auto-graded score for hw6code if we detect funny business (e.g., rather than implementing an algorithm, you keep re-submitting the assignment to the auto-grader, eventually completing a binary search for the answers).

There are no regrade requests on hw6code, which is the code auto-grader; however, you can re-submit and re-grade as many times as you like before the deadline! Start early and report any issues on piazza!

Methods and functions in the template and utility code include docstrings to describe the inputs and outputs. The autograder relies on correct implementations of these methods. Follow the docstrings to avoid failing tests.

1. k-means.

Recall the k-means problem with $n$ data points $S = (x_i)_{i=1}^n$ in $\mathbb{R}^d$, $k$ centers $(\mu_i)_{i=1}^k$ in $\mathbb{R}^d$, and corresponding clusters $(S_j)_{j=1}^k$. The objective of k-means is then to minimize the cost
\[
\phi_S(\mu_1, \ldots, \mu_k) = \sum_{i=1}^{n} \min_{j} \|x_i - \mu_j\|^2.
\]

In this problem you will develop an alternate formulation of this cost function in terms of pairwise distances.

(a) Let $S_z \subseteq S$ be the cluster induced by using a particular point $z \in \mathbb{R}^d$ as a center. The cost of using any point $z$ as a center is then given by
\[
\phi_{S_z}(z) = \sum_{x \in S_z} \|x - z\|^2.
\]

Let $\mu(S_z)$ be the sample mean of the cluster $S_z$. Prove that the cost $\phi_{S_z}(z)$ is equivalent to
\[
\phi_{S_z}(z) = \phi_{S_z}(\mu(S_z)) + |S_z| \, \|\mu(S_z) - z\|^2.
\]

(b) Show that
\[
\phi_{S_j}(\mu_j) = \frac{1}{2|S_j|} \sum_{a, b \in S_j} \|a - b\|^2.
\]

Conclude that solving the k-means problem is equivalent to solving
\[
\min_{S_1, \ldots, S_k} \sum_{j=1}^{k} \frac{1}{2|S_j|} \sum_{a, b \in S_j} \|a - b\|^2.
\]
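As a quick numerical sanity check of the identity in part (b) (a sketch only, not a substitute for the proof), the snippet below compares the two expressions on one randomly generated cluster; the variable names are illustrative and not from any course template.

```python
import numpy as np

rng = np.random.default_rng(0)
S_j = rng.normal(size=(12, 3))          # one cluster of 12 points in R^3
mu_j = S_j.mean(axis=0)                 # its sample mean

# Left-hand side: sum of squared distances to the cluster mean.
lhs = np.sum(np.linalg.norm(S_j - mu_j, axis=1) ** 2)

# Right-hand side: (1 / (2|S_j|)) * sum over all ordered pairs (a, b) of ||a - b||^2.
diffs = S_j[:, None, :] - S_j[None, :, :]
rhs = np.sum(np.linalg.norm(diffs, axis=2) ** 2) / (2 * len(S_j))

print(lhs, rhs)   # the two values should agree up to floating-point error
```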

Solution.

(Your solution here.)


2. Wasserstein Distance.

Consider two discrete distributions with weights $(\alpha_i)_{i=1}^n$ and $(\beta_j)_{j=1}^m$ on points $(x_i)_{i=1}^n$ and $(z_j)_{j=1}^m$. The Wasserstein distance between these two distributions (let's call them $\alpha$ and $\beta$) is
\[
W(\alpha, \beta) = \max_{\|f\|_{\mathrm{Lip}} \le 1} \sum_{i=1}^{n} \alpha_i f(x_i) - \sum_{j=1}^{m} \beta_j f(z_j).
\]

(a) Suppose $n = m$ and $\alpha_i = \beta_i = 1/n$, meaning both distributions are uniform. Show that for any permutation $\sigma$ of $(1, \ldots, n)$,
\[
W(\alpha, \beta) \le \max_{i} \|x_i - z_{\sigma(i)}\|.
\]
Note that this implies $W(\alpha, \beta) \le \min_{\sigma} \max_{i} \|x_i - z_{\sigma(i)}\|$.
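To build intuition for part (a) (a numeric illustration under the stated uniform-weight setup, not part of the required proof), the sketch below uses the known closed form for the 1-Wasserstein distance between uniform empirical distributions on the real line (match the samples in sorted order) and compares it to the permutation bound $\min_\sigma \max_i \|x_i - z_{\sigma(i)}\|$, computed by brute force for small $n$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 5
x = rng.normal(size=n)
z = rng.normal(size=n)

# 1-Wasserstein distance between two uniform empirical distributions on R:
# match the samples in sorted order (standard closed form in one dimension).
w1 = np.mean(np.abs(np.sort(x) - np.sort(z)))

# The bound from part (a): min over permutations sigma of max_i |x_i - z_sigma(i)|,
# found here by brute force (only feasible for small n).
bound = min(
    max(abs(x[i] - z[perm[i]]) for i in range(n))
    for perm in itertools.permutations(range(n))
)

print(w1, bound)   # expect w1 <= bound
```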

(b) Choose $((\alpha_i, x_i))_{i=1}^n$ and $((\beta_j, z_j))_{j=1}^m$ with $m = n$ so that
\[
0 < W(\alpha, \beta) = \min_{\sigma} \max_{i} \|x_i - z_{\sigma(i)}\|.
\]

(c) Choose $((\alpha_i, x_i))_{i=1}^n$ and $((\beta_j, z_j))_{j=1}^m$ with $m = n$ so that
\[
0 < W(\alpha, \beta) \le \frac{1}{100} \min_{\sigma} \max_{i} \|x_i - z_{\sigma(i)}\|.
\]

Solution. (Your solution here.)


3. Boosting.

In this problem we will consider boosting applied to interval classifiers on the real line. An interval classifier has the form $h(x) := \mathbf{1}[a \le x \le b]$; let $\mathcal{H}$ denote all such classifiers (meaning over all $a \le b$). Boosting therefore outputs a function of the form
\[
g(x) = \sum_{j=1}^{m} \alpha_j h_j(x) = \sum_{j=1}^{m} \alpha_j \mathbf{1}[a_j \le x \le b_j].
\]

For all parts of this problem let $(x_i, y_i)_{i=1}^n$ be a data set of $n$ points $x_i \in \mathbb{R}$ along with associated labels $y_i \in \{-1, +1\}$. Assume that the $x_i$ are in sorted order and distinct, meaning $x_i < x_{i+1}$.
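As a concrete rendering of this setup (an illustrative sketch only; the helper names are ours), the snippet below builds an interval classifier and a weighted boosted combination of two of them.

```python
import numpy as np

def interval_classifier(a, b):
    """Return h with h(x) = 1[a <= x <= b], acting on a 1-d array x."""
    return lambda x: ((a <= x) & (x <= b)).astype(float)

def boosted(alphas, intervals):
    """g(x) = sum_j alpha_j * 1[a_j <= x <= b_j]."""
    hs = [interval_classifier(a, b) for a, b in intervals]
    return lambda x: sum(alpha * h(x) for alpha, h in zip(alphas, hs))

# Tiny example: two weighted intervals evaluated at a few sorted points.
x = np.array([0.0, 1.0, 2.0, 3.0])
g = boosted([2.0, -1.0], [(0.5, 2.5), (1.5, 3.5)])
print(g(x))   # values 0, 2, 1, -1
```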

(a) Let $(q_1, \ldots, q_n)$ be any weights on the training set, meaning $q_i \ge 0$ and $\sum_i q_i = 1$. Show that
\[
\min_{h \in \mathcal{H}} \sum_{i=1}^{n} q_i \mathbf{1}[2h(x_i) - 1 \ne y_i] \le \min\Bigl( \sum_{\substack{i \in \{1,\ldots,n\} \\ y_i > 0}} q_i, \ \sum_{\substack{i \in \{1,\ldots,n\} \\ y_i < 0}} q_i \Bigr).
\]

Remark. This calculation is related to the "weak learning assumption" discussed in lecture. The only difference is that these predictors map to $\{0, 1\}$ rather than $\{-1, +1\}$.

(b) Show that
\[
\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[2h(x_i) - 1 \ne y_i] \le \frac{n - L}{n},
\]
where $L$ is the length of the longest contiguous subsequence of examples having the same label, meaning $y_j = y_{j+1} = \cdots = y_{j+L-1}$ for some $j$.
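As a numeric illustration of part (b) (a sketch with made-up data, not the proof), the code below computes the longest same-label run length $L$ and brute-forces the best interval classifier over candidate endpoints; the resulting error should be at most $(n - L)/n$.

```python
import numpy as np

def longest_run(y):
    """Length L of the longest contiguous block of equal labels."""
    best = cur = 1
    for i in range(1, len(y)):
        cur = cur + 1 if y[i] == y[i - 1] else 1
        best = max(best, cur)
    return best

def best_interval_error(x, y):
    """Brute-force min over intervals of the 0-1 error (candidates: all [x_i, x_j])."""
    n = len(x)
    candidates = [(x[i], x[j]) for i in range(n) for j in range(i, n)]
    candidates.append((x[-1] + 1.0, x[-1] + 2.0))   # an empty interval: predict -1 everywhere
    errs = []
    for a, b in candidates:
        pred = np.where((a <= x) & (x <= b), 1.0, -1.0)
        errs.append(np.mean(pred != y))
    return min(errs)

x = np.arange(8, dtype=float)
y = np.array([-1, -1, 1, 1, 1, -1, 1, -1], dtype=float)
L = longest_run(y)
print(best_interval_error(x, y), (len(x) - L) / len(x))   # first value should be <= second
```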

(c) Show that there exists an integer $m$, reals $(\alpha_1, \ldots, \alpha_m)$, and interval classifiers $(h_1, \ldots, h_m)$ with $h_j \in \mathcal{H}$ so that, for every $i$,
\[
y_i = \sum_{j=1}^{m} \alpha_j h_j(x_i).
\]
In other words, there exists a perfect boosted interval classifier.

Solution. (Your solution here.)


4. Variational Autoencoders.

In this problem you will implement a Variational Autoencoder (VAE) to model points sampled from an unknown distribution. This will be done by constructing an encoder network and a decoder network. The encoder network $f_{\mathrm{enc}} : X \subseteq \mathbb{R}^2 \to \mathbb{R}^h \times \mathbb{R}^h$ takes as input a point $x$ from the input space and outputs parameters $(\mu, v)$ where $v = \log \sigma^2$. The decoder network $f_{\mathrm{dec}} : \mathbb{R}^h \to \mathbb{R}^2$ takes as input a latent vector $z \sim \mathcal{N}(\mu, \sigma^2)$ and outputs an element $\hat{x} \in \mathbb{R}^2$ that we would hope is similar to members of the input space $X$. You will train this model by minimizing the (regularized) empirical risk
\[
\widehat{R}_{\mathrm{VAE}}(f) = \frac{1}{n} \sum_{i=1}^{n} \Bigl[ \ell\bigl(f_{\mathrm{dec}}(f_{\mathrm{enc}}(x_i)), x_i\bigr) + \lambda \, \mathrm{KL}\bigl(\mathcal{N}(\mu(x_i), \exp(v(x_i)/2)), \ \mathcal{N}(0, I)\bigr) \Bigr],
\]
where $\lambda$ is the KL-divergence scaling factor set in part (d).

=1

(a) Let $\Sigma = \mathrm{diag}(\sigma^2)$. In your written submission show that
\[
\mathrm{KL}\bigl(\mathcal{N}(\mu, \Sigma), \mathcal{N}(0, I)\bigr) = \frac{1}{2}\Bigl[ \|\mu\|_2^2 + \|\sigma\|_2^2 - h - \sum_{j=1}^{h} \log \sigma_j^2 \Bigr],
\]

where $\mathrm{KL}(p, q) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx$ is the KL divergence between two densities $p, q$. You may use the fact that the KL divergence between two $h$-dimensional normal distributions $\mathcal{N}(\mu_0, \Sigma_0)$ and $\mathcal{N}(\mu_1, \Sigma_1)$ is given by
\[
\mathrm{KL}\bigl(\mathcal{N}(\mu_0, \Sigma_0), \mathcal{N}(\mu_1, \Sigma_1)\bigr) = \frac{1}{2}\Bigl[ \mathrm{tr}\bigl(\Sigma_1^{-1}\Sigma_0\bigr) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - h + \ln \frac{|\Sigma_1|}{|\Sigma_0|} \Bigr].
\]
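For the implementation parts that follow, the closed form from part (a) is typically evaluated directly from the encoder outputs. The sketch below is one way to write it in PyTorch, assuming mu and logvar are the encoder's two $h$-dimensional outputs; the names are ours, not mandated by hw6.py.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))), N(0, I) ) via the closed form from part (a).

    mu, logvar: tensors of shape (batch, h); returns one KL value per row.
    """
    # 0.5 * ( ||mu||^2 + sum(sigma^2) - h - sum(log sigma^2) )
    return 0.5 * torch.sum(mu ** 2 + torch.exp(logvar) - 1.0 - logvar, dim=1)

# Sanity check: KL from N(0, I) to N(0, I) should be zero.
print(kl_to_standard_normal(torch.zeros(1, 6), torch.zeros(1, 6)))   # tensor([0.])
```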

(b) Use the empirical risk discussed above to implement a VAE in the class VAE. Use ReLU activations between each layer, except on the last layer of the decoder, where you should use a sigmoid. Use the Adam optimizer to optimize in the step() function. Make use of the PyTorch library for this: use torch.optim.Adam(); there is no need to implement it yourself. Please refer to the docstrings in hw6.py for more implementation details.
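A minimal sketch of how such a class could be organized is below. It is not the hw6.py template: the layer widths, the single-hidden-layer shape, the constructor arguments, and the step() signature are all assumptions for illustration, and the real interface is given by the docstrings in hw6.py.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Illustrative VAE sketch: 2-d inputs, h-d latent space (details are assumptions)."""

    def __init__(self, h=6, hidden=50, lr=0.01, lam=5e-5):
        super().__init__()
        self.h, self.lam = h, lam
        # Encoder: 2 -> hidden -> (mu, logvar), ReLU between layers.
        self.enc = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 2 * h))
        # Decoder: h -> hidden -> 2, ReLU between layers, sigmoid on the last layer.
        self.dec = nn.Sequential(nn.Linear(h, hidden), nn.ReLU(), nn.Linear(hidden, 2), nn.Sigmoid())
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

    def step(self, x, loss_fn):
        """One optimization step on the regularized empirical risk (signature assumed).

        loss_fn should return a per-example loss of shape (batch,), e.g. squared
        error summed over coordinates.
        """
        x_hat, mu, logvar = self(x)
        kl = 0.5 * torch.sum(mu ** 2 + torch.exp(logvar) - 1.0 - logvar, dim=1)
        risk = (loss_fn(x_hat, x) + self.lam * kl).mean()
        self.opt.zero_grad()
        risk.backward()
        self.opt.step()
        return risk.item()
```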

(c) Implement the fit function using the net.step() function from the VAE class. See the docstrings in hw6.py for more details.
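Similarly, here is a hedged sketch of fit under the step() interface assumed in the sketch above (full-batch updates, returning the scalar risk each iteration so it can be plotted in part (d)); the actual signature is fixed by the docstrings in hw6.py.

```python
def fit(net, X, n_iter=8000, loss_fn=lambda a, b: ((a - b) ** 2).sum(dim=1)):
    """Run n_iter full-batch steps and collect the risk history for plotting."""
    history = []
    for _ in range(n_iter):
        history.append(net.step(X, loss_fn))
    return history
```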

(d) Fit a VAE on the data generated by generate_data in hw6_utils.py. Use a learning rate of $0.01$, latent space dimension $h = 6$, KL-divergence scaling factor $\lambda = 5 \times 10^{-5}$, and train for 8000 iterations. Use least squares as the loss, that is, let $\ell(f(x), \hat{x}) = \|f(x) - \hat{x}\|_2^2$. Include separate plots of each of the following in your written submission:

  i. Your empirical risk $\widehat{R}_{\mathrm{VAE}}$ on the training data vs. iteration count;

  ii. The data points $(x_i)_{i=1}^n$ along with their encoded and decoded approximations $\hat{x}_i = f_{\mathrm{dec}}(f_{\mathrm{enc}}(x_i))$;

  iii. The data points $(x_i)_{i=1}^n$ along with their encoded and decoded approximations $\hat{x}_i$, and $n$ generated points $f_{\mathrm{dec}}(z)$ where $z \sim \mathcal{N}(0, I)$.

After you are done training, save your neural network to a file using torch.save(model.cpu().state_dict(), "vae.pb"). You will submit this file to the autograder with your code submission.

(e) What is the difference between $\hat{x}$ and $f_{\mathrm{dec}}(z)$ in general? Why are they different in the plots?

(f) Repeat part (d), except this time use the L1 loss, that is, let $\ell(f(x), \hat{x}) = \|f(x) - \hat{x}\|_1 = \sum_{j=1}^{2} |x_j - \hat{x}_j|$. Again, be sure to include the plots in your written submission.

(g) Fit a VAE with $\lambda \in \{1, 0.01, 0.001\}$ and the L1 loss on the same data again, but this time only produce plot i. from part (d). Discuss your results. Do you expect the VAE to model the true distribution better or worse as you increase $\lambda$? Out of all of the $\lambda$ values you tried, including $5 \times 10^{-5}$, which seems to give the right balance? Be sure to provide a brief justification for your choice.

Solution.


5. Naive Bayes (Extra credit!).

Let $X = (X_1, \ldots, X_d)$ be a vector of $d$ binary random variables whose distribution is labeled by a boolean function $f : \{0,1\}^d \to \{0,1\}$. Naive Bayes proceeds by forming estimates of various probabilities, and predicting with
\[
\hat{f}(x) = \arg\max_{y} \widehat{\Pr}(Y = y) \prod_{i=1}^{d} \widehat{\Pr}(X_i = x_i \mid Y = y).
\]
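To make the predictor concrete (an illustrative sketch only; the probability estimates here are computed exactly from an enumerated truth table rather than from samples, all names are ours, and the uniform distribution on $\{0,1\}^d$ is an assumption), the code below runs naive Bayes against the majority function from part (a).

```python
import itertools
import numpy as np

def naive_bayes_exact(f, d):
    """Naive Bayes with exact probabilities, assuming X uniform on {0,1}^d and Y = f(X)."""
    X = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
    y = np.array([f(x) for x in X])
    preds = []
    for x in X:
        scores = []
        for label in (0, 1):
            mask = y == label
            prior = mask.mean()
            # Product over coordinates of Pr(X_i = x_i | Y = label).
            cond = np.where(x == 1, X[mask].mean(axis=0), 1.0 - X[mask].mean(axis=0))
            scores.append(prior * np.prod(cond))
        preds.append(int(np.argmax(scores)))
    return np.mean(np.array(preds) == y)   # fraction classified correctly

def majority(x):
    return int(x.sum() >= len(x) / 2)

print(naive_bayes_exact(majority, d=3))   # part (a) predicts this should be 1.0
```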

(a) Suppose $f(x) = \mathbf{1}\bigl[\sum_{j=1}^{d} x_j \ge \frac{d}{2}\bigr]$, and that the various $\widehat{\Pr}$ estimates in $\hat{f}$ are exact. Show that the naive Bayes predictor $\hat{f}(x)$ classifies perfectly in this case. For this problem you can assume $d$ is odd.

Hint. Use symmetry arguments to make computing the probabilities easier.

(b) Under the same setup as part (a), construct a boolean function $f : \{0,1\}^3 \to \{0,1\}$ for which naive Bayes will be unable to correctly classify every binary vector $x \in \{0,1\}^3$. Be sure to verify that your construction works.

Solution. (Your solution here.)
