Homework Assignment 3


Starting point Your repository will now have a directory 'homework3/'. Please do not change the name of this repository or the names of any files we have added to it. Please perform a git pull to retrieve these files. You will find within it:

A Python script pegasos.py, which you will amend by adding code for the questions in Sect. 5.

A script pegasos.sh. You will use it to generate output files after you complete pegasos.py.

Dataset: mnist_subset.json

Submission Instructions The following will constitute your submission:

The Python script pegasos.py, amended with the code you added for Sect. 5. Be sure to commit your changes!

A PDF report named firstname_lastname_USCID.pdf, which contains your solutions to the questions in Sect. 1, Sect. 2, and Sect. 3. For your written report, your answers must be typeset with LaTeX and generated as a PDF file. No handwritten submission will be permitted. There are many free integrated LaTeX editors that are convenient to use; please Google search them and choose the one(s) you like the most. http://www.andy-roberts.net/writing/latex seems like a good tutorial.

A .json file pegasos.json, which will be the output of pegasos.sh.

Collaboration You may discuss with your classmates. However, you need to write your own solutions and submit them separately. Also, in your written report, you need to list whom you discussed each problem with. Please consult the syllabus for what is and is not acceptable collaboration.


Algorithmic component

1 Kernels

[Recommended maximum time spent: 1 hour]

Consider the following kernel function:

$$
k(x, x') = \begin{cases} 1, & \text{if } x = x' \\ 0, & \text{otherwise} \end{cases}, \qquad \forall\, x, x' \in \mathbb{R}^D. \tag{1}
$$
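As a quick sanity check (optional, and not a substitute for any proof you are asked to write), the sketch below builds the Gram matrix of this kernel on a few distinct points; the point values are made up purely for illustration:

```python
import numpy as np

# Hypothetical distinct points in R^2, chosen only for illustration.
X = np.array([[0.0, 1.0], [2.0, -1.0], [3.5, 0.5]])

# k(x, x') = 1 if x == x', 0 otherwise (eq. (1)).
def k(x, xp):
    return 1.0 if np.array_equal(x, xp) else 0.0

# Gram matrix on distinct points: the identity matrix,
# whose eigenvalues are all 1 (hence non-negative).
K = np.array([[k(x, xp) for xp in X] for x in X])
print(K)                          # identity matrix
print(np.linalg.eigvalsh(K))      # all ones
```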

Q1.1 Prove that this is a valid kernel. You can apply Mercer's theorem mentioned in the lectures (lec10.pdf) and assume that the $N$ points $\{x_1, \dots, x_N\}$ you pick are distinct (i.e., $x_i \neq x_j$ if $i \neq j$).

What to submit: No more than 10 lines of proof.

Q1.2 Suppose now you are given a training set $\{(x_n \in \mathbb{R}^D, y_n \in \mathbb{R})\}_{n=1}^N$ for a regression problem, where $x_i \neq x_j$ if $i \neq j$. Show that by using this kernel, training a kernel ridge regressor (lec10.pdf) with $\lambda = 0$ will always lead to a training objective of 0, meaning that all the training examples are predicted accurately by the learned regressor. The training objective of kernel ridge regression is as follows

$$
J(\alpha) = \frac{1}{2}\,\alpha^T K^T K \alpha \;-\; y^T K \alpha \;+\; \frac{\lambda}{2}\,\alpha^T K \alpha \;+\; \frac{1}{2}\, y^T y, \tag{2}
$$

where $K \in \mathbb{R}^{N \times N}$ is the kernel matrix, $y = [y_1, y_2, \dots, y_N]^T$, and $\alpha \in \mathbb{R}^N$. The learned regressor is

$$
f(x) = [k(x, x_1),\ k(x, x_2),\ \dots,\ k(x, x_N)]\, \alpha^*, \tag{3}
$$

where $\alpha^* = \arg\min_{\alpha} J(\alpha)$.

What to submit: No more than 5 lines of derivations, plus 1 line to show what $\alpha^*$ is.

Q1.3 Although the learned regressor can accurately predict the value of each training example, it does not generalize to the test data. Specifically, show that for any $x$ with $x \neq x_n,\ \forall n = 1, 2, \dots, N$, the predicted value is always 0.

What to submit: No more than 5 lines of derivations.

2 Support Vector Machines

[Recommended maximum time spent: 2 hours]


Consider the dataset consisting of points $(x, y)$, where $x$ is a real value and $y \in \{-1, 1\}$ is the class label. Let's start with three points $(x_1, y_1) = (-1, -1)$, $(x_3, y_3) = (0, 1)$, $(x_2, y_2) = (1, -1)$.

Figure 1: Three data points considered in Q2

Q2.1 Can the three points shown in Figure 1, in their current one-dimensional feature space, be perfectly separated with a linear separator? Why or why not?

What to submit: No more than 3 lines of reasoning.

Q2.2 Now we define a simple feature mapping $\phi(x) = [x, x^2]^T$ to transform the three points from the one-dimensional to a two-dimensional feature space. Plot the transformed points in the new two-dimensional feature space (use any package you prefer for the plot, e.g., Matplotlib, PowerPoint). Is there a linear decision boundary that can separate the points in this new feature space? Why or why not?

What to submit: The plot and no more than 3 lines of reasoning.
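If you use Matplotlib, a minimal plotting sketch is given below. It assumes the reconstructed points $x \in \{-1, 0, 1\}$ with the middle point labeled $+1$ and the outer points labeled $-1$; verify the values and labels against Figure 1 before relying on it:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed points and labels; check them against Figure 1.
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-1, +1, -1])

# Feature mapping phi(x) = [x, x^2]^T.
phi = np.stack([x, x**2], axis=1)

for label, marker in [(+1, 'o'), (-1, 'x')]:
    pts = phi[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], marker=marker, label=f'y = {label:+d}')

plt.xlabel('$x$')
plt.ylabel('$x^2$')
plt.legend()
plt.savefig('q22_transformed_points.png')
```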

Q2.3 Given the feature mapping $\phi(x) = [x, x^2]^T$, write down the kernel function $k(x, x')$. Moreover, write down the $3 \times 3$ kernel (or Gram) matrix $K$ based on $k(x_i, x_j)$ of the three data points. Verify that $K$ is a positive semi-definite (PSD) matrix. You may want to show this by the definition of a PSD matrix: a symmetric $N \times N$ real matrix $M$ is said to be positive semi-definite if the scalar $z^T M z$ is non-negative for every non-zero column vector $z$ of $N$ real numbers.

What to submit: The form of kernel function k, the kernel matrix K, and no more than 10 lines of proof to show that K is PSD.
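As a numeric sanity check (it does not replace the requested proof), you can form $K$ from $\phi(x) = [x, x^2]^T$ and inspect its eigenvalues; the point values below are the assumed reconstruction $x \in \{-1, 0, 1\}$:

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])       # assumed point values from Figure 1
phi = np.stack([x, x**2], axis=1)    # phi(x) = [x, x^2]^T

# k(x, x') = phi(x)^T phi(x'), so K = Phi Phi^T.
K = phi @ phi.T
print(K)
print(np.linalg.eigvalsh(K))         # eigenvalues are >= 0, so K is PSD
```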

Q2.4 Now you are going to learn a hard margin SVM classifier on this dataset with $k(x_i, x_j)$. Recall the primal and dual formulations you learned in lecture 11 (lec11.pdf). Write down the primal and dual formulations of this problem.

What to submit: Primal and dual formulations. Each with no more than 5 lines.

Q2.5 Next, you are going to solve this problem using its dual formulation. Recall the toy example we walked through in lecture 12 (lec12.pdf). You may want to borrow a similar symmetric property to simplify the dual formulation.

What to submit: Solve the dual formulation and report the vector $\alpha \in \mathbb{R}^3$, the weight vector $w \in \mathbb{R}^2$, and the bias $b \in \mathbb{R}$. No more than 10 lines of derivation.


Q2.6 Let $\hat{y} = w^T \phi(x) + b$, where $w$ and $b$ are the weights and the bias you got from the previous question. Draw the decision boundary in the new two-dimensional feature space and circle all support vectors (set $\hat{y} = 0$ to get the decision boundary). Then, draw the decision boundary in the original one-dimensional setting.

What to submit: Two plots: one in the new two-dimensional feature space, and the other in the original one-dimensional feature space.

3 Adaboost for building up a nonlinear classifier

[Recommended maximum time spent: 2 hours]

In the lectures (lec13.pdf), you learned an algorithm, Adaboost, which constructs a strong (binary) classifier by iteratively adding one weak (binary) classifier to it. A weak classifier is learned to maximize the weighted training accuracy at the corresponding iteration, and the weight of each training example is updated after every iteration. Algorithm 1 summarizes the Adaboost algorithm.

Given

$\mathcal{H}$: a set of functions, where each $h \in \mathcal{H}$ takes a $D$-dimensional vector as input and outputs either $+1$ or $-1$.

A training set $\{(x_n \in \mathbb{R}^D,\ y_n \in \{+1, -1\})\}_{n=1}^N$.

Goal Learn $F(x) = \operatorname{sign}\big[\sum_{t=1}^{T} \beta_t f_t(x)\big]$, where $f_t \in \mathcal{H}$ and $\beta_t \in \mathbb{R}$.

Initialization $w_1(n) = \frac{1}{N},\ \forall n \in \{1, 2, \dots, N\}$.

for $t = 1, 2, \dots, T$ do

(a) find $f_t = \arg\min_{h \in \mathcal{H}} \sum_n w_t(n)\, \mathbb{I}[y_n \neq h(x_n)]$

(b) compute $\epsilon_t = \sum_n w_t(n)\, \mathbb{I}[y_n \neq f_t(x_n)]$

(c) compute $\beta_t = \frac{1}{2}\log\frac{1 - \epsilon_t}{\epsilon_t}$

(d) compute $w_{t+1}(n) = \begin{cases} w_t(n)\exp(-\beta_t), & \text{if } y_n = f_t(x_n) \\ w_t(n)\exp(\beta_t), & \text{otherwise} \end{cases},\ \forall n \in \{1, 2, \dots, N\}$

(e) normalize $w_{t+1}(n) \leftarrow \dfrac{w_{t+1}(n)}{\sum_{n'} w_{t+1}(n')},\ \forall n \in \{1, 2, \dots, N\}$

end

Algorithm 1: The Adaboost algorithm
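For reference, here is a compact Python sketch of Algorithm 1 over a generic finite hypothesis set. The helper names (for example, weak_learners as a list of callables) are illustrative and are not part of the provided code:

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Run Algorithm 1 for T rounds.

    X: (N, D) array, y: (N,) array of +/-1 labels,
    weak_learners: list of callables h(X) -> (N,) array of +/-1.
    Returns the chosen classifiers f_t and their weights beta_t.
    """
    N = len(y)
    w = np.full(N, 1.0 / N)                      # initialization: w_1(n) = 1/N
    fs, betas = [], []
    for t in range(T):
        # (a) pick the weak learner with the smallest weighted error
        errors = [np.sum(w * (h(X) != y)) for h in weak_learners]
        f = weak_learners[int(np.argmin(errors))]
        # (b) weighted error of the chosen learner
        eps = np.sum(w * (f(X) != y))
        # (c) vote weight
        beta = 0.5 * np.log((1 - eps) / eps)
        # (d) reweight: shrink correct examples, grow incorrect ones
        w = w * np.exp(np.where(f(X) == y, -beta, beta))
        # (e) normalize
        w = w / w.sum()
        fs.append(f)
        betas.append(beta)
    return fs, betas

def predict(X, fs, betas):
    return np.sign(sum(b * f(X) for f, b in zip(fs, betas)))
```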


Figure 2: The two training sets considered in Q3: (A) for Q3.1, Q3.2, and (B) for Q3.3, Q3.4, Q3.5, Q3.6.

In this question, you are going to experiment with the learning process of Adaboost, with decision stumps as $\mathcal{H}$. A decision stump $h \in \mathcal{H}$ is a classifier characterized by a triplet $(s \in \{+1, -1\},\ b \in \mathbb{R},\ d \in \{1, 2, \dots, D\})$ such that

$$
h_{(s,b,d)}(x) = \begin{cases} s, & \text{if } x_d > b, \\ -s, & \text{otherwise.} \end{cases} \tag{4}
$$

That is, each decision stump function only looks at a single dimension/entry $x_d$ of the input vector $x$, and checks whether $x_d$ is larger than $b$ ($s$ decides which label to give when $x_d > b$).

Specifically, you are given a simple 4-example training set (as illustrated in Fig. 2 (A)):

$$
\left(x_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, y_1 = +1\right),\ \left(x_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, y_2 = -1\right),\ \left(x_3 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, y_3 = +1\right),\ \left(x_4 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}, y_4 = -1\right)
$$

For simplicity, let’s consider

H = fh(s;b;d)

j s

2 f+1;

1g; b 2 f 2;

0:5; 0:5; 2g; d 2 f1; 2gg:

(5)

That is, H contains only 16 distinct decision stump functions (8 horizontal boundaries and 8 vertical boundaries).
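To make step (a) concrete for this toy problem, one possible way (a sketch, not required code) to represent $h_{(s,b,d)}$ from eq. (4) and enumerate the 16 stumps of eq. (5) is shown below; these callables can be plugged into the Adaboost sketch above as weak_learners. The training set values follow the reconstruction given above:

```python
import numpy as np
from itertools import product

def make_stump(s, b, d):
    """h_(s,b,d)(x) = s if x_d > b, -s otherwise (d is 1-indexed as in eq. (4))."""
    def h(X):
        return np.where(X[:, d - 1] > b, s, -s)
    h.triplet = (s, b, d)
    return h

# The 16 stumps of eq. (5).
H = [make_stump(s, b, d)
     for s, b, d in product([+1, -1], [-2, -0.5, 0.5, 2], [1, 2])]

# Training set of Fig. 2 (A), as reconstructed above.
X = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
y = np.array([+1, -1, +1, -1])

# Weighted errors at t = 1 (uniform weights): useful when working on Q3.1.
w = np.full(4, 0.25)
for h in H:
    print(h.triplet, np.sum(w * (h(X) != y)))
```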

Q3.1 Let’s get started to run Adaboost for T = 3. At t = 1, please nd the best decision stump f1 (i.e., solve step (a) in Algorithm 1). If there are multiple equally best stump functions, just randomly pick ONE of them to be f1. What are the corresponding 1 and 1 based on f1?

What to submit: Write down the triplet $(s, b, d)$ of $f_1$, and the values of $\epsilon_1$ and $\beta_1$. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).


Q3.2 From your results in Q3.1 and Algorithm 1, you will observe something interesting about Adaboost. That is, if $\beta_t = 0$, the corresponding iteration $t$ does not contribute anything to the final decision function $F(x)$. This might be fine, since we can move on to the next iteration and probably get $\beta_{t+1} \neq 0$ there. Let's have a try. Please write down $w_2(n),\ \forall n \in \{1, 2, \dots, N\}$, and check the objective function to solve at $t = 2$ (i.e., solve step (a) in Algorithm 1). You will observe that the objective function at $t = 2$ remains exactly the SAME as at $t = 1$. The Adaboost algorithm thus will likely get stuck after iteration $t = 1$.

What to submit: Write down $w_2(n),\ \forall n \in \{1, 2, 3, 4\}$. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).

Q3.3 From the observations in Q3.2, it seems like Adaboost is not really learning something strong. We argue that it is because of the form of decision stumps: each of them can only draw a horizontal or vertical decision boundary in our 2-dimensional case. Can we do any simple trick to solve this issue?

Let’s try one trick that you have learned, feature transform. Speci cally, we apply a simple

linear transformation for every example

2

p

p

3

2

2

2

2

x

6

p

p

7

x;

(6)

6

2

2

7

6

7

6

7

2

2

4

5

which results in a transformed training set

0

p

0

p

(x1 =

; y1

= +1); (x2 =

2

; y2 =

1); (x3 =

; y3

(x4 =

; y4 = 1):

p

p

= +1);

02

2

0

2
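A short optional check that the rotation in eq. (6) indeed maps the four points of Fig. 2 (A) to the coordinates listed above:

```python
import numpy as np

R = (np.sqrt(2) / 2) * np.array([[1, -1],
                                 [1,  1]])   # the matrix in eq. (6)
X = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
print(X @ R.T)   # rows: [0, sqrt(2)], [sqrt(2), 0], [0, -sqrt(2)], [-sqrt(2), 0]
```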

From now on, let’s only consider this new training set (as illustrated in Fig. 2 (B)), but use the same H de ned in eq. (5).

Let’s get started again to run Adaboost for T = 3, using the new training set. At t = 1, please nd the best decision stump f1 (i.e., solve step (a) in Algorithm 1). If there are multiple equally best stump functions, just randomly pick ONE of them to be f1. What are the corresponding 1 and 1 based on f1?

What to submit: Write down the triplet $(s, b, d)$ of $f_1$, and the values of $\epsilon_1$ and $\beta_1$. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).

Q3.4 Now you should see that Adaboost does learn something at $t = 1$ (i.e., $\beta_1 \neq 0$). Let's move on to $t = 2$. Please first write down $w_2(n),\ \forall n \in \{1, 2, 3, 4\}$ for each example. Then, please find the best decision stump $f_2$ (i.e., solve step (a) in Algorithm 1). If there are multiple equally best stump functions, just randomly pick ONE of them to be $f_2$. What are the corresponding $\epsilon_2$ and $\beta_2$ based on $f_2$?

What to submit: Write down $w_2(n),\ \forall n \in \{1, 2, 3, 4\}$, the triplet $(s, b, d)$ of $f_2$, and the values of $\epsilon_2$ and $\beta_2$. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).


Q3.5 Let’s move on to t = 3. Please rst write down w3(n); 8n 2 f1; 2; 3; 4g for each example. Then, please nd the best decision stump f3 (i.e., solve step (a) in Algorithm 1). If there are multiple equally best stump functions, just randomly pick ONE of them to be f3. What are the corresponding 3 and 3 based on f3?

What to submit: Write down $w_3(n),\ \forall n \in \{1, 2, 3, 4\}$, the triplet $(s, b, d)$ of $f_3$, and the values of $\epsilon_3$ and $\beta_3$. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).

Q3.6 Please combine your results of Q3.3, Q3.4, and Q3.5, and write down your learned $F(x)$ in terms of $h_{(s,b,d)}$, $\beta_1$, $\beta_2$, and $\beta_3$. You should plug VALUES into $s, b, d, \beta_1, \beta_2, \beta_3$. For example, $F(x) = \operatorname{sign}[0.5\, h_{(+1,\,0.5,\,1)}(x) + 0.5\, h_{(-1,\,0.5,\,1)}(x) + 0.5\, h_{(-1,\,2,\,2)}(x)]$. Then please write down the predicted labels of $x_1, x_2, x_3, x_4$ using $F(x)$. How many training examples are correctly labeled by $F(x)$?

What to submit: Write down $F(x)$, compute $F(x_1), F(x_2), F(x_3), F(x_4)$, and report the number of correctly labeled examples. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).

Q3.7 From Q3.6, you should see that Adaboost with decision stumps can solve a problem similar to XOR in 3 iterations, by using a simple linear transformation (more precisely, a 2-dimensional rotation). Note that the decision boundary of $F(x)$ is nonlinear, even though each $h \in \mathcal{H}$ can only produce a linear boundary. Also note that, even with the same feature transform, a linear classifier such as logistic regression cannot correctly label every training example in our case.

What to submit: Nothing.

Programming component

4 High-level descriptions

4.1 Dataset

We will use mnist_subset (images of handwritten digits from 0 to 9). This is the same subset of the full MNIST that we used for Homework 1 and Homework 2. As before, the dataset is stored in a JSON-formatted file mnist_subset.json. You can access its training, validation, and test splits using the keys 'train', 'valid', and 'test', respectively. For example, suppose we load mnist_subset.json into the variable x. Then, x['train'] refers to the training set of mnist_subset. This set is a list with two elements: x['train'][0] contains the features of size N (samples) x D (feature dimension), and x['train'][1] contains the corresponding labels of size N.
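For example, the splits can be loaded as follows (a sketch only; the loading is already handled for you in the provided code, and the file name mnist_subset.json is assumed):

```python
import json
import numpy as np

with open('mnist_subset.json', 'r') as f:
    data = json.load(f)

Xtrain = np.array(data['train'][0])   # N x D feature matrix
ytrain = np.array(data['train'][1])   # N labels
Xvalid, yvalid = np.array(data['valid'][0]), np.array(data['valid'][1])
Xtest, ytest = np.array(data['test'][0]), np.array(data['test'][1])
print(Xtrain.shape, ytrain.shape)
```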

4.2 Tasks

You will be asked to implement the linear support vector machine (SVM) for binary classification (Sect. 5). Specifically, you will


finish implementing the following three functions in pegasos.py: objective_function, pegasos_train, and pegasos_test. Refer to pegasos.py and Sect. 5 for more information.

Run the script pegasos.sh after you finish your implementation. This will output pegasos.json. Add, commit, and push (1) pegasos.py and (2) the pegasos.json file that you have created.

As in the previous homework, you are not responsible for loading/pre-processing data; we have done that for you (e.g., we have pre-processed the data to have two classes: digits 0-4 and digits 5-9). For specific instructions, please refer to the text in Sect. 5 and the instructions in pegasos.py.

4.3 Cautions

Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and generated output files.

5 Pegasos: a stochastic gradient based solver for linear SVM

In this question, you will build a linear SVM classifier using the Pegasos algorithm [1]. Given a training set $\{(x_n \in \mathbb{R}^D,\ y_n \in \{1, -1\})\}_{n=1}^N$, the primal formulation of linear SVM is as follows

$$
\min_{w}\ \ \frac{\lambda}{2}\, \|w\|_2^2 \;+\; \frac{1}{N} \sum_{n} \max\{0,\ 1 - y_n w^T x_n\}. \tag{7}
$$

Instead of turning it into the dual formulation, we are going to solve the primal formulation directly with a gradient-based algorithm. Note that here we include the bias term $b$ in the parameter $w$ by appending a constant 1 to $x$.
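A generic NumPy sketch of eq. (7) is given below for reference. It does not necessarily match the exact interface required by objective_function in pegasos.py, so follow the script's instructions for your submission:

```python
import numpy as np

def svm_primal_objective(w, X, y, lam):
    """Eq. (7): (lam/2) ||w||^2 + (1/N) sum_n max{0, 1 - y_n w^T x_n}.

    X is assumed to already carry a constant-1 column so that the bias
    is folded into w, as described above.
    """
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```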

In (batch) gradient descent, at each parameter update we compute the gradients for all data points and take their average (or sum). When the training set is large (i.e., N is a large number), this is often too computationally expensive (for example, the whole dataset may not fit in memory). Stochastic gradient descent with mini-batches alleviates this issue by computing the gradient on only a subset of the data at each iteration.

One key issue with using (stochastic) gradient descent to solve eq. (7) is that $\max\{0, z\}$ is not differentiable at $z = 0$. You have seen this issue in Homework 2, where we used the Heaviside function to deal with it. In this question, you are going to learn and implement Pegasos, a representative solver of eq. (7) that applies stochastic gradient descent with mini-batches and explicitly takes care of the non-differentiability.

The pseudocode of Pegasos is given in Algorithm 2. At the $t$-th iteration of the parameter update, we first take a mini-batch of data $A_t$ of size $K$ [step (a)], and define $A_t^+ \subseteq A_t$ to be the subset of samples for which $w_t$ suffers a non-zero loss [step (b)]. Next we set the learning rate $\eta_t = 1/(\lambda t)$ [step (c)]. We then perform a two-step parameter update as follows. We first compute $(1 - \eta_t \lambda)\, w_t$ and, for all samples $(x, y) \in A_t^+$, we add the vector $\frac{\eta_t}{K}\, y x$ to $(1 - \eta_t \lambda)\, w_t$. We denote the resulting vector by $w_{t+\frac{1}{2}}$ [step (d)]. This step can also be written as $w_{t+\frac{1}{2}} = w_t - \eta_t \nabla_t$, where

$$
\nabla_t = \lambda w_t - \frac{1}{|A_t|} \sum_{(x, y) \in A_t^+} y\, x.
$$

The definition of the hinge loss, together with the Heaviside function, implies that $\nabla_t$ is the gradient of the objective function on the mini-batch $A_t$ at $w_t$. Last, we set $w_{t+1}$ to be the projection of $w_{t+\frac{1}{2}}$ onto the set

$$
B = \{w : \|w\| \leq 1/\sqrt{\lambda}\}.
$$

This is obtained by scaling $w_{t+\frac{1}{2}}$ by $\min\Big\{1,\ \frac{1/\sqrt{\lambda}}{\|w_{t+\frac{1}{2}}\|}\Big\}$ [step (e)]. For details of the Pegasos algorithm you may refer to the original paper [1].

Input A training set $S = \{(x_n \in \mathbb{R}^D,\ y_n \in \{1, -1\})\}_{n=1}^N$, the total number of iterations $T$, the batch size $K$, and the regularization parameter $\lambda$.

Output The last weight vector $w_{T+1}$.

Initialization Choose $w_1$ s.t. $\|w_1\| \leq 1/\sqrt{\lambda}$.

for $t = 1, 2, \dots, T$ do

(a) Randomly choose a subset of data $A_t \subseteq S$, where $|A_t| = K$

(b) Set $A_t^+ = \{(x, y) \in A_t : y\, w_t^T x < 1\}$

(c) Set $\eta_t = \frac{1}{\lambda t}$

(d) Set $w_{t+\frac{1}{2}} = (1 - \eta_t \lambda)\, w_t + \frac{\eta_t}{K} \sum_{(x, y) \in A_t^+} y\, x$

(e) Set $w_{t+1} = \min\Big\{1,\ \frac{1/\sqrt{\lambda}}{\|w_{t+\frac{1}{2}}\|}\Big\}\, w_{t+\frac{1}{2}}$

end

Algorithm 2: The Pegasos algorithm
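For intuition, here is a minimal NumPy sketch of Algorithm 2. It is only an illustration under the assumptions noted in the comments; the function you submit must follow the interface and instructions in pegasos.py:

```python
import numpy as np

def pegasos_sketch(X, y, lam, T, K, seed=0):
    """X: (N, D) with the bias column appended, y: (N,) in {+1, -1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)                                  # satisfies ||w_1|| <= 1/sqrt(lam)
    for t in range(1, T + 1):
        idx = rng.choice(N, size=K, replace=False)   # (a) mini-batch A_t
        Xb, yb = X[idx], y[idx]
        plus = yb * (Xb @ w) < 1                     # (b) A_t^+: non-zero loss
        eta = 1.0 / (lam * t)                        # (c) learning rate
        w = (1 - eta * lam) * w + (eta / K) * (yb[plus] @ Xb[plus])  # (d)
        norm = np.linalg.norm(w)                     # (e) project onto B
        if norm > 0:
            w *= min(1.0, (1.0 / np.sqrt(lam)) / norm)
    return w
```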

Now you are going to implement Pegasos and train a binary linear SVM classifier on the dataset mnist_subset.json. You are going to implement three functions, objective_function, pegasos_train, and pegasos_test, in a script named pegasos.py. You will find detailed programming instructions in the script.


Q5.1 Finish the implementation of the function objective_function, which corresponds to the objective function in the primal formulation of SVM, eq. (7).

Q5.2 Finish the implementation of the function pegasos_train, which corresponds to Algorithm 2.

Q5.3 After you train your model, run your classifier on the test set and report the accuracy, which is defined as:

$$
\text{accuracy} = \frac{\#\ \text{of correctly classified test samples}}{\#\ \text{of test samples}}
$$

Finish the implementation of the function pegasos_test.

Q5.4 After you complete the above steps, run pegasos.sh, which will run the Pegasos algorithm for 500 iterations with 6 settings (mini-batch size $K = 100$ with different $\lambda \in \{0.01, 0.1, 1\}$, and $\lambda = 0.1$ with different $K \in \{1, 10, 1000\}$), and output a pegasos.json that records the test accuracy and the value of the objective function at each iteration during training.

What to do and submit: run the script pegasos.sh. It will generate pegasos.json. Add, commit, and push both pegasos.py and pegasos.json before the due date.

Q5.5 Based on pegasos.json, you are encouraged to make plots of the objective function value versus the number of iterations (i.e., a convergence curve) in the different settings of $\lambda$ and $K$, but you are not required to submit these plots.
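If you do want to plot, something along these lines can work. It assumes (this is only a guess; inspect the generated file first) that pegasos.json maps each setting name to a list of per-iteration objective values, so adapt the key and field names to the actual layout:

```python
import json
import matplotlib.pyplot as plt

with open('pegasos.json', 'r') as f:
    results = json.load(f)

# Assumed structure: {setting_name: [objective value at each iteration, ...]}.
# Inspect pegasos.json and adapt this loop to its actual layout.
for name, objectives in results.items():
    plt.plot(range(1, len(objectives) + 1), objectives, label=name)

plt.xlabel('iteration')
plt.ylabel('objective function value')
plt.legend()
plt.savefig('pegasos_convergence.png')
```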

References

[1] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 2011, Volume 127, Number 1, Page 3.
