\$30.00

Category:

## Description

Instructions

Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named ‘Theory Assignment 3’ set up in which you can submit your files. Please be sure to follow all directions outlined here.

You can submit multiple times, but only the last submission counts. That means if you finish some prob-lems and want to submit something first and update later when you finish, that’s fine. In fact you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (re-member Murphy’s Law), you still get credit for whatever you have turned in.

Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submis-sion must be a single PDF. It is strongly recommended that you typeset with LATEX. There are many free integrated LATEX editors that are convenient to use (e.g Overleaf, ShareLaTeX). Choose the one(s) you like the most. This tutorial Getting to Grips with LaTeX is a good start if you do not know how to use LATEX yet.

• The file should be named as firstname lastname USCID.pdf e.g., Don Quijote de la Mancha 8675309045.pdf).

Collaboration: You may discuss with your classmates. However, you need to write your own solutions and submit separately. Also in your report, you need to list with whom you have discussed for each problem. Please consult the syllabus for what is and is not acceptable collaboration. Review the rules on academic conduct in the syllabus: a single instance of plagiarism can adversely affect you significantly more than you could stand to gain.

Notes on notation:

• Unless stated otherwise, scalars are denoted by small letter in normal font, vectors are denoted by small letters in bold font and matrices are denoted by capital letters in bold font.

• k.k means L2-norm unless specified otherwise i.e. k.k = k.k2

1

Problem 1 Principle Component Analysis (25 points)

In this problem, we use proof by induction to show that the M-th principle component corresponds to the M-th eigenvector of XT X sorted by the eigenvalue from largest to smallest. Here X is the centered data matrix and we denote the sorted eigenvalues as l1 l2ld. In the lecture, the results was proven for M = 1. Now suppose the result holds for a value M, and you are going to show that it holds for M + 1. Note that the M + 1 principle component corresponds to the solution of the following optimization problem:

 max vT XT Xv (1) v kvk2 = 1 s.t. (2) vT v = 0, i = 1, …, M (3) i

where vi is the i-th principle component. Write down the Lagrangian of the optimization problem above, and show that the solution vM+1 is an eigenvector of XT X. Then show that the quantity in (1) is maximized when the vM+ 1 is the eigenvector with eigenvalue lM+1.

2

Problem 2 Support Vector Regression (30 points)

In this problem, we derive an extension of support vector machine to regression problem, called Support Vector Regression (SVR). Define the regressor f (x) = wT f(x) + b, and given a dataset f(xn, yn)gnN=1, yn 2 R. Intuitively, we want to find a regressor that has small weight w and also ensure small approximation error to f(xn, y n)gnN=1. The intuition can be formulated as the following optimization problem:

min 1 kwk2

w,b 2 2

s.t. jwT f(xn) + b ynj e

For an arbitrary dataset, the e-close constraint may not be feasible, Therefore, we optimize the “soft” version of the loss above:

 1 2 N min + C å Ee ( yn f ( xn )) (4) w,b 2 kwk2 n=1

Ee is the e-insensitive error function which gives zero error if the difference between prediction and ground truth is smaller than e and incurs linear penalty otherwise. It is defined as follow:

 e ( x e jx j > e E (x) = 0 x j e j j j

Question 1 Reformulate the unconstrained optimization problem in equation 4 as a constraint optimiza-tion problem by introducing slack variables for each data points. Hint: For each data point, introduce slack

 variables xn n 0 such that n yn f (xn) n 0, x0 e x0 e + xn. Then replace Ee with xn, x0 . (12 points)

Question 2 Write down the Lagrangian of the constrained optimization derived in Question 1, then mini-

 mize the Lagrangian by taking derivative w.r.t w, b, xn, xn0 and set the gradient to 0, and simplify expressions. Hint: there are no b, xn, xn0 in the final expressions. (18 points)

3

 Problem 3 Support Vector Machine (25 points) ( ) is a real value, and y 2 f 1, 1 g is the class label. Consider the dataset consisting of points x, y , where x p There are only three points (x1, y1) = (0, 1), (x2, y2) = ( , 1), (x3, y3) = (p, 1). Let the feature mapping 2 f(x) = [cos x, sin x]T, corresponding to the kernel function k(x, y) = cos(x y).

Question 1 Write down the primal and dual formulations of SVM for this dataset in the transformed two-dimensional feature space based on f( ). Note that we assume the data points are separable and set the hyperparameter C to be +¥, which forces all slack variables (x) in the primal formulation to be 0 (and thus

can be removed from the optimization). (12 points)

Question 2 Next, solve the dual formulation. Based on that, derive the primal solution. (13 points)

4

Problem 4 Boosting (20 points)

Recall the procedure of AdaBoost algorithm described in class:

• Given: A training set f(xn, yn 2 f+1, 1g)gnN=1, and a set of classifier H, where each h 2 H takes a

feature vector as input and outputs +1 or 1.

2 Goal: Learn H(x) = sign åtT=1 btht(x)

• Initialization: D1(n) = N1 , 8n 2 [N].

4 for t = 1, 2,, T do

• Find ht = arg minh2H ån:yn 6=h(xn ) Dt(n).

• Compute

 et = å Dt(n) and bt = 1 log 1 et . 2 n:yn 6=ht (xn ) et 7Compute Dt+1 (n) = Dt(n)e bt yn ht (xn ) N Dt(n0)e b y n0 h (x n0 ) for each n 2 [N] ån0=1 t t

Question 1 We discussed in class that AdaBoost minimizes the exponential loss greedily. In particular, Adaboost seeks the optimal bt that minimizes

 et(ebt e bt ) + e bt where e is the weighted classification error of ht and is fixed. Show that b = 1 ln 1 et (8 points)t 2 et

is the minimizer.

Question 2 Recall that at round t of AdaBoost, a classifier ht is obtained and the weighting over the training set is updated from Dt to Dt+1. Prove that ht is only as good as random guessing in terms of

classification error weighted by Dt+1. That is (12 points)

 å Dt+1 (n) = 1 . 2 n:ht (xn )6=yn Hint: you can somehow ignore the denominator of Dt+1 (n) to simplify calculation.

5