Instructions
Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named ‘Theory Assignment 3’ set up in which you can submit your files. Please be sure to follow all directions outlined here.
You can submit multiple times, but only the last submission counts. That means if you finish some problems and want to submit them first and update your submission later, that's fine. In fact, you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (remember Murphy's Law), you still get credit for whatever you have turned in.
Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submission must be a single PDF. It is strongly recommended that you typeset with LaTeX. There are many free integrated LaTeX editors that are convenient to use (e.g., Overleaf, ShareLaTeX). Choose the one you like the most. The tutorial Getting to Grips with LaTeX is a good place to start if you do not know how to use LaTeX yet.
Please also follow the rules below:

The file should be named firstname_lastname_USCID.pdf (e.g., Don_Quijote_de_la_Mancha_8675309045.pdf).

Do not have any spaces in your file name when uploading it.

Please include your name and USCID in the header of your report as well.
Collaboration: You may discuss with your classmates. However, you need to write your own solutions and submit separately. Also, in your report you need to list, for each problem, the classmates with whom you discussed it. Please consult the syllabus for what is and is not acceptable collaboration. Review the rules on academic conduct in the syllabus: a single instance of plagiarism can adversely affect you significantly more than you could stand to gain.
Notes on notation:

Unless stated otherwise, scalars are denoted by lowercase letters in normal font, vectors are denoted by lowercase letters in bold font, and matrices are denoted by capital letters in bold font.

‖·‖ denotes the L2 norm unless specified otherwise, i.e., ‖·‖ = ‖·‖_{2}.
Problem 1 Principal Component Analysis (25 points)
In this problem, we use proof by induction to show that the Mth principal component corresponds to the Mth eigenvector of X^{T}X sorted by eigenvalue from largest to smallest. Here X is the centered data matrix, and we denote the sorted eigenvalues as λ_{1} ≥ λ_{2} ≥ … ≥ λ_{d}. In the lecture, the result was proven for M = 1. Now suppose the result holds for a value M; you are going to show that it holds for M + 1. Note that the (M + 1)th principal component corresponds to the solution of the following optimization problem:

    max_{v}   v^{T} X^{T} X v                  (1)
    s.t.      ‖v‖_{2} = 1                      (2)
              v_{i}^{T} v = 0,  i = 1, …, M    (3)
where v_{i} is the ith principal component. Write down the Lagrangian of the optimization problem above, and show that the solution v_{M+1} is an eigenvector of X^{T}X. Then show that the quantity in (1) is maximized when v_{M+1} is the eigenvector with eigenvalue λ_{M+1}.
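As a quick numerical sanity check of the claim (a sketch in NumPy; the random data, dimensions, and variable names here are ours for illustration and are not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)  # center the data, as the problem assumes

# Eigendecomposition of X^T X; eigh returns eigenvalues in ascending
# order, so flip to sort from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# The first principal component attains the largest eigenvalue as the
# maximum of v^T X^T X v over unit vectors v.
v1 = eigvecs[:, 0]
print(np.isclose(v1 @ X.T @ X @ v1, eigvals[0]))  # True
```

The same check works for every subsequent column of `eigvecs`, each of which is orthogonal to the previous ones, mirroring constraints (2) and (3).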
Problem 2 Support Vector Regression (30 points)
In this problem, we derive an extension of the support vector machine to regression problems, called Support Vector Regression (SVR). Define the regressor f(x) = w^{T} φ(x) + b, and suppose we are given a dataset {(x_{n}, y_{n})}_{n=1}^{N} with y_{n} ∈ ℝ. Intuitively, we want to find a regressor that has a small weight w while also ensuring a small approximation error on {(x_{n}, y_{n})}_{n=1}^{N}. This intuition can be formulated as the following optimization problem:
    min_{w,b}  (1/2) ‖w‖^{2}
    s.t.       |w^{T} φ(x_{n}) + b − y_{n}| ≤ ε,  ∀n
For an arbitrary dataset, the ε-close constraint may not be feasible. Therefore, we optimize the “soft” version of the loss above:

    min_{w,b}  (1/2) ‖w‖^{2} + C Σ_{n=1}^{N} E_{ε}(y_{n} − f(x_{n}))    (4)
E_{ε} is the ε-insensitive error function, which gives zero error if the difference between the prediction and the ground truth is smaller than ε and incurs a linear penalty otherwise. It is defined as follows:

    E_{ε}(x) = 0,         if |x| ≤ ε
               |x| − ε,   if |x| > ε
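For concreteness, the ε-insensitive error function takes only a couple of lines (a sketch; the function name and NumPy usage are ours, not the assignment's):

```python
import numpy as np

def eps_insensitive(x, eps):
    """E_eps(x): zero inside the eps-tube, linear penalty |x| - eps outside."""
    return np.maximum(np.abs(x) - eps, 0.0)

print(eps_insensitive(0.3, 0.5))   # 0.0 (inside the tube, no penalty)
print(eps_insensitive(-1.2, 0.5))  # 0.7 (linear penalty |x| - eps)
```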
Question 1 Reformulate the unconstrained optimization problem in equation (4) as a constrained optimization problem by introducing slack variables for each data point. Hint: For each data point, introduce slack variables ξ_{n} ≥ 0 and ξ'_{n} ≥ 0 such that y_{n} − f(x_{n}) ≤ ε + ξ_{n} and f(x_{n}) − y_{n} ≤ ε + ξ'_{n}. Then replace E_{ε} with ξ_{n}, ξ'_{n}. (12 points)
Question 2 Write down the Lagrangian of the constrained optimization problem derived in Question 1, then minimize the Lagrangian by taking derivatives w.r.t. w, b, ξ_{n}, ξ'_{n}, setting the gradients to 0, and simplifying the expressions. Hint: there are no b, ξ_{n}, ξ'_{n} in the final expressions. (18 points)
Problem 3 (25 points)
Consider the dataset consisting of points (x, y), where x is a real value and y ∈ {−1, 1} is the class label. There are only three points: (x_{1}, y_{1}) = (0, −1), (x_{2}, y_{2}) = (π/2, 1), and (x_{3}, y_{3}) = (π, −1). Let the feature mapping be φ(x) = [cos x, sin x]^{T}, corresponding to the kernel function k(x, y) = cos(x − y).
Question 1 Write down the primal and dual formulations of SVM for this dataset in the transformed two-dimensional feature space based on φ(·). Note that we assume the data points are separable and set the hyperparameter C to +∞, which forces all slack variables (ξ) in the primal formulation to be 0 (and thus they can be removed from the optimization). (12 points)
Question 2 Next, solve the dual formulation. Based on that, derive the primal solution. (13 points)
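Before working the dual, it may help to confirm that the feature map really realizes the kernel via the identity cos(x − y) = cos x cos y + sin x sin y (our own sketch, not part of the assignment; it does not reveal the SVM solution):

```python
import numpy as np

def phi(x):
    # feature map phi(x) = [cos x, sin x]^T
    return np.array([np.cos(x), np.sin(x)])

def k(x, y):
    # kernel k(x, y) = cos(x - y)
    return np.cos(x - y)

# phi(x) . phi(y) agrees with k(x, y) at every pair of sample inputs.
for x in (0.0, np.pi / 2, np.pi):
    for y in (0.0, np.pi / 2, np.pi):
        assert np.isclose(phi(x) @ phi(y), k(x, y))
```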
Problem 4 Boosting (20 points)
Recall the procedure of the AdaBoost algorithm described in class:
Algorithm 1: AdaBoost

Given: a training set {(x_{n}, y_{n} ∈ {+1, −1})}_{n=1}^{N} and a set of classifiers H, where each h ∈ H takes a feature vector as input and outputs +1 or −1.
Goal: learn H(x) = sign(Σ_{t=1}^{T} β_{t} h_{t}(x)).
Initialization: D_{1}(n) = 1/N, ∀n ∈ [N].
for t = 1, 2, …, T do
    Find h_{t} = argmin_{h ∈ H} Σ_{n: y_{n} ≠ h(x_{n})} D_{t}(n).
    Compute ε_{t} = Σ_{n: y_{n} ≠ h_{t}(x_{n})} D_{t}(n) and β_{t} = (1/2) log((1 − ε_{t})/ε_{t}).
    Compute D_{t+1}(n) = D_{t}(n) e^{−β_{t} y_{n} h_{t}(x_{n})} / Σ_{n'=1}^{N} D_{t}(n') e^{−β_{t} y_{n'} h_{t}(x_{n'})} for each n ∈ [N].
end

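One round of the reweighting in Algorithm 1 can be sketched as a single function (a hedged sketch; the function name and the toy data are ours for illustration):

```python
import numpy as np

def adaboost_round(D, y, h_preds):
    """One AdaBoost reweighting step, following Algorithm 1."""
    eps = D[y != h_preds].sum()              # weighted error eps_t
    beta = 0.5 * np.log((1 - eps) / eps)     # beta_t = (1/2) log((1-eps)/eps)
    D_new = D * np.exp(-beta * y * h_preds)  # unnormalized D_{t+1}
    return D_new / D_new.sum(), beta, eps

# Toy example: 4 points, the chosen classifier errs only on the last one.
y = np.array([1, 1, -1, -1])
h = np.array([1, 1, -1, 1])
D_next, beta, eps = adaboost_round(np.full(4, 0.25), y, h)
```

After the update, the misclassified point carries more weight under D_{t+1} than any correctly classified point, which is what pushes the next round's classifier to focus on it.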
Question 1 We discussed in class that AdaBoost minimizes the exponential loss greedily. In particular, AdaBoost seeks the optimal β_{t} that minimizes

    ε_{t}(e^{β_{t}} − e^{−β_{t}}) + e^{−β_{t}},

where ε_{t} is the weighted classification error of h_{t} and is fixed. Show that

    β_{t} = (1/2) ln((1 − ε_{t})/ε_{t})

is the minimizer. (8 points)
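A numerical spot-check of the claimed minimizer is easy to run (our sketch; the particular value of ε_t is arbitrary and does not substitute for the proof):

```python
import numpy as np

eps_t = 0.3  # any fixed value in (0, 1)
loss = lambda b: eps_t * (np.exp(b) - np.exp(-b)) + np.exp(-b)

# Grid-search the loss over beta and compare against the closed form.
betas = np.linspace(-3.0, 3.0, 600001)
b_grid = betas[np.argmin(loss(betas))]
b_closed = 0.5 * np.log((1 - eps_t) / eps_t)
print(abs(b_grid - b_closed) < 1e-3)  # True
```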
Question 2 Recall that at round t of AdaBoost, a classifier h_{t} is obtained and the weighting over the training set is updated from D_{t} to D_{t+1}. Prove that h_{t} is only as good as random guessing in terms of classification error weighted by D_{t+1}; that is, show (12 points)

    Σ_{n: h_{t}(x_{n}) ≠ y_{n}} D_{t+1}(n) = 1/2.

Hint: you can ignore the denominator of D_{t+1}(n) to simplify the calculation.