Instructions
Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named Homework 2 set up in which you can submit your files. Please be sure to follow all directions outlined here.
You can submit multiple times, but only the last submission counts. That means if you finish some problems and want to submit something first and update later when you finish, that’s fine. In fact you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (remember Murphy’s Law), you still get credit for whatever you have turned in.
Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submission must be a single PDF. It is strongly recommended that you typeset with LaTeX. There are many free online LaTeX editors that are convenient to use (e.g., Overleaf). You can also use an offline editor such as TeXShop.
Please also follow the rules below:
The file should be named Firstname_Lastname_USCID.pdf, e.g., Jeff_Dean_8675309045.pdf. Do not have any spaces in your file name when uploading it.
Please include your name and USCID in the header of your report as well.
Collaboration: You may discuss with your classmates. However, you need to write your own solutions and submit separately. Also in your written report, you need to list with whom you have discussed for each problem. Please consult the syllabus for what is and is not acceptable collaboration.
Note on notation: Unless stated otherwise, scalars are denoted by lowercase letters in normal font, vectors by lowercase letters in bold font, and matrices by capital letters in bold font.
Problem 1 Convergence of Perceptron Algorithm (30 points)
In this problem you need to show that when the two classes are linearly separable, the perceptron algorithm will converge. Specifically, consider a binary classification dataset of N data points, where every x_{i} has a corresponding label y_{i} ∈ {−1, +1} and is normalized: ‖x_{i}‖ = √(x_{i}^{T} x_{i}) = 1 for all i ∈ {1, 2, . . . , N}. The perceptron algorithm proceeds as below:
while not converged do
    Pick a data point (x_{i}, y_{i}) randomly
    Make a prediction y = sign(w^{T} x_{i}) using the current w
    if y ≠ y_{i} then
        w ← w + y_{i} x_{i}
    end
end
Algorithm 1: Perceptron
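Algorithm 1 can be sketched in a few lines of code. The sketch below cycles through the data deterministically rather than sampling randomly (the convergence argument is unaffected by the visiting order), and the dataset is an illustrative two-point example, not part of the assignment:

```python
# Minimal sketch of Algorithm 1 (perceptron) on a toy dataset.
# Data points are normalized and linearly separable, as the problem assumes.
def sign(v):
    return 1 if v >= 0 else -1

def perceptron(data, max_epochs=100):
    """data: list of (x, y) with x a list of floats and y in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim                    # w_0 = 0, as in Problem 1.3
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            pred = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if pred != y:              # update only when a mistake is made
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:              # converged: a full pass with no mistakes
            return w
    return w

# Two normalized, linearly separable points
data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
w = perceptron(data)
print(w)  # a separating w, e.g. [1.0, 0.0]
```

Note that the update w ← w + y_{i} x_{i} is applied only inside the mistake branch, exactly as in the algorithm box above.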
In other words, weights are updated right after the perceptron makes a mistake (weights remain unchanged if the perceptron makes no mistakes). Let the (classification) margin for a hyperplane w be

    γ(w) = min_{i∈[N]} |w^{T} x_{i}| / ‖w‖

(convince yourself that γ(w) is the smallest distance of any data point from the hyperplane).
Let w_{opt} be the optimal hyperplane, i.e., the one that linearly separates the classes with maximum margin. Note that since the data is linearly separable, some w_{opt} always exists. Let γ = γ(w_{opt}).
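The margin definition is easy to check numerically. The snippet below computes γ(w) = min_{i} |w^{T} x_{i}| / ‖w‖ for a made-up w and dataset (both are illustrative, not from the assignment):

```python
# Compute the classification margin gamma(w) = min_i |w^T x_i| / ||w||
# for a toy set of normalized points and an arbitrary hyperplane w.
import math

def margin(w, xs):
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x))) for x in xs) / norm_w

xs = [[1.0, 0.0], [0.6, 0.8]]   # normalized data points
w = [1.0, 1.0]                  # illustrative hyperplane (not w_opt)
print(margin(w, xs))            # smallest distance of any point to the hyperplane
```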
Following the steps below, you will show that the perceptron algorithm makes a finite number of mistakes, at most γ^{−2}, and therefore the algorithm must converge.
1.1 Show that if the algorithm makes a mistake, the update rule moves it towards the direction of the optimal weights w_{opt}. Specifically, denoting explicitly the updating iteration index by k, the current weight vector by w_{k}, and the updated weight vector by w_{k+1}, show that, if y_{i} w_{k}^{T} x_{i} < 0, we have

    w_{k+1}^{T} w_{opt} ≥ w_{k}^{T} w_{opt} + γ ‖w_{opt}‖    (1)

(5 points)
Hint: Consider (w_{k+1} − w_{k})^{T} w_{opt} and the defining property of w_{opt}.
1.2 Show that the length of the updated weights does not increase by a large amount. Mathematically, show that, if y_{i} w_{k}^{T} x_{i} < 0,

    ‖w_{k+1}‖^{2} ≤ ‖w_{k}‖^{2} + 1    (2)

(5 points)
Hint: Expand ‖w_{k+1}‖^{2} and substitute w_{k+1}.
1.3 Assume that the initial weight vector w_{0} = 0 (an all-zero vector). Using the results from Problems 1.1 and 1.2, show that for any iteration k + 1, with M being the total number of mistakes the algorithm has made in the first k iterations, we have

    γM ≤ ‖w_{k+1}‖ ≤ √M    (3)

(15 points)
Hint: Use the Cauchy–Schwarz inequality a^{T} b ≤ ‖a‖‖b‖ and a telescoping sum.

1.4 Using the result of Problem 1.3, conclude that M ≤ γ^{−2}.
(5 points)
Problem 2 Logistic Regression (30 points)
Recall that the logistic regression model is defined as:
    p(y = 1|x) = σ(w^{T} x + b)    (4)

    σ(z) = 1 / (1 + e^{−z})    (5)

Given a training set D = {(x_{n}, y_{n})}_{n=1}^{N}, where x_{n} ∈ R^{K×1} and y_{n} ∈ {0, 1}, we will minimize the cross-entropy error function to solve for w.
    min_{w,b} L(w, b) = min_{w,b} − Σ_{n} { y_{n} log[p(y_{n} = 1|x_{n})] + (1 − y_{n}) log[p(y_{n} = 0|x_{n})] }

                      = min_{w,b} − Σ_{n} { y_{n} log[σ(w^{T} x_{n} + b)] + (1 − y_{n}) log[1 − σ(w^{T} x_{n} + b)] }    (6)
2.1 Please derive the update rule for w using the Gradient Descent (GD) method. (10 points)
2.2 Suppose we have four training samples (x_{1}, y_{1}) = (1, 0), (x_{2}, y_{2}) = (1, 1), (x_{3}, y_{3}) = (1, 1), and (x_{4}, y_{4}) = (1, 1). Suppose our logistic regression model is p(y = 1|x) = σ(wx). We initialize this model with w = 0 and use learning rate η = 0.001. When using GD to optimize this model, after one batch iteration, what is the training accuracy? (15 points)
2.3 Based on the model obtained in Problem 2.2, if we have a test dataset containing three samples (x_{1}, y_{1}) = (−1, 0), (x_{2}, y_{2}) = (1, 1), and (x_{3}, y_{3}) = (1, 0), what is the test accuracy? (5 points)
Problem 3 Backpropagation (40 points)
Suppose we have a multi-class neural network as defined below. An illustration is provided in Fig. 1. Please answer the following questions.
[Figure 1 here: input units x_{1}, . . . , x_{4}; hidden units z_{1}, . . . , z_{4}; output units y_{1}, y_{2}, y_{3}; hidden-layer weights w_{ki} (e.g., w_{43}); output-layer weights v_{jk} (e.g., v_{24}).]
Figure 1: A neural network with one hidden layer.
Forward Propagation. For multi-class classification, we use a softmax layer with cross-entropy loss as the output. In the hidden layer, we use the tanh activation function. The forward propagation can be expressed as:
    input layer:    x_{i},  i = 1, . . . , 4    (7)

    hidden layer:   z_{k} = tanh( Σ_{i=1}^{4} w_{ki} x_{i} ),  where tanh(a) = (e^{a} − e^{−a}) / (e^{a} + e^{−a})    (8)

    output layer:   yˆ_{j} = softmax(o_{j}) = exp(o_{j}) / Σ_{i=1}^{3} exp(o_{i}),  where o_{j} = Σ_{k=1}^{4} v_{jk} z_{k}    (9)

    loss function:  L(y, yˆ) = − Σ_{j=1}^{3} y_{j} log yˆ_{j},  where yˆ_{j} is the prediction and y_{j} is the ground truth    (10)
Backpropagation. Please write down ∂L/∂v_{jk} and ∂L/∂w_{ki} in terms of only x_{i}, z_{k}, o_{j}, y_{j}, and/or yˆ_{j} using backpropagation. (40 points)
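Once you have derived the analytic gradients, a finite-difference check is a reliable way to verify them: perturb a single weight, recompute the loss, and compare the loss change against your formula. The sketch below computes such a numerical estimate of ∂L/∂v_{jk} for one entry (the analytic expression itself is left to your derivation); the input, weights, and helper names are illustrative:

```python
# Finite-difference estimate of dL/dv_jk for the network in Eqs. (7)-(10).
# Use it to cross-check the backprop expressions you derive.
import math

def forward_loss(x, W, V, y):
    z = [math.tanh(sum(W[k][i] * x[i] for i in range(4))) for k in range(4)]
    o = [sum(V[j][k] * z[k] for k in range(4)) for j in range(3)]
    denom = sum(math.exp(oi) for oi in o)
    yhat = [math.exp(oj) / denom for oj in o]
    return -sum(y[j] * math.log(yhat[j]) for j in range(3))

def numeric_grad_v(x, W, V, y, j, k, eps=1e-6):
    # Central difference: (L(v_jk + eps) - L(v_jk - eps)) / (2 eps)
    V[j][k] += eps
    lp = forward_loss(x, W, V, y)
    V[j][k] -= 2 * eps
    lm = forward_loss(x, W, V, y)
    V[j][k] += eps                  # restore the original weight
    return (lp - lm) / (2 * eps)

x = [1.0, 0.5, -0.5, 0.0]
W = [[0.1 * (k - i) for i in range(4)] for k in range(4)]
V = [[0.1 * (j + k) for k in range(4)] for j in range(3)]
y = [0, 1, 0]
g = numeric_grad_v(x, W, V, y, j=0, k=0)
print(g)   # compare against your analytic expression for dL/dv_00
```

The same perturb-and-recompute idea works for ∂L/∂w_{ki} by perturbing W[k][i] instead.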