IMPORTANT!!!!! Before proceeding, please carefully read the homework instructions:
www.cs.ubc.ca/~schmidtm/Courses/540-W18/assignments.pdf
We will deduct 50% on assignments that do not follow the instructions.
Most of the questions below are related to topics covered in CPSC 340, or other courses listed on the prerequisite form. There are several "notes" available on the webpage which can help with some relevant background.
If you find this assignment to be difficult overall, that is an early warning sign that you may not be prepared to take CPSC 540 at this time. Future assignments will be longer and more difficult than this one.
We use blue to highlight the deliverables that you must answer/do/submit with the assignment.
1    Basic Information
Give a short and concise 1-2 sentence answer to each of the questions below.
2.1    Minimizing Strictly-Convex Quadratic Functions
Solve for the minimizer w of the following strictly-convex quadratic functions:
3. f(w) = \frac{1}{2}\sum_{i=1}^{n} v_i (w^\top x_i - y_i)^2 + \frac{1}{2}(w - u)^\top \Lambda (w - u) (weighted least squares shrunk towards u).
Above we use our usual supervised learning notation. In addition, we assume that u is d × 1 and v is n × 1, while Λ is a symmetric positive-definite d × d matrix. You can use V as a diagonal matrix with v along the diagonal (with the v_i non-negative). Hint: positive-definite matrices are invertible.
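As a generic illustration of the hint (not one of the functions above): a strictly-convex quadratic f(w) = \frac{1}{2}w^\top A w - b^\top w with A symmetric positive-definite has gradient Aw - b, so the unique minimizer solves the linear system Aw = b. A minimal Julia sketch:

using LinearAlgebra
# Generic strictly-convex quadratic: f(w) = (1/2)w'*A*w - b'*w.
# A is symmetric positive-definite, hence invertible, so setting the
# gradient A*w - b to zero gives the unique minimizer.
A = [2.0 0.5; 0.5 1.0]
b = [1.0, -1.0]
w = A \ b                        # solve A*w = b (avoid forming inv(A))
@assert norm(A * w - b) < 1e-10  # gradient is (numerically) zero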
2.2    Norm Inequalities
Show that the following inequalities hold for vectors w ∈ R^d, u ∈ R^d, and X ∈ R^{n×d}:
You should use the definitions of the norms, but should not use the known equivalences between these norms (since these are the things you are trying to prove). Hint: for many of these it's easier if you work with squared values (and you may need to "complete the square"). Beyond non-negativity of norms, it may also help to use the Cauchy-Schwarz inequality, to use that \|x\|_1 = x^\top \text{sign}(x), and to use that \sum_{i=1}^{n}\sum_{j=1}^{d} x_{ij}^2 = \sum_{c=1}^{\min\{n,d\}} \sigma_c(X)^2 (where \sigma_c(X) is singular value c of X).
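If it helps, the two identities in the hint are easy to check numerically in Julia (a quick sanity check, not a substitute for the proofs):

using LinearAlgebra
x = randn(5)
@assert isapprox(norm(x, 1), dot(x, sign.(x)))       # ||x||_1 = x'*sign(x)
X = randn(4, 3)
@assert isapprox(sum(X .^ 2), sum(svdvals(X) .^ 2))  # squared Frobenius norm = sum of squared singular values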
2.3    MAP Estimation
In 340, we showed that under the assumptions of a Gaussian likelihood and Gaussian prior,
y_i \sim \mathcal{N}(w^\top x_i, 1), \qquad w_j \sim \mathcal{N}\left(0, \frac{1}{\lambda}\right),
that the MAP estimate is equivalent to solving the L2-regularized least squares problem
f(w) = \frac{1}{2}\sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^{d} w_j^2,
in the "loss plus regularizer" framework. For each of the alternate assumptions below, write it in the "loss plus regularizer" framework (simplifying as much as possible):
y_i \sim \mathcal{L}(w^\top x_i, 1), \qquad w_j \sim \mathcal{N}(0, \sigma_j^2).
p(y_i \mid x_i, w) = \frac{1}{\sqrt{\nu}\, B\!\left(\frac{\nu}{2}, \frac{1}{2}\right)} \left(1 + \frac{(w^\top x_i - y_i)^2}{\nu}\right)^{-\frac{\nu + 1}{2}}, \qquad w_j \sim \mathcal{N}\left(u_j, \frac{1}{\lambda}\right),
where u is d × 1, B is the "Beta" function, and the parameter ν is called the "degrees of freedom".
p(y_i \mid w^\top x_i) = \frac{\exp(y_i w^\top x_i)\exp(-\exp(w^\top x_i))}{y_i!}, \qquad p(w_j) \propto 1.
(This prior is "improper" in that it does not integrate to 1 over the domain w ∈ R^d, but nevertheless the posterior will be a proper distribution.)
For this question, you do not need to convert to matrix notation.
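As a reminder of the general recipe (standard, not specific to these models): maximizing the posterior is equivalent to minimizing its negative logarithm, which splits into a loss from the likelihood and a regularizer from the prior,

\hat{w} \in \operatorname*{argmax}_w \; p(w \mid X, y) = \operatorname*{argmin}_w \left[-\sum_{i=1}^{n} \log p(y_i \mid x_i, w) - \sum_{j=1}^{d} \log p(w_j)\right],

where any terms that do not depend on w can be dropped.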
2.4    Gradients and Hessians in Matrix Notation
Express the gradient ∇f(w) and Hessian ∇²f(w) of the following functions in matrix notation, simplifying as much as possible:
f(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2 + w^\top u,
where u is d × 1.
f(w) = \frac{1}{2}\sum_{i=1}^{n} v_i (w^\top x_i - y_i)^2 + \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} w_i w_j \lambda_{ij},
where you can use V as a matrix with the v_i along the diagonal and Λ as a positive-definite d × d (symmetric) matrix with λ_{ij} in position (i, j).
f(w) = \frac{1}{2}\sum_{i=1}^{n} \max\{0, 1 - y_i w^\top x_i\}^2.
Hint: You can use the results from the linear and quadratic gradients and Hessians notes to simplify the derivations. You can use 0 to represent the zero vector or a matrix of zeroes and I to denote the identity matrix. It will help to convert the second question to matrix notation first. For the last question you'll need to define new vectors to express the gradient and Hessian in matrix notation, and you can use ∘ as element-wise multiplication of vectors. As a sanity check, make sure that your results have the right dimension.
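Beyond checking dimensions, a finite-difference comparison is a useful sanity check on your derivations. A minimal generic sketch in Julia (the names here are illustrative, not from the assignment code):

# Compare an analytic gradient against central finite differences.
function checkGradient(f, grad, w; h=1e-6)
    g = grad(w)
    gNum = zero(g)
    for j in eachindex(w)
        e = zeros(length(w))
        e[j] = h
        gNum[j] = (f(w + e) - f(w - e)) / (2h)  # central difference
    end
    return maximum(abs.(g - gNum))              # should be near zero
end

# Example on f(w) = (1/2)||w||^2, whose gradient is w itself.
w = randn(4)
println(checkGradient(v -> 0.5 * sum(v .^ 2), v -> v, w))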
If you have not previously used Julia, there is a list of useful Julia commands (and syntax) among the list of notes on the course webpage.
3.1    Regularization and Hyper-Parameter Tuning
Download a1.zip from the course webpage, and start Julia (latest version) in a directory containing the extracted files. If you run the script example_nonLinear.jl, it will load a regression dataset, fit a least-squares linear model, and plot the data along with the model's predictions.
This script uses the JLD package to load the data and the PyPlot package to make the plot. (Note: last term, several people, eventually including myself, had a runtime problem on some systems; this seems to be fixed using the answer of K. Gkinis at https://stackoverflow.com/questions/46399480/julia-runtime-error-when-using-pyplot.) If you have not previously used these packages, they can be installed using:
using Pkg
Pkg.add("JLD")
Pkg.add("PyPlot")
Unfortunately, this is not a great model of the data, and the figure shows that a linear model is probably not suitable.
You should start from the leastSquares function and use the same conventions: n refers to the number of training examples, d refers to the number of features, X refers to the data matrix, y refers to the targets, Z refers to the data matrix after the change of basis, and so on. Note that you'll have to add two additional input arguments (λ for the regularization parameter and σ for the Gaussian RBF variance) compared to the leastSquares function. To make your code easier to understand/debug, you may want to define a new function rbfBasis which computes the Gaussian RBFs for a given training set, testing set, and σ value. Hand in your function and the plot generated with λ = 1 and σ = 1.
Note that multiplication by an n × d matrix costs O(nd) and that solving a d × d linear system costs O(d³).
Note: the distancesSquared function in misc.jl is a vectorized way to quickly compute the squared Euclidean distance between all pairs of rows in two matrices.
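For concreteness, here is a minimal sketch of what such an rbfBasis function could look like, assuming distancesSquared(X1, X2) returns the matrix of pairwise squared distances between the rows of X1 and the rows of X2 (treat this as a sketch, not the required implementation):

include("misc.jl")  # provides distancesSquared (see the note above)

# Gaussian RBF features of Xhat relative to the training rows of X,
# with bandwidth sigma: Z[i,j] = exp(-||xhat_i - x_j||^2 / (2*sigma^2)).
function rbfBasis(Xhat, X, sigma)
    D = distancesSquared(Xhat, X)
    return exp.(-D ./ (2 * sigma^2))
end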
3.2    Multi-Class Logistic Regression
The script example_multiClass.jl loads a multi-class classification dataset and fits a "one-vs-all" logistic regression classifier, then reports the validation error and shows a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For example, this classifier never even predicts some of the classes.
Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the softmax probability,
p(y_i \mid W, x_i) = \frac{\exp(w_{y_i}^\top x_i)}{\sum_{c=1}^{k} \exp(w_c^\top x_i)}.
Here c is a possible label and w_c is column c of W. Similarly, y_i is the training label and w_{y_i} is column y_i of W. The loss function corresponding to the negative logarithm of the softmax probability is given by
f(W) = \sum_{i=1}^{n} \left[ -w_{y_i}^\top x_i + \log\left( \sum_{c'=1}^{k} \exp(w_{c'}^\top x_i) \right) \right].
Make a new function, softmaxClassifier, which fits W using the softmax loss from the previous section instead of fitting k independent classifiers. Hand in the code and report the validation error.
Hint: you can use the derivativeCheck option when calling findMin to help you debug the gradient of the softmax loss. Also, note that the findMin function treats the parameters as a vector (you may want to use reshape when writing the softmax objective).
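For illustration, a minimal sketch of evaluating the softmax loss above, assuming y ∈ {1, ..., k}^n and that the parameters arrive as a flattened vector the way findMin passes them (the gradient, which findMin also needs, is left to you):

# Softmax loss f(W) for parameters passed as a flattened vector.
function softmaxLoss(wVec, X, y, k)
    n, d = size(X)
    W = reshape(wVec, d, k)    # column c holds the weights w_c for class c
    S = X * W                  # S[i,c] = w_c'*x_i
    f = 0.0
    for i in 1:n
        f += -S[i, y[i]] + log(sum(exp.(S[i, :])))
    end
    return f
end

For numerical stability, you may want to subtract each row's maximum inside the exponentials (the "log-sum-exp" trick).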
3.3    Robust and Brittle Regression
The script example_outliers.jl loads a one-dimensional regression dataset that has a non-trivial number of "outlier" data points. These points do not fit the general trend of the rest of the data, and pull the least squares model away from the main cluster of points. One way to improve the performance in this setting is simply to remove or downweight the outliers. However, in high dimensions it may be difficult to determine whether points are indeed outliers (or the errors might simply be heavy-tailed). In such cases, it is preferable to replace the squared error with an error that is more robust to outliers.
f(w) = \|Xw - y\|_1.
You should turn this into a linear program as shown in class, and you can solve this linear program using the linprog function in the MathProgBase package. Hand in the new function and the updated plot.
f(w) = \|Xw - y\|_\infty.
This objective function is non-smooth because of the absolute value function as well as the max function.
Show how to formulate this non-smooth optimization problem as a linear program.
To use the linprog function, you can use:
using MathProgBase, GLPKMathProgInterface
solution = linprog(c, A, d, b, lb, ub, GLPKSolverLP())
x = solution.sol
This requires installing the appropriate packages, and finds a vector x minimizing the function c^⊤x subject to d ≤ Ax ≤ b and lb ≤ x ≤ ub. You can set values of c to 0 for variables that don't affect the cost function, and you can set values in b, d, lb, and ub (or the other variables) to appropriately infinite values if there are no lower/upper bounds. The vectors c, d, b, lb, and ub should all be lists.
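As a quick illustration of the argument format (a toy problem, unrelated to the assignment's linear programs):

using MathProgBase, GLPKMathProgInterface
# Minimize x1 + 2*x2 subject to 1 <= x1 + x2 (no row upper bound)
# and 0 <= x <= 10 elementwise.
c = [1.0, 2.0]
A = [1.0 1.0]
d = [1.0]         # row lower bounds
b = [Inf]         # row upper bounds (Inf means unbounded)
lb = [0.0, 0.0]
ub = [10.0, 10.0]
solution = linprog(c, A, d, b, lb, ub, GLPKSolverLP())
x = solution.sol  # optimal point, here [1.0, 0.0]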