HOMEWORK 1 – V2 Solution



1. Probability and Calculus.

1.1. Variance and covariance – 15 pts. Let $X, Y$ be two independent random vectors in $\mathbb{R}^m$.

  (a) Show that their covariance is zero.

  (b) For a constant matrix $A \in \mathbb{R}^{m \times m}$, show the following two properties:

$$E(X + AY) = E(X) + A\,E(Y)$$

$$\mathrm{Var}(X + AY) = \mathrm{Var}(X) + A\,\mathrm{Var}(Y)\,A^T$$

  (c) Using part (b), show that if $X \sim \mathcal{N}(\mu, \Sigma)$, then $AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$. Here, you may use the fact that a linear transformation of a Gaussian random vector is again Gaussian.
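The two identities in this question can be sanity-checked numerically before proving them. The sketch below is only an illustration under stated assumptions: the dimension, sample size, seed, the matrix `A`, and the Gaussian samples are all made up for the check and are not part of the assignment. A simulation like this supports the identities but does not prove them.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (assumption) for reproducibility
m, n_samples = 3, 200_000        # illustrative sizes, not from the assignment

A = rng.normal(size=(m, m))      # arbitrary constant matrix
# Independent samples of X and Y, one sample per row, with non-trivial covariances.
X = rng.normal(size=(n_samples, m)) @ rng.normal(size=(m, m))
Y = rng.normal(size=(n_samples, m)) @ rng.normal(size=(m, m)) + 1.0

Z = X + Y @ A.T                  # each row is x + A y

# Empirical mean and covariance of X + AY versus the claimed formulas.
mean_lhs = Z.mean(axis=0)
mean_rhs = X.mean(axis=0) + A @ Y.mean(axis=0)
cov_lhs = np.cov(Z, rowvar=False)
cov_rhs = np.cov(X, rowvar=False) + A @ np.cov(Y, rowvar=False) @ A.T

mean_gap = np.max(np.abs(mean_lhs - mean_rhs))
cov_gap = np.max(np.abs(cov_lhs - cov_rhs))
print(mean_gap, cov_gap)
```

The mean identity holds up to floating-point error because the sample mean is itself linear. The covariance gap is small but nonzero, because the empirical cross-covariance of independent samples only vanishes as the sample size grows.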

1.2. Densities – 10 pts. Answer the following questions:

  (a) Can a probability density function (pdf) ever take values greater than 1?

  (b) Let $X$ be a univariate normally distributed random variable with mean $0$ and variance $1/100$. What is the pdf of $X$?

  (c) What is the value of this pdf at $0$?

  (d) What is the probability that $X = 0$?
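To make the density questions concrete, the following sketch evaluates the standard univariate normal density formula at the variance given in the problem. The helper name `normal_pdf` is hypothetical and introduced only for this illustration.

```python
import math

def normal_pdf(x, mean=0.0, var=1.0):
    """Density of a univariate normal N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

var = 1.0 / 100.0                  # variance stated in the problem
value_at_zero = normal_pdf(0.0, mean=0.0, var=var)
print(value_at_zero)               # equals 10 / sqrt(2*pi), roughly 3.989
```

The printed value exceeds 1, which is consistent with a density being a rate per unit length rather than a probability.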

1.3. Calculus – 10 pts. Let $x, y \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times m}$. In vector notation, what is

  (a) the gradient with respect to $x$ of $x^T y$?

  (b) the gradient with respect to $x$ of $x^T x$?

  (c) the gradient with respect to $x$ of $x^T A x$?

  (d) the gradient with respect to $x$ of $Ax$?
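Candidate answers for the three scalar-valued functions can be checked against central finite differences. This is a minimal sketch, assuming the standard results $y$, $2x$, and $(A + A^T)x$; the helper `numerical_gradient`, the dimension, and the random inputs are hypothetical choices for illustration. The vector-valued case $Ax$ depends on the layout convention for derivatives and is not checked here.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of scalar f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
m = 4                              # illustrative dimension
x, y = rng.normal(size=m), rng.normal(size=m)
A = rng.normal(size=(m, m))        # deliberately not symmetric

# Each entry pairs a function of v with its candidate gradient evaluated at x.
checks = {
    "x^T y": (lambda v: v @ y, y),
    "x^T x": (lambda v: v @ v, 2.0 * x),
    "x^T A x": (lambda v: v @ A @ v, (A + A.T) @ x),
}
errors = {
    name: np.max(np.abs(numerical_gradient(f, x) - g))
    for name, (f, g) in checks.items()
}
print(errors)
```

Using a non-symmetric `A` matters: with a symmetric matrix the check could not distinguish $(A + A^T)x$ from $2Ax$.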

2. Regression.

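2.1. Linear regression – 15 pts. Suppose that $X \in \mathbb{R}^{n \times m}$ with $n \ge m$ and $Y \in \mathbb{R}^n$, and that $Y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$. We know that the maximum likelihood estimate $\hat\beta$ of $\beta$ is given by

$$\hat\beta = (X^T X)^{-1} X^T Y.$$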



  (a) Find the distribution of $\hat\beta$, its expectation and covariance matrix.

  (b) Write the log-likelihood implied by the model above, and compute its gradient w.r.t. $\beta$.

  (c) Assuming that $\sigma$ is known, what is the probability that an individual parameter $\hat\beta_i$ is in the $\epsilon$-neighborhood of the corresponding entry of the true parameter $\beta_i$, i.e. $P(|\hat\beta_i - \beta_i| \le \epsilon)$? (Hint: Use the Gaussian CDF $\Phi(t)$.)
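The quantities in this question can be sketched on simulated data. Everything below is an assumption made for illustration: the sizes, the noise level `sigma`, the "true" `beta`, and the tolerance `eps` are hypothetical. The sketch uses the standard facts that the estimate has covariance $\sigma^2 (X^T X)^{-1}$ under the stated model and that $\Phi$ can be written with the error function.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 200, 3, 0.5              # illustrative sizes and noise level
beta = np.array([1.0, -2.0, 0.5])      # hypothetical true parameter
X = rng.normal(size=(n, m))
Y = X @ beta + sigma * rng.normal(size=n)

# Maximum likelihood (least-squares) estimate via the normal equations.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)

# Covariance of beta_hat under the model: sigma^2 (X^T X)^{-1}.
cov_beta_hat = sigma**2 * np.linalg.inv(XtX)

def std_normal_cdf(t):
    """Standard Gaussian CDF Phi(t), written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

eps = 0.1                              # hypothetical neighborhood radius
sd = np.sqrt(np.diag(cov_beta_hat))    # standard deviation of each coordinate
prob_within_eps = np.array([2.0 * std_normal_cdf(eps / s) - 1.0 for s in sd])
print(beta_hat, prob_within_eps)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse for the estimate itself; the explicit inverse is used only to read off the diagonal of the covariance.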


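2.2. Ridge regression and MAP – 20 pts. Suppose that we have $Y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and we place a normal prior on $\beta$, i.e., $\beta \sim \mathcal{N}(0, \tau^2 I)$.

(a) Show that the MAP estimate of $\beta$ given $Y$ in this context is

$$\hat\beta_{MAP} = (X^T X + \lambda I)^{-1} X^T Y$$

where $\lambda = \sigma^2 / \tau^2$.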

(b) Show that ridge regression is equivalent to adding $m$ additional rows to $X$ where the $j$-th additional row has its $j$-th entry equal to $\sqrt{\lambda}$ and all other entries equal to zero, adding $m$ corresponding additional entries to $Y$ that are all $0$, and then computing the maximum likelihood estimate of $\beta$ using the modified $X$ and $Y$.
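The equivalence between ridge regression and least squares on augmented data can be checked numerically before proving it. This is a sketch on synthetic data: the sizes, the penalty `lam`, and the random `X` and `Y` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 50, 4, 0.7                 # illustrative sizes and penalty
X = rng.normal(size=(n, m))
Y = rng.normal(size=n)

# Closed-form ridge estimate (X^T X + lam I)^{-1} X^T Y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

# Augmented data: m extra rows sqrt(lam) * I appended to X, m zeros appended to Y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(m)])
Y_aug = np.concatenate([Y, np.zeros(m)])

# Ordinary least squares on the augmented data.
beta_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(np.max(np.abs(beta_ridge - beta_aug)))
```

The two estimates agree up to floating-point error, since the augmented Gram matrix is $X^T X + \lambda I$ and the augmented cross term is still $X^T Y$.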

2.3. Cross validation – 30 pts. In this problem, you will write a function that performs a K-fold cross-validation procedure to tune the penalty parameter $\lambda$ in ridge regression. Your cross_validation function will rely on 6 short functions, which are defined below along with their variables.

data is a variable and refers to a $(y, X)$ pair (can be test, training, or validation) where $y$ is the target (response) vector, and $X$ is the feature matrix.

model is a variable and refers to the coefficients of the trained model, i.e. $\hat\beta$.

data_shf = shuffle_data(data) is a function and takes data as an argument and returns its randomly permuted version along the samples. Here, we are considering a uniformly random permutation of the training data. Note that y and X need to be permuted the same way preserving the target-feature pairs.

data_fold, data_rest = split_data(data, num_folds, fold) is a function that takes data, number of partitions as num_folds and the selected partition fold as its arguments and returns the selected partition (block) fold as data_fold, and the remaining data as data_rest. If we consider 5-fold cross validation, num_folds=5, and your function splits the data into 5 blocks and returns the block fold ($\in \{1, 2, 3, 4, 5\}$) as the validation fold and the remaining 4 blocks as data_rest. Note that data_rest $\cup$ data_fold = data, and data_rest $\cap$ data_fold = $\emptyset$.

model = train_model(data, lambd) is a function that takes data and lambd as its arguments, and returns the coefficients of ridge regression with penalty level $\lambda$. For simplicity, you may ignore the intercept and use the expression in question 2.2.

predictions = predict(data, model) is a function that takes data and model as its arguments, and returns the predictions based on data and model.

error = loss(data, model) is a function which takes data and model as its arguments and returns the average squared error loss based on model. This means if data is composed of $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$, and model is $\hat\beta$, then the return value is $\|y - X\hat\beta\|^2 / n$.

cv_error = cross_validation(data, num_folds, lambd_seq) is a function that takes the training data, number of folds num_folds, and a sequence of $\lambda$'s as lambd_seq as its arguments and returns the cross validation error for each $\lambda$. Take lambd_seq as 50 evenly spaced numbers over the interval (0.02, 1.5). This means cv_error will be a vector of 50 errors corresponding to the values of lambd_seq. Your function will look like:

    data = shuffle_data(data)
    for i = 1,2,…,length(lambd_seq)
        lambd = lambd_seq(i)
        cv_loss_lmd = 0.
        for fold = 1,2,…,num_folds
            val_cv, train_cv = split_data(data, num_folds, fold)
            model = train_model(train_cv, lambd)
            cv_loss_lmd += loss(val_cv, model)
        cv_error(i) = cv_loss_lmd / num_folds
    return cv_error
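One possible Python rendering of the six functions and the loop above is sketched here. It is not the reference solution and rests on several assumptions: data is a `(y, X)` tuple of NumPy arrays, the `seed` argument of `shuffle_data` is an addition for reproducibility, `np.linspace` includes both endpoints of the interval, and the synthetic data at the bottom is a stand-in for dataset.mat.

```python
import numpy as np

def shuffle_data(data, seed=0):
    """Permute y and X with the same random permutation of the samples."""
    y, X = data
    perm = np.random.default_rng(seed).permutation(len(y))
    return y[perm], X[perm]

def split_data(data, num_folds, fold):
    """Return (data_fold, data_rest) for fold in 1..num_folds (num_folds >= 2)."""
    y, X = data
    blocks = np.array_split(np.arange(len(y)), num_folds)
    fold_idx = blocks[fold - 1]
    rest_idx = np.concatenate(blocks[:fold - 1] + blocks[fold:])
    return (y[fold_idx], X[fold_idx]), (y[rest_idx], X[rest_idx])

def train_model(data, lambd):
    """Ridge coefficients (X^T X + lambd I)^{-1} X^T y, without intercept."""
    y, X = data
    return np.linalg.solve(X.T @ X + lambd * np.eye(X.shape[1]), X.T @ y)

def predict(data, model):
    """Predictions X @ model for the feature matrix in data."""
    _, X = data
    return X @ model

def loss(data, model):
    """Average squared error ||y - X model||^2 / n."""
    y, _ = data
    return np.mean((y - predict(data, model)) ** 2)

def cross_validation(data, num_folds, lambd_seq):
    """Average validation loss over the folds for every penalty in lambd_seq."""
    data = shuffle_data(data)
    cv_error = np.zeros(len(lambd_seq))
    for i, lambd in enumerate(lambd_seq):
        cv_loss_lmd = 0.0
        for fold in range(1, num_folds + 1):
            val_cv, train_cv = split_data(data, num_folds, fold)
            model = train_model(train_cv, lambd)
            cv_loss_lmd += loss(val_cv, model)
        cv_error[i] = cv_loss_lmd / num_folds
    return cv_error

# Synthetic stand-in data (assumption): the real dataset.mat is not used here.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=60)
lambd_seq = np.linspace(0.02, 1.5, 50)
cv_error = cross_validation((y, X), num_folds=5, lambd_seq=lambd_seq)
print(lambd_seq[np.argmin(cv_error)])
```

Shuffling once, outside both loops, means every value of lambd is evaluated on the same folds, so the resulting errors are directly comparable.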

  (a) Download the dataset dataset.mat from the course webpage and place it in your working directory, or note its location file_path. For example, file_path could be /Users/yourname/Desktop/

In R:

    library(R.matlab)
    dataset = readMat('file_path/dataset.mat')
    data.train.X = dataset$data.train.X
    data.train.y = dataset$data.train.y[1,]
    data.test.X = dataset$data.test.X
    data.test.y = dataset$data.test.y[1,]

In Python:

    import scipy.io as sio
    dataset = sio.loadmat('file_path/dataset.mat')
    data_train_X = dataset['data_train_X']
    data_train_y = dataset['data_train_y'][0]
    data_test_X = dataset['data_test_X']
    data_test_y = dataset['data_test_y'][0]

  (b) Write the above 6 functions, and identify the correct order and arguments to do cross validation.

  (c) Find the training and test errors corresponding to each $\lambda$ in lambd_seq. This part does not use the cross_validation function but you may find the other functions helpful.

  (d) Plot training error, test error, and 5-fold and 10-fold cross validation errors on the same plot for each value in lambd_seq. What is the value of $\lambda$ proposed by your cross validation procedure? Comment on the shapes of the error curves.