1. Probability and Calculus.
1.1. Variance and covariance – 15 pts. Let $X, Y$ be two independent random vectors in $\mathbb{R}^m$.

(a) Show that their covariance is zero.

(b) For a constant matrix $A \in \mathbb{R}^{m \times m}$, show the following two properties:
$$E(X + AY) = E(X) + A\,E(Y)$$
$$\mathrm{Var}(X + AY) = \mathrm{Var}(X) + A\,\mathrm{Var}(Y)\,A^T$$

(c) Using part (b), show that if $X \sim \mathcal{N}(\mu, \Sigma)$, then $AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$. Here, you may use the fact that a linear transformation of a Gaussian random vector is again Gaussian.
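A simulation can serve as a sanity check on the variance identity in this question before attempting the proof. The sketch below is only an illustration under assumed choices: the matrix `A`, the two distributions, the sample size, and the seed are all hypothetical and not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 3, 200_000

# Hypothetical constant matrix and independent random vectors (assumptions).
A = rng.normal(size=(m, m))
X = rng.normal(size=(n_samples, m)) @ np.diag([1.0, 2.0, 0.5])  # Var(X) = diag(1, 4, 0.25)
Y = rng.uniform(-1.0, 1.0, size=(n_samples, m))                  # Var(Y) = I / 3

# Each row of Z is one draw of X + A Y.
Z = X + Y @ A.T

empirical = np.cov(Z, rowvar=False)
theoretical = np.diag([1.0, 4.0, 0.25]) + A @ (np.eye(m) / 3.0) @ A.T

max_abs_diff = np.max(np.abs(empirical - theoretical))
print(max_abs_diff)  # small, and shrinking as n_samples grows
```

Agreement up to sampling noise supports the identity but does not replace the derivation asked for above.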
1.2. Densities – 10 pts. Answer the following questions:

Can a probability density function (pdf) ever take values greater than 1?

Let X be a univariate normally distributed random variable with mean 0 and variance 1/100. What is the pdf of X?

What is the value of this pdf at 0?

What is the probability that X = 0?
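A hand-derived density can be checked numerically. The following minimal sketch uses only the standard normal density formula; the helper name `normal_pdf` and the integration grid are assumptions made for illustration.

```python
import math

def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Density of a univariate normal distribution evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

# Density at the mean for variance 1/100.
density_at_zero = normal_pdf(0.0, mean=0.0, variance=1.0 / 100.0)
print(density_at_zero)

# Riemann sum over [-1, 1], i.e. ten standard deviations on each side of the mean.
step = 1e-4
total_mass = sum(normal_pdf(-1.0 + k * step, 0.0, 0.01) * step for k in range(20001))
print(total_mass)  # close to 1
```

Comparing `density_at_zero` with `total_mass` illustrates the difference between the value of a density at a point and the probability mass it integrates to.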
1.3. Calculus – 10 pts. Let $x, y \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times m}$. In vector notation, what is

the gradient with respect to $x$ of $x^T y$?

the gradient with respect to $x$ of $x^T x$?

the gradient with respect to $x$ of $x^T A x$?

the gradient with respect to $x$ of $Ax$?
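Closed-form gradients of scalar functions can be compared against finite differences. The sketch below assumes NumPy and a hypothetical helper `numerical_gradient`; it compares three standard identities with a central-difference approximation at a random point, and is meant as a checking aid rather than a derivation.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
m = 4
x = rng.normal(size=m)
y = rng.normal(size=m)
A = rng.normal(size=(m, m))

# Numerical gradients of the three scalar functions in this question.
grad_xty = numerical_gradient(lambda v: v @ y, x)
grad_xtx = numerical_gradient(lambda v: v @ v, x)
grad_xtAx = numerical_gradient(lambda v: v @ A @ v, x)

# Compare with candidate closed forms.
print(np.allclose(grad_xty, y, atol=1e-5))
print(np.allclose(grad_xtx, 2.0 * x, atol=1e-5))
print(np.allclose(grad_xtAx, (A + A.T) @ x, atol=1e-5))
```

The helper only handles scalar-valued functions; the vector-valued map $x \mapsto Ax$ would need a row-by-row check.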


2. Regression.
2.1. Linear regression – 15 pts. Suppose that $X \in \mathbb{R}^{n \times m}$ with $n \ge m$ and $Y \in \mathbb{R}^n$, and that $Y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$. We know that the maximum likelihood estimate $\hat\beta$ of $\beta$ is given by
$$\hat\beta = (X^T X)^{-1} X^T Y.$$

(a) Find the distribution of $\hat\beta$, its expectation and covariance matrix.

(b) Write the log-likelihood implied by the model above, and compute its gradient w.r.t. $\beta$.

(c) Assuming that $\sigma^2$ is known, what is the probability that an individual parameter $\hat\beta_i$ is in the $\epsilon$-neighborhood of the corresponding entry of the true parameter $\beta_i$, i.e. $P(|\hat\beta_i - \beta_i| \le \epsilon)$? (Hint: Use the Gaussian CDF $\Phi(t)$.)
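The maximum likelihood estimate from the setup of 2.1 can be computed directly from its closed form. The sketch below uses synthetic data; the dimensions, true coefficients, noise level, and seed are assumptions chosen for illustration. It cross-checks the normal-equations solution against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3

# Synthetic design matrix and response (hypothetical values).
X = rng.normal(size=(n, m))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
Y = X @ beta_true + sigma * rng.normal(size=n)

# Solve (X^T X) beta = X^T Y instead of forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Independent check with the built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical-stability choice; it returns the same estimate.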
2.2. Ridge regression and MAP – 20 pts. Suppose that we have $Y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and we place a normal prior on $\beta$, i.e., $\beta \sim \mathcal{N}(0, \tau^2 I)$.
(a) Show that the MAP estimate of $\beta$ given $Y$ in this context is
$$\hat\beta_{\mathrm{MAP}} = (X^T X + \lambda I)^{-1} X^T Y$$
where $\lambda = \sigma^2 / \tau^2$.

(b) Show that ridge regression is equivalent to adding $m$ additional rows to $X$ where the $j$th additional row has its $j$th entry equal to $\sqrt{\lambda}$ and all other entries equal to zero, adding $m$ corresponding additional entries to $Y$ that are all 0, and then computing the maximum likelihood estimate of $\beta$ using the modified $X$ and $Y$.
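The equivalence between ridge regression and least squares on augmented data can be checked numerically before proving it. The following sketch assumes synthetic data and an arbitrary penalty value `lam`, both hypothetical; it compares the closed-form ridge solution with ordinary least squares on the augmented pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 50, 4, 0.7

# Synthetic data (assumption for illustration only).
X = rng.normal(size=(n, m))
Y = rng.normal(size=n)

# Closed-form ridge solution (X^T X + lam I)^{-1} X^T Y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

# Augment X with sqrt(lam) * I and Y with m zeros, then run ordinary least squares.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(m)])
Y_aug = np.concatenate([Y, np.zeros(m)])
beta_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(np.allclose(beta_ridge, beta_aug))
```

The check works because the augmented Gram matrix `X_aug.T @ X_aug` equals `X.T @ X + lam * I`, while `X_aug.T @ Y_aug` equals `X.T @ Y`.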
2.3. Cross validation – 30 pts. In this problem, you will write a function that performs the K-fold cross validation procedure to tune the penalty parameter $\lambda$ in ridge regression. Your cross_validation function will rely on 6 short functions which are defined below along with their variables.
data is a variable and refers to a (y, X) pair (can be test, training, or validation) where y is the target (response) vector, and X is the feature matrix.
model is a variable and refers to the coefficients of the trained model, i.e. $\hat\beta$.
data_shf = shuffle_data(data) is a function that takes data as an argument and returns a randomly permuted version of it along the samples. Here, we are considering a uniformly random permutation of the training data. Note that y and X need to be permuted the same way, preserving the target-feature pairs.
data_fold, data_rest = split_data(data, num_folds, fold) is a function that takes data, the number of partitions as num_folds, and the selected partition fold as its arguments and returns the selected partition (block) fold as data_fold, and the remaining data as data_rest. If we consider 5-fold cross validation, num_folds=5, and your function splits the data into 5 blocks and returns the block fold ($\in \{1, 2, 3, 4, 5\}$) as the validation fold and the remaining 4 blocks as data_rest. Note that data_rest $\cup$ data_fold = data, and data_rest $\cap$ data_fold = $\emptyset$.
model = train_model(data, lambd) is a function that takes data and lambd as its arguments, and returns the coefficients of ridge regression with penalty level $\lambda$. For simplicity, you may ignore the intercept and use the expression in question 2.2.
predictions = predict(data, model) is a function that takes data and model as its arguments, and returns the predictions based on data and model.
error = loss(data, model) is a function which takes data and model as its arguments and returns the average squared error loss based on model. This means if data is composed of $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$, and model is $\hat\beta$, then the return value is $\|y - X\hat\beta\|^2 / n$.
cv_error = cross_validation(data, num_folds, lambd_seq) is a function that takes the training data, the number of folds num_folds, and a sequence of $\lambda$'s as lambd_seq as its arguments and returns the cross validation error across all $\lambda$'s. Take lambd_seq as 50 evenly spaced numbers over the interval (0.02, 1.5). This means cv_error will be a vector of 50 errors corresponding to the values of lambd_seq. Your function will look like:
    data = shuffle_data(data)
    for i = 1,2,…,length(lambd_seq)
        lambd = lambd_seq(i)
        cv_loss_lmd = 0.
        for fold = 1,2,…,num_folds
            val_cv, train_cv = split_data(data, num_folds, fold)
            model = train_model(train_cv, lambd)
            cv_loss_lmd += loss(val_cv, model)
        cv_error(i) = cv_loss_lmd / num_folds
    return cv_error
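The pseudocode above could be realized in Python roughly as follows. This is one possible sketch and not a reference solution. It rests on several assumptions: `data` is stored as a `(y, X)` tuple of NumPy arrays, an extra `rng` argument (not in the signatures above) is passed for reproducibility, folds are contiguous blocks of the shuffled data, `np.linspace` with both endpoints included is used for `lambd_seq`, and a synthetic dataset stands in for the course data.

```python
import numpy as np

def shuffle_data(data, rng):
    """Apply one random permutation to y and X so target-feature pairs stay aligned."""
    y, X = data
    perm = rng.permutation(len(y))
    return y[perm], X[perm]

def split_data(data, num_folds, fold):
    """Return (data_fold, data_rest) for a 1-based fold index."""
    y, X = data
    blocks = np.array_split(np.arange(len(y)), num_folds)
    fold_idx = blocks[fold - 1]
    rest_idx = np.concatenate(blocks[:fold - 1] + blocks[fold:])
    return (y[fold_idx], X[fold_idx]), (y[rest_idx], X[rest_idx])

def train_model(data, lambd):
    """Ridge coefficients from the closed form in question 2.2 (no intercept)."""
    y, X = data
    return np.linalg.solve(X.T @ X + lambd * np.eye(X.shape[1]), X.T @ y)

def predict(data, model):
    """Linear predictions X @ beta."""
    _, X = data
    return X @ model

def loss(data, model):
    """Average squared error ||y - X beta||^2 / n."""
    y, _ = data
    return np.mean((y - predict(data, model)) ** 2)

def cross_validation(data, num_folds, lambd_seq, rng):
    """K-fold cross validation error for each penalty value in lambd_seq."""
    data = shuffle_data(data, rng)
    cv_error = np.zeros(len(lambd_seq))
    for i, lambd in enumerate(lambd_seq):
        cv_loss_lmd = 0.0
        for fold in range(1, num_folds + 1):
            val_cv, train_cv = split_data(data, num_folds, fold)
            model = train_model(train_cv, lambd)
            cv_loss_lmd += loss(val_cv, model)
        cv_error[i] = cv_loss_lmd / num_folds
    return cv_error

# Synthetic stand-in for the course dataset (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=100)

lambd_seq = np.linspace(0.02, 1.5, 50)
cv_error = cross_validation((y, X), num_folds=5, lambd_seq=lambd_seq, rng=rng)
print(lambd_seq[np.argmin(cv_error)])
```

Shuffling once before the loop over penalties means every value of lambd is evaluated on the same folds, which keeps the errors comparable across lambd_seq.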

Download the dataset dataset.mat from the course webpage and place it in your working directory, or note its location file_path. For example, file_path could be /Users/yourname/Desktop/
In R:
    library(R.matlab)
    dataset = readMat('file_path/dataset.mat')
    data.train.X = dataset$data.train.X
    data.train.y = dataset$data.train.y[1,]
    data.test.X = dataset$data.test.X
    data.test.y = dataset$data.test.y[1,]
In Python:
    import scipy.io as sio
    dataset = sio.loadmat('file_path/dataset.mat')
    data_train_X = dataset['data_train_X']
    data_train_y = dataset['data_train_y'][0]
    data_test_X = dataset['data_test_X']
    data_test_y = dataset['data_test_y'][0]

Write the above 6 functions, and identify the correct order and arguments to do cross validation.

Find the training and test errors corresponding to each $\lambda$ in lambd_seq. This part does not use the cross_validation function but you may find the other functions helpful.

Plot training error, test error, and 5-fold and 10-fold cross validation errors on the same plot for each value in lambd_seq. What is the value of $\lambda$ proposed by your cross validation procedure? Comment on the shapes of the error curves.