Instructions: Solutions to problems 1 and 2 are to be submitted on Blackboard (PDF files strongly preferred). You are strongly encouraged to do problems 3-6, but these are not to be submitted for grading.

1. Suppose that S is an $n \times n$ matrix where n may be very large and the elements of S may not be explicitly defined. We are interested in approximating the trace of S, that is, the sum of its diagonal elements. For example, if S is a smoothing matrix in regression ($\hat y = Sy$) then the trace of S gives a measure of the effective number of parameters used in the smoothing method. (In multiple regression models, the smoothing matrix is the projection matrix $X(X^T X)^{-1}X^T$ whose trace is the number of columns of X.)
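This fact about the projection matrix is easy to verify numerically; a quick illustrative check in R (the design matrix here is arbitrary):

```r
# The trace of the projection ("hat") matrix X (X^T X)^{-1} X^T
# equals the number of columns of X.
set.seed(1)
X <- cbind(1, matrix(rnorm(40), 20, 2))   # 20 x 3 design matrix
H <- X %*% solve(crossprod(X)) %*% t(X)   # projection matrix
sum(diag(H))                              # ~ 3 (trace of a projection = its rank)
```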

(a) Show that if A and B are $m \times n$ and $n \times m$ matrices, respectively, then tr(AB) = tr(BA). (This is a well-known fact but humour me with a proof!)

(b) Suppose that V is a random vector of length n such that $E[VV^T] = I$. If S is an $n \times n$ non-random matrix, show that
$$E[V^T S V] = E[\mathrm{tr}(S V V^T)] = \mathrm{tr}(S\,E[V V^T]) = \mathrm{tr}(S)$$
and so tr(S) can be estimated by
$$\widehat{\mathrm{tr}}(S) = \frac{1}{m}\sum_{i=1}^m V_i^T S V_i$$
where $V_1, \ldots, V_m$ are independent random vectors with $E[V_i V_i^T] = I$.

(c) Suppose that the elements of each $V_i$ are independent, identically distributed random variables with mean 0 and variance 1. Show that $\mathrm{Var}(\widehat{\mathrm{tr}}(S))$ is minimized by taking the elements of $V_i$ to be $\pm 1$, each with probability 1/2.
Hint: This is easier than it looks. $\mathrm{Var}(V^T S V) = E[(V^T S V)^2] - \mathrm{tr}(S)^2$, so it suffices to minimize
$$E[(V^T S V)^2] = \sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^n\sum_{\ell=1}^n s_{ij}s_{k\ell}\,E(V_iV_jV_kV_\ell).$$
Given our conditions on the elements $V_1, \ldots, V_n$ of V, most of the terms $E(V_iV_jV_kV_\ell)$ are either 0 or 1. You should be able to show that
$$E[(V^T S V)^2] = \sum_{i=1}^n s_{ii}^2\,E(V_i^4) + \text{constant}$$
and find $V_i$ to minimize $E(V_i^4)$ subject to $E(V_i^2) = 1$.
(d) Suppose we estimate the function g in the nonparametric regression model
$$y_i = g(x_i) + \varepsilon_i \quad \text{for } i = 1, \ldots, n$$
using loess (i.e. the R function loess), where the smoothness is determined by the parameter span lying between 0 and 1. Given a set of predictors $\{x_i\}$ and a value of span, write an R function to approximate the effective number of parameters.
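One possible sketch of such a function, using the randomized trace estimator above with $\pm 1$ entries (the function and argument names eff_params, m are illustrative, not prescribed by the problem). Since loess is a linear smoother, fitting it to a response vector v yields fitted values $Sv$, so sum(v * fitted) computes $v^T S v$:

```r
# Sketch: randomized trace estimator for the loess smoothing matrix.
eff_params <- function(x, span, m = 100) {
  n <- length(x)
  estimates <- numeric(m)
  for (i in seq_len(m)) {
    v <- sample(c(-1, 1), n, replace = TRUE)  # +/-1 entries, prob 1/2 each
    fit <- loess(v ~ x, span = span)          # fitted values are S %*% v
    estimates[i] <- sum(v * fitted(fit))      # v^T S v
  }
  mean(estimates)                             # approximates tr(S)
}
```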
2. Suppose that $X_1, \ldots, X_n$ are independent Gamma random variables with common density
$$f(x; \alpha, \lambda) = \frac{\lambda^\alpha x^{\alpha-1}\exp(-\lambda x)}{\Gamma(\alpha)} \quad \text{for } x > 0$$
where $\alpha > 0$ and $\lambda > 0$ are unknown parameters.

(a) The mean and variance of the Gamma distribution are $\mu = \alpha/\lambda$ and $\sigma^2 = \alpha/\lambda^2$, respectively. Use these to define method of moments estimates of $\alpha$ and $\lambda$ based on the sample mean and variance of the data $x_1, \ldots, x_n$.
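Solving $\bar x = \alpha/\lambda$ and $s^2 = \alpha/\lambda^2$ for the two parameters gives one possible sketch in R (the function name is illustrative):

```r
# Method-of-moments sketch: solving xbar = alpha/lambda and s2 = alpha/lambda^2
# gives lambda = xbar/s2 and alpha = xbar^2/s2.
gamma_mom <- function(x) {
  xbar <- mean(x)
  s2 <- var(x)
  c(alpha = xbar^2 / s2, lambda = xbar / s2)
}
```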

(b) Derive the likelihood equations for the MLEs of $\alpha$ and $\lambda$, and derive a Newton-Raphson algorithm for computing the MLEs based on $x_1, \ldots, x_n$. Implement this algorithm in R and test it on data generated from a Gamma distribution (using the R function rgamma). Your function should also output an estimate of the variance-covariance matrix of the MLEs; this can be obtained from the Hessian of the log-likelihood function.
Important note: To implement the Newton-Raphson algorithm, you will need to compute the first and second derivatives of $\ln\Gamma(\alpha)$. These two derivatives are called (respectively) the digamma and trigamma functions, and they are available in R as digamma and trigamma; for example,

gamma(2)     # gamma function evaluated at 2
[1] 1

digamma(2)   # digamma function evaluated at 2
[1] 0.4227843

trigamma(2)  # trigamma function evaluated at 2
[1] 0.6449341
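A hedged sketch of such a Newton-Raphson implementation (function and variable names are my own; the starting values are taken from the method of moments, and the variance-covariance estimate is the inverse of the negative Hessian):

```r
# Newton-Raphson for the Gamma log-likelihood
#   l(alpha, lambda) = n*alpha*log(lambda) + (alpha-1)*sum(log x)
#                      - lambda*sum(x) - n*log(Gamma(alpha))
gamma_mle <- function(x, tol = 1e-8, maxit = 100) {
  n <- length(x); xbar <- mean(x); lbar <- mean(log(x))
  theta <- c(xbar^2 / var(x), xbar / var(x))  # method-of-moments start
  for (k in 1:maxit) {
    a <- theta[1]; l <- theta[2]
    grad <- c(n * log(l) - n * digamma(a) + n * lbar,  # d/d alpha
              n * a / l - n * xbar)                    # d/d lambda
    hess <- matrix(c(-n * trigamma(a), n / l,
                     n / l, -n * a / l^2), 2, 2)
    step <- solve(hess, grad)
    theta <- theta - step
    if (max(abs(step)) < tol) break
  }
  list(estimate = c(alpha = theta[1], lambda = theta[2]),
       vcov = solve(-hess))   # inverse observed information
}
```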
3. Consider LASSO estimation in linear regression where we define $\hat\beta_\lambda$ to minimize
$$\sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|$$
for some $\lambda > 0$. (We assume that the predictors are centred and scaled to have mean 0 and variance 1, in which case $\bar y$ is the estimate of the intercept.) Suppose that the least squares estimate (i.e. for $\lambda = 0$) is non-unique; this may occur, for example, if there is some exact linear dependence in the predictors or if $p > n$. Define
$$s^2 = \min_\beta \sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2$$
and the set
$$C = \left\{\beta : \sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 = s^2\right\}.$$
We want to look at what happens to the LASSO estimate $\hat\beta_\lambda$ as $\lambda \downarrow 0$.
(a) Show that $\hat\beta_\lambda$ minimizes
$$\frac{1}{\lambda}\left\{\sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 - s^2\right\} + \sum_{j=1}^p |\beta_j|.$$
(b) Find the limit of
$$\frac{1}{\lambda}\left\{\sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 - s^2\right\}$$
as $\lambda \downarrow 0$ as a function of $\beta$. (What happens when $\beta \notin C$?) Use this to deduce that, as $\lambda \downarrow 0$, $\hat\beta_\lambda \to \hat\beta_0$ where $\hat\beta_0$ minimizes $\sum_{j=1}^p |\beta_j|$ on the set $C$.
(c) Show that $\hat\beta_0$ is the solution of a linear programming problem. (Hint: Note that $C$ can be expressed in terms of $\beta$ satisfying $p$ linear equations.)
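One standard reformulation along the lines of the hint (a sketch, using the fact that $\beta \in C$ exactly when $\beta$ satisfies the normal equations $X^TX\beta = X^T(y - \bar y\,\mathbf{1})$): write $\beta_j = \beta_j^+ - \beta_j^-$ with $\beta_j^+, \beta_j^- \ge 0$, so that minimizing $\sum_j |\beta_j|$ over $C$ becomes the linear program

```latex
\min_{\beta^+,\,\beta^-} \; \sum_{j=1}^p (\beta_j^+ + \beta_j^-)
\quad \text{subject to} \quad
X^T X(\beta^+ - \beta^-) = X^T(y - \bar y\,\mathbf{1}), \qquad
\beta^+ \ge 0, \; \beta^- \ge 0.
```

At an optimum, at most one of $\beta_j^+, \beta_j^-$ is nonzero for each $j$, so $\beta_j^+ + \beta_j^- = |\beta_j|$.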
4. Consider minimizing the function
$$g(x) = x^2 - 2\theta x + \lambda|x|^\gamma$$
where $\lambda > 0$ and $0 < \gamma < 1$. (This problem arises, in a somewhat more complicated form, in shrinkage estimation in regression.) The function $|x|^\gamma$ has a "cusp" at 0, which means that if $\lambda$ is sufficiently large then g is minimized at x = 0.
(a) g is minimized at x = 0 if, and only if,
$$\lambda \ge \frac{2}{2-\gamma}\left[\frac{2(1-\gamma)}{2-\gamma}\right]^{1-\gamma}|\theta|^{2-\gamma}. \qquad (1)$$
Otherwise, g is minimized at $x_*$ satisfying $g'(x_*) = 0$. Using R, compare the following two iterative algorithms for computing $x_*$ (when condition (1) does not hold):
(i) Set $x_0 = \theta$ and define
$$x_k = \theta - \frac{\lambda\gamma|x_{k-1}|^\gamma}{2x_{k-1}}, \qquad k = 1, 2, 3, \ldots$$
(ii) The Newton-Raphson algorithm with $x_0 = \theta$.
Use different values of $\theta$, $\lambda$, and $\gamma$ to test these algorithms. Which algorithm is faster?
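A sketch of both algorithms in R (function names are mine; this assumes the minimizer $x_*$ is nonzero, so neither iteration should land at the cusp):

```r
# (i) Fixed-point iteration x_k = theta - lambda*gamma*|x_{k-1}|^gamma / (2*x_{k-1})
fixed_point <- function(theta, lambda, gamma, tol = 1e-10, maxit = 1000) {
  x <- theta
  for (k in 1:maxit) {
    xnew <- theta - lambda * gamma * abs(x)^gamma / (2 * x)
    if (abs(xnew - x) < tol) return(list(x = xnew, iters = k))
    x <- xnew
  }
  list(x = x, iters = maxit)
}

# (ii) Newton-Raphson on g'(x) = 2x - 2*theta + lambda*gamma*|x|^(gamma-1)*sign(x)
newton <- function(theta, lambda, gamma, tol = 1e-10, maxit = 1000) {
  x <- theta
  for (k in 1:maxit) {
    g1 <- 2 * x - 2 * theta + lambda * gamma * abs(x)^(gamma - 1) * sign(x)
    g2 <- 2 + lambda * gamma * (gamma - 1) * abs(x)^(gamma - 2)
    xnew <- x - g1 / g2
    if (abs(xnew - x) < tol) return(list(x = xnew, iters = k))
    x <- xnew
  }
  list(x = x, iters = maxit)
}
```

Comparing the iters components over a grid of $(\theta, \lambda, \gamma)$ values is one way to answer the speed question.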

(b) Functions like g arise in so-called bridge estimation in linear regression (a generalization of the LASSO); such estimation combines the features of ridge regression (which shrinks least squares estimates towards 0) and model selection methods (which produce exact 0 estimates for some or all parameters). Bridge estimates $\hat\beta$ minimize (for some $\lambda > 0$ and $\gamma > 0$)
$$\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|^\gamma. \qquad (2)$$
See the paper by Huang, Horowitz and Ma (2008) ("Asymptotic properties of bridge estimators in sparse high-dimensional regression models", Annals of Statistics 36, 587-613) for details. Describe how the algorithms in part (a) could be used to define a coordinate descent algorithm to find $\hat\beta$ minimizing (2) iteratively, one parameter at a time.
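One way such a coordinate descent scheme might look in R (a sketch under my own naming; each coordinate update rescales the partial-residual problem into the one-dimensional form of part (a), and the inner solver combines iteration (i) with the condition-(1) threshold for an exact zero):

```r
bridge_cd <- function(X, y, lambda, gamma, maxit = 100, tol = 1e-8) {
  p <- ncol(X)
  beta <- rep(0, p)
  # One-dimensional minimizer of x^2 - 2*theta*x + lam*|x|^gam
  solve1d <- function(theta, lam, gam) {
    if (theta == 0) return(0)
    # condition (1): minimum at 0 when lam is at least this threshold
    thresh <- 2 / (2 - gam) *
      (2 * (1 - gam) / (2 - gam))^(1 - gam) * abs(theta)^(2 - gam)
    if (lam >= thresh) return(0)
    x <- theta                               # iteration (i), x_0 = theta
    for (k in 1:500) {
      xnew <- theta - lam * gam * abs(x)^gam / (2 * x)
      if (abs(xnew - x) < 1e-12) break
      x <- xnew
    }
    x
  }
  for (it in 1:maxit) {
    beta_old <- beta
    for (j in 1:p) {
      r <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual
      c_j <- sum(X[, j]^2)
      # dividing the coordinate objective by c_j gives the part (a) form
      beta[j] <- solve1d(sum(X[, j] * r) / c_j, lambda / c_j, gamma)
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}
```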
(c) Prove that g is minimized at 0 if, and only if, condition (1) in part (a) holds.

5. Suppose that A is a symmetric non-negative definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. Consider the following algorithm for computing the maximum eigenvalue $\lambda_1$: Given $x_0$, define for $k = 0, 1, 2, \ldots$,
$$x_{k+1} = \frac{Ax_k}{\|Ax_k\|_2} \quad \text{and} \quad \hat\lambda_{k+1} = \frac{x_{k+1}^T A x_{k+1}}{x_{k+1}^T x_{k+1}}.$$
Under certain conditions, $\hat\lambda_k \to \lambda_1$, the maximum eigenvalue of A; this algorithm is known as the power method and is particularly useful when A is sparse.
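A minimal R sketch of the iteration (the function name and stopping rule are illustrative):

```r
# Power method: normalize A x_k, then form the Rayleigh quotient.
power_method <- function(A, x0, maxit = 1000, tol = 1e-10) {
  x <- x0
  lam <- 0
  for (k in seq_len(maxit)) {
    Ax <- A %*% x
    x <- Ax / sqrt(sum(Ax^2))                   # x_{k+1} = A x_k / ||A x_k||_2
    lam_new <- sum(x * (A %*% x)) / sum(x * x)  # Rayleigh quotient
    if (abs(lam_new - lam) < tol) return(lam_new)
    lam <- lam_new
  }
  lam
}
```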
(a) Suppose that $v_1, \ldots, v_n$ are the eigenvectors of A corresponding to the eigenvalues $\lambda_1, \ldots, \lambda_n$. Show that $\hat\lambda_k \to \lambda_1$ if $x_0^T v_1 \ne 0$ and $\lambda_1 > \lambda_2$.

(b) What happens to the algorithm if the maximum eigenvalue is not unique, that is, $\lambda_1 = \lambda_2 = \cdots = \lambda_k$?
Gauss-Seidel algorithm to estimate $\{\theta_i\}$). Use both gradient descent and accelerated gradient descent to estimate $\{\theta_i\}$. To find an appropriate value of the step size, it is useful to approximate the maximum eigenvalue of the Hessian matrix of the objective function; the algorithm in problem 5 is useful in this regard.
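Since the objective function for this problem is not specified in the excerpt above, here is only a generic sketch of the two updates with step size 1/L, where L (the maximum eigenvalue of the Hessian) could be approximated by the power method of problem 5; grad and theta0 are illustrative placeholders:

```r
# Plain gradient descent with constant step size 1/L.
grad_descent <- function(grad, theta0, L, maxit = 1000) {
  theta <- theta0
  for (k in 1:maxit) theta <- theta - grad(theta) / L
  theta
}

# Nesterov-style accelerated gradient descent with momentum (k-1)/(k+2).
accel_descent <- function(grad, theta0, L, maxit = 1000) {
  theta <- theta0
  theta_prev <- theta0
  for (k in 1:maxit) {
    nu <- theta + (k - 1) / (k + 2) * (theta - theta_prev)  # momentum step
    theta_prev <- theta
    theta <- nu - grad(nu) / L
  }
  theta
}
```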