Instructions: Solutions to problems 1 and 2 are to be submitted on Blackboard (PDF files strongly preferred). You are strongly encouraged to do problems 3-6, but these are not to be submitted for grading.

1. Suppose that S is an $n \times n$ matrix where n may be very large and the elements of S may not be explicitly defined. We are interested in approximating the trace of S, that is, the sum of its diagonal elements. For example, if S is a smoothing matrix in regression ($\hat y = Sy$) then the trace of S gives a measure of the effective number of parameters used in the smoothing method. (In multiple regression models, the smoothing matrix is the projection matrix $X(X^T X)^{-1}X^T$ whose trace is the number of columns of X.)
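This fact about the projection matrix is easy to verify numerically; a quick illustrative check in R (the design matrix here is arbitrary):

```r
# The trace of the projection ("hat") matrix X (X^T X)^{-1} X^T
# equals the number of columns of X.
set.seed(1)
X <- cbind(1, matrix(rnorm(40), 20, 2))   # 20 x 3 design matrix
H <- X %*% solve(crossprod(X)) %*% t(X)   # projection matrix
sum(diag(H))                              # ~ 3 (trace of a projection = its rank)
```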

(a) Show that if A and B are $m \times n$ and $n \times m$ matrices, respectively, then tr(AB) = tr(BA). (This is a well-known fact but humour me with a proof!)

(b) Suppose that V is a random vector of length n such that $E[VV^T] = I$. If S is an $n \times n$ non-random matrix, show that
$$E[V^T S V] = E[\mathrm{tr}(S V V^T)] = \mathrm{tr}(S\,E[V V^T]) = \mathrm{tr}(S)$$
and so tr(S) can be estimated by
$$\widehat{\mathrm{tr}}(S) = \frac{1}{m}\sum_{i=1}^m V_i^T S V_i$$
where $V_1, \ldots, V_m$ are independent random vectors with $E[V_i V_i^T] = I$.

(c) Suppose that the elements of each $V_i$ are independent, identically distributed random variables with mean 0 and variance 1. Show that $\mathrm{Var}(\widehat{\mathrm{tr}}(S))$ is minimized by taking the elements of $V_i$ to be $\pm 1$, each with probability 1/2.
Hint: This is easier than it looks. $\mathrm{Var}(V^T S V) = E[(V^T S V)^2] - \mathrm{tr}(S)^2$, so it suffices to minimize
$$E[(V^T S V)^2] = \sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^n\sum_{\ell=1}^n s_{ij}s_{k\ell}\,E(V_iV_jV_kV_\ell).$$
Given our conditions on the elements $V_1, \ldots, V_n$ of V, most of the terms $E(V_iV_jV_kV_\ell)$ are either 0 or 1. You should be able to show that
$$E[(V^T S V)^2] = \sum_{i=1}^n s_{ii}^2\,E(V_i^4) + \text{constant}$$
and find $V_i$ to minimize $E(V_i^4)$ subject to $E(V_i^2) = 1$.
(d) Suppose we estimate the function g in the nonparametric regression model
$$y_i = g(x_i) + \varepsilon_i \quad \text{for } i = 1, \ldots, n$$
using loess (i.e. the R function loess), where the smoothness is determined by the parameter span lying between 0 and 1. Given a set of predictors $\{x_i\}$ and a value of span, write an R function to approximate the effective number of parameters.
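One possible sketch of such a function, using the randomized trace estimator above with $\pm 1$ entries (the function and argument names eff_params, m are illustrative, not prescribed by the problem). Since loess is a linear smoother, fitting it to a response vector v yields fitted values $Sv$, so sum(v * fitted) computes $v^T S v$:

```r
# Sketch: randomized trace estimator for the loess smoothing matrix.
eff_params <- function(x, span, m = 100) {
  n <- length(x)
  estimates <- numeric(m)
  for (i in seq_len(m)) {
    v <- sample(c(-1, 1), n, replace = TRUE)  # +/-1 entries, prob 1/2 each
    fit <- loess(v ~ x, span = span)          # fitted values are S %*% v
    estimates[i] <- sum(v * fitted(fit))      # v^T S v
  }
  mean(estimates)                             # approximates tr(S)
}
```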
2. Suppose that $X_1, \ldots, X_n$ are independent Gamma random variables with common density
$$f(x; \alpha, \lambda) = \frac{\lambda^\alpha x^{\alpha-1}\exp(-\lambda x)}{\Gamma(\alpha)} \quad \text{for } x > 0$$
where $\alpha > 0$ and $\lambda > 0$ are unknown parameters.

(a) The mean and variance of the Gamma distribution are $\mu = \alpha/\lambda$ and $\sigma^2 = \alpha/\lambda^2$, respectively. Use these to define method of moments estimates of $\alpha$ and $\lambda$ based on the sample mean and variance of the data $x_1, \ldots, x_n$.
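Solving $\bar x = \alpha/\lambda$ and $s^2 = \alpha/\lambda^2$ for the two parameters gives one possible sketch in R (the function name is illustrative):

```r
# Method-of-moments sketch: solving xbar = alpha/lambda and s2 = alpha/lambda^2
# gives lambda = xbar/s2 and alpha = xbar^2/s2.
gamma_mom <- function(x) {
  xbar <- mean(x)
  s2 <- var(x)
  c(alpha = xbar^2 / s2, lambda = xbar / s2)
}
```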

(b) Derive the likelihood equations for the MLEs of $\alpha$ and $\lambda$, and derive a Newton-Raphson algorithm for computing the MLEs based on $x_1, \ldots, x_n$. Implement this algorithm in R and test it on data generated from a Gamma distribution (using the R function rgamma). Your function should also output an estimate of the variance-covariance matrix of the MLEs; this can be obtained from the Hessian of the log-likelihood function.
Important note: To implement the Newton-Raphson algorithm, you will need to compute the first and second derivatives of $\ln\Gamma(\alpha)$. These two derivatives are called (respectively) the digamma and trigamma functions, and they are available in R as digamma and trigamma; for example,

gamma(2)     # gamma function evaluated at 2
[1] 1

digamma(2)   # digamma function evaluated at 2
[1] 0.4227843

trigamma(2)  # trigamma function evaluated at 2
[1] 0.6449341
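A hedged sketch of such a Newton-Raphson implementation (function and variable names are my own; the starting values are taken from the method of moments, and the variance-covariance estimate is the inverse of the negative Hessian):

```r
# Newton-Raphson for the Gamma log-likelihood
#   l(alpha, lambda) = n*alpha*log(lambda) + (alpha-1)*sum(log x)
#                      - lambda*sum(x) - n*log(Gamma(alpha))
gamma_mle <- function(x, tol = 1e-8, maxit = 100) {
  n <- length(x); xbar <- mean(x); lbar <- mean(log(x))
  theta <- c(xbar^2 / var(x), xbar / var(x))  # method-of-moments start
  for (k in 1:maxit) {
    a <- theta[1]; l <- theta[2]
    grad <- c(n * log(l) - n * digamma(a) + n * lbar,  # d/d alpha
              n * a / l - n * xbar)                    # d/d lambda
    hess <- matrix(c(-n * trigamma(a), n / l,
                     n / l, -n * a / l^2), 2, 2)
    step <- solve(hess, grad)
    theta <- theta - step
    if (max(abs(step)) < tol) break
  }
  list(estimate = c(alpha = theta[1], lambda = theta[2]),
       vcov = solve(-hess))   # inverse observed information
}
```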
3. Consider LASSO estimation in linear regression where we define $\hat\beta_\lambda$ to minimize
$$\sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|$$
for some $\lambda > 0$. (We assume that the predictors are centred and scaled to have mean 0 and variance 1, in which case $\bar y$ is the estimate of the intercept.) Suppose that the least squares estimate (i.e. for $\lambda = 0$) is non-unique; this may occur, for example, if there is some exact linear dependence in the predictors or if $p > n$. Define
$$s^2 = \min_\beta \sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2$$
and the set
$$C = \left\{\beta : \sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 = s^2\right\}.$$
We want to look at what happens to the LASSO estimate $\hat\beta_\lambda$ as $\lambda \downarrow 0$.
(a) Show that $\hat\beta_\lambda$ minimizes
$$\frac{1}{\lambda}\left\{\sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 - s^2\right\} + \sum_{j=1}^p |\beta_j|.$$
(b) Find the limit of
$$\frac{1}{\lambda}\left\{\sum_{i=1}^n (y_i - \bar y - x_i^T\beta)^2 - s^2\right\}$$
as $\lambda \downarrow 0$ as a function of $\beta$. (What happens when $\beta \notin C$?) Use this to deduce that, as $\lambda \downarrow 0$, $\hat\beta_\lambda \to \hat\beta_0$ where $\hat\beta_0$ minimizes $\sum_{j=1}^p |\beta_j|$ on the set $C$.
(c) Show that $\hat\beta_0$ is the solution of a linear programming problem. (Hint: Note that $C$ can be expressed in terms of $\beta$ satisfying $p$ linear equations.)
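One standard reformulation along the lines of the hint (a sketch, using the fact that $\beta \in C$ exactly when $\beta$ satisfies the normal equations $X^TX\beta = X^T(y - \bar y\,\mathbf{1})$): write $\beta_j = \beta_j^+ - \beta_j^-$ with $\beta_j^+, \beta_j^- \ge 0$, so that minimizing $\sum_j |\beta_j|$ over $C$ becomes the linear program

```latex
\min_{\beta^+,\,\beta^-} \; \sum_{j=1}^p (\beta_j^+ + \beta_j^-)
\quad \text{subject to} \quad
X^T X(\beta^+ - \beta^-) = X^T(y - \bar y\,\mathbf{1}), \qquad
\beta^+ \ge 0, \; \beta^- \ge 0.
```

At an optimum, at most one of $\beta_j^+, \beta_j^-$ is nonzero for each $j$, so $\beta_j^+ + \beta_j^- = |\beta_j|$.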
4. Consider minimizing the function
$$g(x) = x^2 - 2\theta x + \lambda|x|^\gamma$$
where $\lambda > 0$ and $0 < \gamma < 1$. (This problem arises, in a somewhat more complicated form, in shrinkage estimation in regression.) The function $|x|^\gamma$ has a "cusp" at 0, which means that if $\lambda$ is sufficiently large then g is minimized at x = 0.
(a) g is minimized at x = 0 if, and only if,
$$\lambda \ge \frac{2}{2-\gamma}\left[\frac{2(1-\gamma)}{2-\gamma}\right]^{1-\gamma}|\theta|^{2-\gamma}. \qquad (1)$$
Otherwise, g is minimized at $x_*$ satisfying $g'(x_*) = 0$. Using R, compare the following two iterative algorithms for computing $x_*$ (when condition (1) does not hold):
(i) Set $x_0 = \theta$ and define
$$x_k = \theta - \frac{\lambda\gamma|x_{k-1}|^\gamma}{2x_{k-1}}, \qquad k = 1, 2, 3, \ldots$$
(ii) The Newton-Raphson algorithm with $x_0 = \theta$.
Use different values of $\theta$, $\lambda$, and $\gamma$ to test these algorithms. Which algorithm is faster?
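A sketch of both algorithms in R (function names are mine; this assumes the minimizer $x_*$ is nonzero, so neither iteration should land at the cusp):

```r
# (i) Fixed-point iteration x_k = theta - lambda*gamma*|x_{k-1}|^gamma / (2*x_{k-1})
fixed_point <- function(theta, lambda, gamma, tol = 1e-10, maxit = 1000) {
  x <- theta
  for (k in 1:maxit) {
    xnew <- theta - lambda * gamma * abs(x)^gamma / (2 * x)
    if (abs(xnew - x) < tol) return(list(x = xnew, iters = k))
    x <- xnew
  }
  list(x = x, iters = maxit)
}

# (ii) Newton-Raphson on g'(x) = 2x - 2*theta + lambda*gamma*|x|^(gamma-1)*sign(x)
newton <- function(theta, lambda, gamma, tol = 1e-10, maxit = 1000) {
  x <- theta
  for (k in 1:maxit) {
    g1 <- 2 * x - 2 * theta + lambda * gamma * abs(x)^(gamma - 1) * sign(x)
    g2 <- 2 + lambda * gamma * (gamma - 1) * abs(x)^(gamma - 2)
    xnew <- x - g1 / g2
    if (abs(xnew - x) < tol) return(list(x = xnew, iters = k))
    x <- xnew
  }
  list(x = x, iters = maxit)
}
```

Comparing the iters components over a grid of $(\theta, \lambda, \gamma)$ values is one way to answer the speed question.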

(b) Functions like g arise in so-called bridge estimation in linear regression (a generalization of the LASSO); such estimation combines the features of ridge regression (which shrinks least squares estimates towards 0) and model selection methods (which produce exact 0 estimates for some or all parameters). Bridge estimates $\hat\beta$ minimize (for some $\lambda > 0$ and $\gamma > 0$)
$$\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|^\gamma. \qquad (2)$$
See the paper by Huang, Horowitz and Ma (2008) ("Asymptotic properties of bridge estimators in sparse high-dimensional regression models", Annals of Statistics 36, 587-613) for details. Describe how the algorithms in part (a) could be used to define a coordinate descent algorithm to find $\hat\beta$ minimizing (2) iteratively, one parameter at a time.
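One way such a coordinate descent scheme might look in R (a sketch under my own naming; each coordinate update rescales the partial-residual problem into the one-dimensional form of part (a), and the inner solver combines iteration (i) with the condition-(1) threshold for an exact zero):

```r
bridge_cd <- function(X, y, lambda, gamma, maxit = 100, tol = 1e-8) {
  p <- ncol(X)
  beta <- rep(0, p)
  # One-dimensional minimizer of x^2 - 2*theta*x + lam*|x|^gam
  solve1d <- function(theta, lam, gam) {
    if (theta == 0) return(0)
    # condition (1): minimum at 0 when lam is at least this threshold
    thresh <- 2 / (2 - gam) *
      (2 * (1 - gam) / (2 - gam))^(1 - gam) * abs(theta)^(2 - gam)
    if (lam >= thresh) return(0)
    x <- theta                               # iteration (i), x_0 = theta
    for (k in 1:500) {
      xnew <- theta - lam * gam * abs(x)^gam / (2 * x)
      if (abs(xnew - x) < 1e-12) break
      x <- xnew
    }
    x
  }
  for (it in 1:maxit) {
    beta_old <- beta
    for (j in 1:p) {
      r <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual
      c_j <- sum(X[, j]^2)
      # dividing the coordinate objective by c_j gives the part (a) form
      beta[j] <- solve1d(sum(X[, j] * r) / c_j, lambda / c_j, gamma)
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}
```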
(c) Prove that g is minimized at 0 if, and only if, condition (1) in part (a) holds.

5. Suppose that A is a symmetric non-negative definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. Consider the following algorithm for computing the maximum eigenvalue $\lambda_1$: Given $x_0$, define for $k = 0, 1, 2, \ldots$,
$$x_{k+1} = \frac{Ax_k}{\|Ax_k\|_2} \quad \text{and} \quad \hat\lambda_{k+1} = \frac{x_{k+1}^T A x_{k+1}}{x_{k+1}^T x_{k+1}}.$$
Under certain conditions, $\hat\lambda_k \to \lambda_1$, the maximum eigenvalue of A; this algorithm is known as the power method and is particularly useful when A is sparse.
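A minimal R sketch of the iteration (the function name and stopping rule are illustrative):

```r
# Power method: normalize A x_k, then form the Rayleigh quotient.
power_method <- function(A, x0, maxit = 1000, tol = 1e-10) {
  x <- x0
  lam <- 0
  for (k in seq_len(maxit)) {
    Ax <- A %*% x
    x <- Ax / sqrt(sum(Ax^2))                   # x_{k+1} = A x_k / ||A x_k||_2
    lam_new <- sum(x * (A %*% x)) / sum(x * x)  # Rayleigh quotient
    if (abs(lam_new - lam) < tol) return(lam_new)
    lam <- lam_new
  }
  lam
}
```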
(a) Suppose that $v_1, \ldots, v_n$ are the eigenvectors of A corresponding to the eigenvalues $\lambda_1, \ldots, \lambda_n$. Show that $\hat\lambda_k \to \lambda_1$ if $x_0^T v_1 \ne 0$ and $\lambda_1 > \lambda_2$.

(b) What happens to the algorithm if the maximum eigenvalue is not unique, that is, $\lambda_1 = \lambda_2 = \cdots = \lambda_k$?
Gauss-Seidel algorithm to estimate $\{\theta_i\}$). Use both gradient descent and accelerated gradient descent to estimate $\{\theta_i\}$. To find an appropriate value of the step size, it is useful to approximate the maximum eigenvalue of the Hessian matrix of the objective function; the algorithm in problem 5 is useful in this regard.
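Since the objective function for this problem is not specified in the excerpt above, here is only a generic sketch of the two updates with step size 1/L, where L (the maximum eigenvalue of the Hessian) could be approximated by the power method of problem 5; grad and theta0 are illustrative placeholders:

```r
# Plain gradient descent with constant step size 1/L.
grad_descent <- function(grad, theta0, L, maxit = 1000) {
  theta <- theta0
  for (k in 1:maxit) theta <- theta - grad(theta) / L
  theta
}

# Nesterov-style accelerated gradient descent with momentum (k-1)/(k+2).
accel_descent <- function(grad, theta0, L, maxit = 1000) {
  theta <- theta0
  theta_prev <- theta0
  for (k in 1:maxit) {
    nu <- theta + (k - 1) / (k + 2) * (theta - theta_prev)  # momentum step
    theta_prev <- theta
    theta <- nu - grad(nu) / L
  }
  theta
}
```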