Instructions: Solutions to problems 1 and 2 are to be submitted on Blackboard (PDF files only). You are strongly encouraged to do problems 3-5, but these are not to be submitted for grading.
1. In lecture, we showed that we could estimate the trace of a matrix $A$ ($\text{tr}(A)$) by
$$\widehat{\text{tr}}(A) = \frac{1}{m}\sum_{i=1}^m V_i^T A V_i$$
where $V_1, \ldots, V_m$ are independent random vectors with $E[V_iV_i^T] = I$.

(a) Suppose that the elements of each $V_i$ are independent, identically distributed random variables with mean 0 and variance 1. Show that $\text{Var}(\widehat{\text{tr}}(A))$ is minimized by taking the elements of $V_i$ to be $\pm 1$, each with probability $1/2$.
Hint: This is easier than it looks. Since $\text{Var}(V^T A V) = E[(V^T A V)^2] - \text{tr}(A)^2$, it suffices to minimize
$$E[(V^T A V)^2] = \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n \sum_{\ell=1}^n a_{ij} a_{k\ell} E(V_i V_j V_k V_\ell).$$
Given our conditions on the elements $V_1, \ldots, V_n$ of $V$, most of the terms $E(V_iV_jV_kV_\ell)$ are either 0 or 1. You should be able to show that
$$E[(V^T A V)^2] = \sum_{i=1}^n a_{ii}^2 E(V_i^4) + \text{constant}$$
and find $V_i$ to minimize $E(V_i^4)$ subject to $E(V_i^2) = 1$.
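The variance comparison can be seen in a quick simulation. The sketch below (names and settings are illustrative, not from the assignment) compares the Monte Carlo variance of the trace estimator when the elements of $V$ are $\pm 1$ (Rademacher) versus standard normal; since $E(V_i^4) = 1$ for $\pm 1$ variables but $E(V_i^4) = 3$ for normals, the $\pm 1$ version should have smaller variance.

```r
set.seed(123)
n <- 20; m <- 50
A <- crossprod(matrix(rnorm(n * n), n, n))  # a symmetric test matrix

# trace estimator: average of V^T A V over m random vectors drawn by rvec()
trace_est <- function(A, m, rvec) {
  n <- nrow(A)
  mean(replicate(m, { v <- rvec(n); sum(v * (A %*% v)) }))
}

rademacher <- function(n) sample(c(-1, 1), n, replace = TRUE)
gaussian   <- function(n) rnorm(n)

est_r <- replicate(200, trace_est(A, m, rademacher))
est_g <- replicate(200, trace_est(A, m, gaussian))
c(true = sum(diag(A)), var_rademacher = var(est_r), var_gaussian = var(est_g))
```

Both estimators are unbiased for $\text{tr}(A)$; only their variances differ.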
(b) We also noted that if $B$ is a symmetric $n \times n$ matrix whose eigenvalues are less than 1 in absolute value, then we have the following formula for the determinant of the matrix $I - B$:
$$\det(I - B) = \exp\left(-\sum_{k=1}^{\infty} \frac{\text{tr}(B^k)}{k}\right)$$
and so we can estimate this determinant by
$$\widehat{\det}(I - B) = \exp\left(-\frac{1}{m}\sum_{i=1}^m \left[ V_i^T B V_i + \frac{1}{2} V_i^T B^2 V_i + \cdots + \frac{1}{r} V_i^T B^r V_i \right]\right)$$
where $r$ is "large enough".
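A small numerical check of the determinant identity may help fix ideas (this sketch is mine, not part of the assignment): for a symmetric $B$ with spectral radius below 1, the truncated series $-\sum_{k=1}^r \text{tr}(B^k)/k$ should approach $\log\det(I - B)$ as $r$ grows.

```r
set.seed(1)
n <- 5
M <- matrix(rnorm(n * n), n, n)
B <- (M + t(M)) / 2
B <- B / (2 * max(abs(eigen(B)$values)))  # scale so the spectral radius is 1/2

# partial sum of -sum_{k=1}^r tr(B^k)/k, accumulating powers of B
log_det_series <- function(B, r) {
  Bk <- diag(nrow(B))
  s <- 0
  for (k in 1:r) {
    Bk <- Bk %*% B
    s <- s + sum(diag(Bk)) / k
  }
  -s
}

exact <- as.numeric(determinant(diag(n) - B, logarithm = TRUE)$modulus)
c(exact = exact, series_r10 = log_det_series(B, 10), series_r30 = log_det_series(B, 30))
```

With spectral radius $1/2$, the truncation error decays like $2^{-r}$, so modest $r$ already suffices.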
In linear regression, one measure of the leverage of a subset of $\ell$ observations $(x_1, y_1), \ldots, (x_\ell, y_\ell)$ is $1 - \det(I - H_{11})$, where $H_{11}$ is an $\ell \times \ell$ matrix defined in terms of the linear regression "hat" matrix as follows:
$$H = X(X^T X)^{-1} X^T = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix}.$$
From above, we can estimate $\det(I - H_{11})$ by
$$\widehat{\det}(I - H_{11}) = \exp\left(-\frac{1}{m}\sum_{i=1}^m \left[ V_i^T H_{11} V_i + \frac{1}{2} V_i^T H_{11}^2 V_i + \cdots + \frac{1}{r} V_i^T H_{11}^r V_i \right]\right)$$
for some $r$.
Show that we can compute $H_{11}V$ and $H_{11}^k V$ for $k \geq 2$ using the hat matrix $H$ as follows:
$$H \begin{pmatrix} V \\ 0 \end{pmatrix} = \begin{pmatrix} H_{11}V \\ H_{21}V \end{pmatrix}$$
and
$$H \begin{pmatrix} H_{11}^{k-1} V \\ 0 \end{pmatrix} = \begin{pmatrix} H_{11}^k V \\ H_{21} H_{11}^{k-1} V \end{pmatrix}.$$
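The block identity can be exercised in code without ever forming $H$ explicitly: pad the vector with zeros, apply the hat matrix via the QR decomposition, and keep the first $\ell$ entries. The sketch below is mine (it assumes the first $\ell$ observations form the subset of interest) and checks the result against an explicitly constructed $H_{11}$ on a small example.

```r
set.seed(2)
n <- 100; p <- 4; l <- 10
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
qrX <- qr(X)

# compute H11^k %*% v by repeatedly applying H to a zero-padded vector;
# the first l entries of H %*% c(u, 0) are H11 %*% u
H11_power_v <- function(qrX, v, k, l, n) {
  u <- v
  for (j in 1:k) {
    padded <- c(u, rep(0, n - l))
    u <- qr.fitted(qrX, padded)[1:l]
  }
  u
}

v <- rnorm(l)
# check against the explicit hat matrix (feasible only for small n)
H <- X %*% solve(crossprod(X), t(X))
H11 <- H[1:l, 1:l]
err <- max(abs(H11_power_v(qrX, v, 3, l, n) - H11 %*% H11 %*% H11 %*% v))
err
```

The QR route costs $O(np)$ per application, whereas forming $H$ costs $O(n^2 p)$ and $O(n^2)$ storage, which is the point of the exercise.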
(c) On Quercus, there is a function leverage (in the file leverage.txt) that computes the leverage for a given subset of observations for a design matrix X. (This function uses the QR decomposition of X to compute $HV_i$; the relevant functions are qr, which computes the QR decomposition of X, and qr.fitted, which computes $Hy = QQ^T y$.)
Suppose that $y_i = g(x_i) + \varepsilon_i$ for $i = 1, \ldots, n$ for some smooth function $g$, and consider the following two parametric models for $g$:
$$g_1(x) = \beta_0 + \beta_1 x + \cdots + \beta_{10} x^{10}$$
and
$$g_2(x) = \beta_0 + \beta_1 \phi_1(x) + \cdots + \beta_{10} \phi_{10}(x)$$
where $\phi_1(x), \ldots, \phi_{10}(x)$ are B-spline functions. Suppose that $x_1, \ldots, x_{1000}$ are equally spaced points on $[0, 1]$ with $x_i = i/1000$. The B-spline functions and the respective design matrices can be constructed using the following R code:

x <- c(1:1000)/1000
X1 <- 1
for (i in 1:10) X1 <- cbind(X1, x^i)
library(splines)  # loads the library of functions to compute B-splines
X2 <- cbind(1, bs(x, df=10))
Note that both X1 and X2 are $1000 \times 11$ matrices. You can see the B-spline functions $\phi_1(x), \ldots, \phi_{10}(x)$ as follows:

plot(x, X2[,2])
for (i in 3:11) points(x, X2[,i])
Estimate the leverage of the points $\{x_i : x_i \leq 0.1\}$ for both designs; for which design do these points have greater leverage? To estimate the leverages for the two designs, you may want to modify the function leverage to estimate the leverages for both designs using the same values of $V_1, \ldots, V_m$ (why?).
2. Suppose that $X_1, \ldots, X_n$ are independent Gamma random variables with common density
$$f(x; \alpha, \lambda) = \frac{\lambda^\alpha x^{\alpha-1} \exp(-\lambda x)}{\Gamma(\alpha)} \quad \text{for } x > 0$$
where $\alpha > 0$ and $\lambda > 0$ are unknown parameters.

The mean and variance of the Gamma distribution are $\mu = \alpha/\lambda$ and $\sigma^2 = \alpha/\lambda^2$, respectively. Use these to define method of moments estimates of $\alpha$ and $\lambda$ based on the sample mean and variance of the data $x_1, \ldots, x_n$.

Derive the likelihood equations for the MLEs of $\alpha$ and $\lambda$, and derive a Newton-Raphson algorithm for computing the MLEs based on $x_1, \ldots, x_n$. Implement this algorithm in R and test it on data generated from a Gamma distribution (using the R function rgamma). Your function should also output an estimate of the variance-covariance matrix of the MLEs; this can be obtained from the Hessian of the log-likelihood function.
Important note: To implement the Newton-Raphson algorithm, you will need to compute the first and second derivatives of $\ln \Gamma(\alpha)$. These two derivatives are called (respectively) the digamma and trigamma functions, and they are available in R as digamma and trigamma; for example,

gamma(2)     # gamma function evaluated at 2
[1] 1
digamma(2)   # digamma function evaluated at 2
[1] 0.4227843
trigamma(2)  # trigamma function evaluated at 2
[1] 0.6449341
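As one possible shape for such a function (a sketch, not the required solution; the function name and return format are mine), the Newton-Raphson iteration can be started from the method-of-moments estimates, with the score and Hessian built from digamma and trigamma:

```r
gamma_mle <- function(x, tol = 1e-10, maxit = 100) {
  n <- length(x)
  xbar <- mean(x); s2 <- var(x)
  alpha  <- xbar^2 / s2   # method-of-moments starting values
  lambda <- xbar / s2
  for (it in 1:maxit) {
    # score vector: derivatives of the log-likelihood in alpha and lambda
    score <- c(n * log(lambda) + sum(log(x)) - n * digamma(alpha),
               n * alpha / lambda - sum(x))
    # Hessian of the log-likelihood
    hess <- matrix(c(-n * trigamma(alpha), n / lambda,
                     n / lambda, -n * alpha / lambda^2), 2, 2)
    step <- solve(hess, score)
    alpha  <- alpha - step[1]
    lambda <- lambda - step[2]
    if (max(abs(step)) < tol) break
  }
  # variance-covariance estimate from the observed information
  list(alpha = alpha, lambda = lambda, vcov = solve(-hess))
}

set.seed(42)
x <- rgamma(500, shape = 3, rate = 2)
fit <- gamma_mle(x)
fit
```

Note that rgamma's shape and rate arguments correspond to $\alpha$ and $\lambda$ in the density above, so the estimates should land near 3 and 2.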
3. Consider LASSO estimation in linear regression, where we define $\widehat{\beta}_\lambda$ to minimize
$$\sum_{i=1}^n (y_i - \bar{y} - x_i^T\beta)^2 + \lambda \sum_{j=1}^p |\beta_j|$$
for some $\lambda > 0$. (We assume that the predictors are centred and scaled to have mean 0 and variance 1, in which case $\bar{y}$ is the estimate of the intercept.) Suppose that the least squares estimate (i.e. for $\lambda = 0$) is non-unique; this may occur, for example, if there is some exact linear dependence in the predictors or if $p > n$. Define
$$\tau = \min_\beta \sum_{i=1}^n (y_i - \bar{y} - x_i^T\beta)^2$$
and the set
$$C = \left\{ \beta : \sum_{i=1}^n (y_i - \bar{y} - x_i^T\beta)^2 = \tau \right\}.$$
We want to look at what happens to the LASSO estimate $\widehat{\beta}_\lambda$ as $\lambda \downarrow 0$.
(a) Show that $\widehat{\beta}_\lambda$ minimizes
$$\frac{1}{\lambda}\left\{ \sum_{i=1}^n (y_i - \bar{y} - x_i^T\beta)^2 - \tau \right\} + \sum_{j=1}^p |\beta_j|.$$
(b) Find the limit of
$$\frac{1}{\lambda}\left\{ \sum_{i=1}^n (y_i - \bar{y} - x_i^T\beta)^2 - \tau \right\}$$
as $\lambda \downarrow 0$ as a function of $\beta$. (What happens when $\beta \notin C$?) Use this to deduce that, as $\lambda \downarrow 0$, $\widehat{\beta}_\lambda \to \widehat{\beta}_0$, where $\widehat{\beta}_0$ minimizes $\sum_{j=1}^p |\beta_j|$ on the set $C$.
Show that $\widehat{\beta}_0$ is the solution of a linear programming problem. (Hint: Note that $C$ can be expressed in terms of $\beta$ satisfying $p$ linear equations.)
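One way to read the hint (a sketch, with the reduction assuming that $C$ is characterized by the normal equations): every minimizer of the residual sum of squares satisfies $X^T(y - \bar{y}\mathbf{1} - X\beta) = 0$, which is $p$ linear equations in $\beta$. Splitting $\beta_j = u_j - v_j$ with $u_j, v_j \geq 0$, so that $|\beta_j| = u_j + v_j$ at an optimum, turns the problem into a standard linear program:

$$\min_{u, v \geq 0} \sum_{j=1}^p (u_j + v_j) \quad \text{subject to} \quad X^T X (u - v) = X^T(y - \bar{y}\mathbf{1}).$$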
4. Consider minimizing the function
$$g(x) = x^2 - 2\theta x + \lambda|x|^\gamma$$
where $\lambda > 0$ and $0 < \gamma < 1$. (This problem arises, in a somewhat more complicated form, in shrinkage estimation in regression.) The function $|x|^\gamma$ has a "cusp" at 0, which means that if $\lambda$ is sufficiently large then $g$ is minimized at $x = 0$.
(a) $g$ is minimized at $x = 0$ if, and only if,
$$\lambda \geq \left[\frac{2(1-\gamma)}{2-\gamma}\right]^{1-\gamma} \frac{2}{2-\gamma}\, |\theta|^{2-\gamma}. \quad (1)$$
Otherwise, $g$ is minimized at $x^*$ satisfying $g'(x^*) = 0$. Using R, compare the following two iterative algorithms for computing $x^*$ (when condition (1) does not hold):
(i) Set $x_0 = \theta$ and define
$$x_k = \theta - \frac{\lambda\gamma|x_{k-1}|^\gamma}{2x_{k-1}}, \quad k = 1, 2, 3, \ldots$$
(ii) The Newton-Raphson algorithm with $x_0 = \theta$.
Use different values of $\theta$, $\lambda$, and $\gamma$ to test these algorithms. Which algorithm is faster?
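Both iterations can be sketched in a few lines of R (the parameter values below are illustrative; they are chosen so that condition (1) fails and the minimizer is nonzero):

```r
theta <- 2; lambda <- 1; gam <- 0.5

# algorithm (i): the fixed-point iteration x_k = theta - lambda*gam*|x|^gam/(2x)
fixed_point <- function(theta, lambda, gam, tol = 1e-12, maxit = 1000) {
  x <- theta
  for (k in 1:maxit) {
    xnew <- theta - lambda * gam * abs(x)^gam / (2 * x)
    if (abs(xnew - x) < tol) return(list(x = xnew, iters = k))
    x <- xnew
  }
  list(x = x, iters = maxit)
}

# algorithm (ii): Newton-Raphson on g'(x) = 2x - 2*theta + lambda*gam*|x|^(gam-1)*sign(x)
newton <- function(theta, lambda, gam, tol = 1e-12, maxit = 1000) {
  x <- theta
  for (k in 1:maxit) {
    g1 <- 2 * x - 2 * theta + lambda * gam * abs(x)^(gam - 1) * sign(x)
    g2 <- 2 + lambda * gam * (gam - 1) * abs(x)^(gam - 2)
    xnew <- x - g1 / g2
    if (abs(xnew - x) < tol) return(list(x = xnew, iters = k))
    x <- xnew
  }
  list(x = x, iters = maxit)
}

fp <- fixed_point(theta, lambda, gam)
nr <- newton(theta, lambda, gam)
rbind(fixed_point = unlist(fp), newton = unlist(nr))
```

Comparing the iteration counts across several $(\theta, \lambda, \gamma)$ settings is the substance of the question; both algorithms should agree on $x^*$.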

(b) Functions like $g$ arise in so-called bridge estimation in linear regression (a generalization of the LASSO); such estimation combines the features of ridge regression (which shrinks least squares estimates towards 0) and model selection methods (which produce exact 0 estimates for some or all parameters). Bridge estimates $\widehat{\beta}$ minimize (for some $\lambda > 0$ and $\gamma > 0$)

$$\sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda \sum_{j=1}^p |\beta_j|^\gamma. \quad (2)$$
See the paper by Huang, Horowitz and Ma (2008) ("Asymptotic properties of bridge estimators in sparse high-dimensional regression models", Annals of Statistics 36, 587-613) for details. Describe how the algorithms in part (a) could be used to define a coordinate descent algorithm to find $\widehat{\beta}$ minimizing (2) iteratively, one parameter at a time.
(c) Prove that g is minimized at 0 if, and only if, condition (1) in part (a) holds.

5. Suppose that $A$ is a symmetric non-negative definite matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$. Consider the following algorithm for computing the maximum eigenvalue $\lambda_1$:

Given $x_0$, define for $k = 0, 1, 2, \ldots$,
$$x_{k+1} = \frac{Ax_k}{\|Ax_k\|_2} \quad \text{and} \quad \widehat{\lambda}_{k+1} = \frac{x_{k+1}^T A x_{k+1}}{x_{k+1}^T x_{k+1}}.$$
Under certain conditions, $\widehat{\lambda}_k \to \lambda_1$, the maximum eigenvalue of $A$; this algorithm is known as the power method and is particularly useful when $A$ is sparse.
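The iteration above can be sketched directly in R (a minimal illustration, with the stopping rule and test matrix my own choices), checking the result against R's eigen:

```r
set.seed(3)
n <- 50
A <- crossprod(matrix(rnorm(n * n), n, n))  # symmetric non-negative definite

# power method: normalize A %*% x each step and track the Rayleigh quotient
power_method <- function(A, x0, maxit = 500, tol = 1e-10) {
  x <- x0
  lam <- 0
  for (k in 1:maxit) {
    x <- drop(A %*% x)
    x <- x / sqrt(sum(x^2))
    lam_new <- sum(x * (A %*% x)) / sum(x^2)
    if (abs(lam_new - lam) < tol) break
    lam <- lam_new
  }
  lam_new
}

lam1 <- power_method(A, rnorm(n))
c(power = lam1, eigen = max(eigen(A, symmetric = TRUE)$values))
```

Only matrix-vector products with $A$ are needed, which is why the method suits large sparse matrices where a full eigendecomposition is infeasible.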
(a) Suppose that $v_1, \ldots, v_n$ are the eigenvectors of $A$ corresponding to the eigenvalues $\lambda_1, \ldots, \lambda_n$. Show that $\widehat{\lambda}_k \to \lambda_1$ if $x_0^T v_1 \neq 0$ and $\lambda_1 > \lambda_2$.

(b) What happens to the algorithm if the maximum eigenvalue is not unique, that is, $\lambda_1 = \lambda_2 = \cdots = \lambda_k$?