Name: Assignment #3 STA410H1F/2102H1F Solution
SKU: 3556
Price: 35.00 USD
Availability: InStock

Description


Instructions: Solutions to problems 1 and 2 are to be submitted on Blackboard (PDF files strongly preferred). You are strongly encouraged to do problems 3–6 but these are not to be submitted for grading.

1. Suppose that $S$ is an $n \times n$ matrix where $n$ may be very large and the elements of $S$ may not be explicitly defined. We are interested in approximating the trace of $S$, that is, the sum of its diagonal elements. For example, if $S$ is a smoothing matrix in regression ($\hat{y} = Sy$) then the trace of $S$ gives a measure of the effective number of parameters used in the smoothing method. (In multiple regression models, the smoothing matrix is the projection matrix $X(X^T X)^{-1}X^T$, whose trace is the number of columns of $X$.)

(a) Show that if $A$ and $B$ are $m \times n$ and $n \times m$ matrices, respectively, then $\mathrm{tr}(AB) = \mathrm{tr}(BA)$. (This is a well-known fact but humour me with a proof!)
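One way to organise the argument, sketched here with $a_{ij}$ and $b_{ji}$ denoting the entries of $A$ and $B$ (notation assumed for illustration), is to write both traces as the same double sum and swap the order of summation:

```latex
\mathrm{tr}(AB) = \sum_{i=1}^{m} (AB)_{ii}
  = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} b_{ji}
  = \sum_{j=1}^{n} \sum_{i=1}^{m} b_{ji} a_{ij}
  = \sum_{j=1}^{n} (BA)_{jj}
  = \mathrm{tr}(BA)
```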

(b) Suppose that $V$ is a random vector of length $n$ such that $E[VV^T] = I$. If $S$ is an $n \times n$ non-random matrix, show that

$$E[V^T S V] = E[\mathrm{tr}(S V V^T)] = \mathrm{tr}(S\,E[V V^T]) = \mathrm{tr}(S)$$

and so $\mathrm{tr}(S)$ can be estimated by

$$\widehat{\mathrm{tr}}(S) = \frac{1}{m} \sum_{i=1}^{m} V_i^T S V_i$$

where $V_1, \dots, V_m$ are independent random vectors with $E[V_i V_i^T] = I$.
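A minimal R sketch of this estimator follows. It assumes the matrix is available only through a function that returns the product of the matrix and a vector; the names `hutchinson_trace` and `apply_S` are hypothetical, and the random vectors have independent entries equal to +1 or -1 with probability 1/2, which is one choice with identity second-moment matrix.

```r
# Sketch: randomized trace estimate (1/m) * sum_i V_i' S V_i.
# apply_S is a hypothetical user-supplied function returning S %*% v.
hutchinson_trace <- function(apply_S, n, m = 100) {
  est <- numeric(m)
  for (i in seq_len(m)) {
    v <- sample(c(-1, 1), n, replace = TRUE)  # entries +/-1 with prob. 1/2
    est[i] <- sum(v * apply_S(v))             # v' S v
  }
  mean(est)
}

# Check on a small explicit matrix whose trace is known to be 15.
set.seed(1)
S <- diag(1:5)
tr_hat <- hutchinson_trace(function(v) as.vector(S %*% v), n = 5, m = 50)
```

For a diagonal matrix every draw returns the trace exactly, because the squared entries of each random vector all equal 1; for a general matrix the estimate is only unbiased and improves as `m` grows.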

(c) Suppose that the elements of each $V_i$ are independent, identically distributed random variables with mean 0 and variance 1. Show that $\mathrm{Var}(\widehat{\mathrm{tr}}(S))$ is minimized by taking the elements of $V_i$ to be $\pm 1$ each with probability $1/2$.

Hint: This is easier than it looks: $\mathrm{Var}(V^T S V) = E[(V^T S V)^2] - \mathrm{tr}(S)^2$, so it suffices to minimize

$$E[(V^T S V)^2] = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{\ell=1}^{n} s_{ij} s_{k\ell} E(V_i V_j V_k V_\ell).$$

Given our conditions on the elements of $V_i$, $V_1, \dots, V_n$, most of $E(V_i V_j V_k V_\ell)$ are either 0 or 1. You should be able to show that

$$E[(V^T S V)^2] = \sum_{i=1}^{n} s_{ii}^2 E(V_i^4) + \text{constant}$$

and find $V_i$ to minimize $E(V_i^4)$ subject to $E(V_i^2) = 1$.
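The last step of the hint can be sketched with the variance of the squared element; this is one possible route, not necessarily the intended one:

```latex
E(V_i^4) - \left[ E(V_i^2) \right]^2 = \mathrm{Var}(V_i^2) \ge 0
\quad \Longrightarrow \quad
E(V_i^4) \ge 1 ,
\qquad \text{with equality if and only if } V_i^2 = 1 \text{ with probability } 1 .
```

Combined with the mean-zero condition, equality forces $P(V_i = 1) = P(V_i = -1) = 1/2$.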

(d) Suppose we estimate the function $g$ in the non-parametric regression model

$$y_i = g(x_i) + \varepsilon_i \quad \text{for } i = 1, \dots, n$$

using loess (i.e. the R function loess) where the smoothness is determined by the parameter span lying between 0 and 1. Given a set of predictors $\{x_i\}$ and a value of span, write an R function to approximate the effective number of parameters.
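One possible shape for such a function is sketched below. It assumes the default `family = "gaussian"` fit, for which the loess fitted values are a linear function of the responses, so smoothing a random vector of +1 and -1 entries gives the smoother matrix times that vector. The function name `loess_edf` and the argument `m` (number of random vectors) are hypothetical.

```r
# Sketch: approximate the trace of the loess smoother matrix without forming it.
# Assumes the default gaussian family, so fitted values are linear in y.
loess_edf <- function(x, span, m = 100) {
  n <- length(x)
  est <- numeric(m)
  for (i in seq_len(m)) {
    v <- sample(c(-1, 1), n, replace = TRUE)  # random +/-1 "responses"
    fit <- loess(v ~ x, span = span)          # smooth the random vector
    est[i] <- sum(v * predict(fit))           # v' S v
  }
  mean(est)
}

set.seed(410)
x <- sort(runif(100))
edf_hat <- loess_edf(x, span = 0.5, m = 100)
```

Larger values of `m` reduce the Monte Carlo error at the cost of more loess fits; for small `n` the result can be compared with the trace that `loess` itself reports.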

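2. Suppose that $X_1, \dots, X_n$ are independent Gamma random variables with common density

$$f(x; \alpha, \lambda) = \frac{\lambda^{\alpha} x^{\alpha - 1} \exp(-\lambda x)}{\Gamma(\alpha)} \quad \text{for } x > 0$$

where $\alpha > 0$ and $\lambda > 0$ are unknown parameters.

(a) The mean and variance of the Gamma distribution are $\alpha/\lambda$ and $\alpha/\lambda^2$, respectively. Use these to define method of moments estimates of $\alpha$ and $\lambda$ based on the sample mean and variance of the data $x_1, \dots, x_n$.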
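As a small illustration, writing the shape as `alpha` and the rate as `lambda` (so that the mean is alpha/lambda and the variance is alpha/lambda^2), matching the two sample moments gives lambda = mean/variance and alpha = mean^2/variance. The function name `gamma_mom` is hypothetical, and the sketch uses R's `var`, which divides by n - 1.

```r
# Sketch: method of moments estimates for the Gamma shape and rate.
gamma_mom <- function(x) {
  xbar <- mean(x)
  s2 <- var(x)                 # sample variance (n - 1 denominator)
  c(alpha = xbar^2 / s2, lambda = xbar / s2)
}

mom <- gamma_mom(c(1, 2, 3, 4, 5))  # mean 3, variance 2.5
```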

(b) Derive the likelihood equations for the MLEs of $\alpha$ and $\lambda$ and derive a Newton-Raphson algorithm for computing the MLEs based on $x_1, \dots, x_n$. Implement this algorithm in R and test on data generated from a Gamma distribution (using the R function rgamma). Your function should also output an estimate of the variance-covariance matrix of the MLEs; this can be obtained from the Hessian of the log-likelihood function.

Important note: To implement the Newton-Raphson algorithm, you will need to compute the first and second derivatives of $\ln \Gamma(\alpha)$. These two derivatives are called (respectively) the digamma and trigamma functions, and these functions are available in R as digamma and trigamma; for example,

gamma(2) # gamma function evaluated at 2
[1] 1

digamma(2) # digamma function evaluated at 2
[1] 0.4227843

trigamma(2) # trigamma function evaluated at 2
[1] 0.6449341
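A minimal sketch of such a Newton-Raphson routine is given below, under the assumption that the density is parameterized by a shape `alpha` and a rate `lambda`. The function name `gamma_mle`, the method of moments starting values, and the stopping rule are illustrative choices; there is no safeguard (such as step halving) against a step that leaves the positive quadrant.

```r
# Sketch: Newton-Raphson for the Gamma MLEs (shape alpha, rate lambda).
gamma_mle <- function(x, tol = 1e-10, max_iter = 100) {
  n <- length(x)
  sum_x <- sum(x)
  sum_log_x <- sum(log(x))
  # Hessian of the log-likelihood in (alpha, lambda)
  hessian <- function(a, l) {
    matrix(c(-n * trigamma(a), n / l,
             n / l, -n * a / l^2), nrow = 2)
  }
  # Start from the method of moments estimates
  theta <- c(alpha = mean(x)^2 / var(x), lambda = mean(x) / var(x))
  for (iter in seq_len(max_iter)) {
    a <- theta[["alpha"]]
    l <- theta[["lambda"]]
    # Score: derivatives of the log-likelihood w.r.t. alpha and lambda
    score <- c(n * log(l) - n * digamma(a) + sum_log_x,
               n * a / l - sum_x)
    step <- solve(hessian(a, l), score)
    theta <- theta - step                     # Newton-Raphson update
    if (max(abs(step)) < tol) break
  }
  H <- hessian(theta[["alpha"]], theta[["lambda"]])
  # Inverse of the negative Hessian estimates the variance-covariance matrix
  list(estimate = theta, vcov = solve(-H), iterations = iter)
}

set.seed(1)
x <- rgamma(1000, shape = 2, rate = 3)
fit <- gamma_mle(x)
```

At convergence the second likelihood equation implies that the estimated rate equals the estimated shape divided by the sample mean, which gives a quick sanity check on the output.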

Supplemental problems:

3. Consider LASSO estimation in linear regression where we define $\hat\beta_\lambda$ to minimize

$$\sum_{i=1}^{n} (y_i - \bar{y} - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

for some $\lambda > 0$. (We assume that the predictors are centred and scaled to have mean 0 and variance 1, in which case $\bar{y}$ is the estimate of the intercept.) Suppose that the least squares estimate (i.e. for $\lambda = 0$) is non-unique; this may occur, for example, if there is some exact linear dependence in the predictors or if $p > n$. Define

$$\tau = \min_{\beta} \sum_{i=1}^{n} (y_i - \bar{y} - x_i^T \beta)^2$$

and the set

$$C = \left\{ \beta : \sum_{i=1}^{n} (y_i - \bar{y} - x_i^T \beta)^2 = \tau \right\}.$$

We want to look at what happens to the LASSO estimate $\hat\beta_\lambda$ as $\lambda \downarrow 0$.

(a) Show that $\hat\beta_\lambda$ minimizes

$$\frac{1}{\lambda} \left\{ \sum_{i=1}^{n} (y_i - \bar{y} - x_i^T \beta)^2 - \tau \right\} + \sum_{j=1}^{p} |\beta_j|.$$

(b) Find the limit of

$$\frac{1}{\lambda} \left\{ \sum_{i=1}^{n} (y_i - \bar{y} - x_i^T \beta)^2 - \tau \right\}$$

as $\lambda \downarrow 0$ as a function of $\beta$. (What happens when $\beta \notin C$?) Use this to deduce that as $\lambda \downarrow 0$, $\hat\beta_\lambda \to \hat\beta_0$ where $\hat\beta_0$ minimizes $\sum_{j=1}^{p} |\beta_j|$ on the set $C$.

(c) Show that $\hat\beta_0$ is the solution of a linear programming problem. (Hint: Note that $C$ can be expressed in terms of $\beta$ satisfying $p$ linear equations.)
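A sketch of one such formulation follows. It assumes $X$ denotes the $n \times p$ matrix of centred predictors and $\mathbf{1}$ a vector of ones (notation introduced here), uses the fact that the least squares minimizers are exactly the solutions of the normal equations, and splits each coefficient into positive and negative parts $\beta_j = \beta_j^{+} - \beta_j^{-}$:

```latex
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{p} \left( \beta_j^{+} + \beta_j^{-} \right) \\
\text{subject to} \quad & X^T X \left( \beta^{+} - \beta^{-} \right) = X^T \left( y - \bar{y}\,\mathbf{1} \right), \\
& \beta_j^{+} \ge 0, \quad \beta_j^{-} \ge 0 \quad (j = 1, \dots, p).
\end{aligned}
```

At an optimum at most one of $\beta_j^{+}$ and $\beta_j^{-}$ is non-zero, so the objective equals $\sum_j |\beta_j|$.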

4. Consider minimizing the function

$$g(x) = x^2 - 2\alpha x + \lambda |x|^{\gamma}$$

where $\lambda > 0$ and $0 < \gamma < 1$. (This problem arises, in a somewhat more complicated form, in shrinkage estimation in regression.) The function $|x|^{\gamma}$ has a "cusp" at 0, which means that if $\lambda$ is sufficiently large then $g$ is minimized at $x = 0$.

(a) $g$ is minimized at $x = 0$ if, and only if,

$$\lambda \ge \frac{2}{2 - \gamma} \left[ \frac{2 - 2\gamma}{2 - \gamma} \right]^{1 - \gamma} |\alpha|^{2 - \gamma}. \qquad (1)$$

Otherwise, $g$ is minimized at $x^*$ satisfying $g'(x^*) = 0$. Using R, compare the following two iterative algorithms for computing $x^*$ (when condition (1) does not hold):

(i) Set $x_0 = \alpha$ and define

$$x_k = \alpha - \frac{\lambda \gamma}{2} \frac{|x_{k-1}|^{\gamma}}{x_{k-1}}, \qquad k = 1, 2, 3, \dots$$

(ii) The Newton-Raphson algorithm with $x_0 = \alpha$.

Use different values of $\alpha$, $\gamma$, and $\lambda$ to test these algorithms. Which algorithm is faster?
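A hedged sketch of such a comparison is given below for the function g(x) = x^2 - 2*alpha*x + lambda*|x|^gam, where the exponent is called `gam` to avoid clashing with R's `gamma` function. The helper names (`g1`, `g2`, `iterate`), the parameter values, and the stopping rule are illustrative assumptions.

```r
# First and second derivatives of g(x) = x^2 - 2*alpha*x + lambda*|x|^gam
g1 <- function(x, alpha, lambda, gam) {
  2 * x - 2 * alpha + lambda * gam * sign(x) * abs(x)^(gam - 1)
}
g2 <- function(x, alpha, lambda, gam) {
  2 + lambda * gam * (gam - 1) * abs(x)^(gam - 2)
}

# Generic iteration: stop when successive iterates differ by less than tol
iterate <- function(update, x0, tol = 1e-10, max_iter = 1000) {
  x <- x0
  for (k in seq_len(max_iter)) {
    x_new <- update(x)
    if (abs(x_new - x) < tol) return(list(x = x_new, iter = k))
    x <- x_new
  }
  list(x = x, iter = max_iter)
}

alpha <- 2; lambda <- 1; gam <- 0.5
# Right-hand side of condition (1); lambda below it means a non-zero minimizer
threshold <- 2 / (2 - gam) * ((2 - 2 * gam) / (2 - gam))^(1 - gam) *
  abs(alpha)^(2 - gam)

# (i) fixed-point iteration, (ii) Newton-Raphson, both started at alpha
fixed <- iterate(function(x) alpha - (lambda * gam / 2) * abs(x)^gam / x, alpha)
newton <- iterate(function(x) x - g1(x, alpha, lambda, gam) /
                    g2(x, alpha, lambda, gam), alpha)
```

Comparing `fixed$iter` with `newton$iter` over a grid of parameter values gives a rough picture of the relative speed; the fixed-point iteration converges linearly, while Newton-Raphson converges quadratically near the solution.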

(b) Functions like $g$ arise in so-called bridge estimation in linear regression (which is a generalization of the LASSO); such estimation combines the features of ridge regression (which shrinks least squares estimates towards 0) and model selection methods (which produce exact 0 estimates for some or all parameters). Bridge estimates $\hat\beta$ minimize (for some $\gamma > 0$ and $\lambda > 0$),

$$\sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{\gamma}. \qquad (2)$$

See the paper by Huang, Horowitz and Ma (2008) ("Asymptotic properties of bridge estimators in sparse high-dimensional regression models", Annals of Statistics 36, 587–613) for details. Describe how the algorithms in part (a) could be used to define a coordinate descent algorithm to find $\hat\beta$ minimizing (2) iteratively one parameter at a time.
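One way to see the connection, sketched with notation introduced here ($x_{ij}$ for the $j$-th element of $x_i$, $r_i$ for the partial residual that leaves out predictor $j$, and $c_j$, $a_j$ as defined below), is to hold every coefficient except $\beta_j$ fixed and complete the square:

```latex
\begin{aligned}
r_i &= y_i - \sum_{k \ne j} x_{ik} \beta_k, \qquad
c_j = \sum_{i=1}^{n} x_{ij}^2, \qquad
a_j = \frac{1}{c_j} \sum_{i=1}^{n} x_{ij} r_i, \\
\sum_{i=1}^{n} (r_i - x_{ij} \beta_j)^2 + \lambda |\beta_j|^{\gamma}
  &= c_j \left\{ \beta_j^2 - 2 a_j \beta_j + \frac{\lambda}{c_j} |\beta_j|^{\gamma} \right\} + \text{constant}.
\end{aligned}
```

The term in braces has the form of $g$ with $a_j$ in the role of the linear coefficient and $\lambda / c_j$ in the role of the penalty weight, so for $0 < \gamma < 1$ each coordinate update could check condition (1) and otherwise run one of the two iterations, cycling over $j = 1, \dots, p$ until the coefficients stop changing.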

5. Suppose that $A$ is a symmetric non-negative definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. Consider the following algorithm for computing the maximum eigenvalue $\lambda_1$:

Given $x_0$, define for $k = 0, 1, 2, \dots$,

$$x_{k+1} = \frac{A x_k}{\|A x_k\|_2} \quad \text{and} \quad \mu_{k+1} = \frac{x_{k+1}^T A x_{k+1}}{x_{k+1}^T x_{k+1}}.$$

Under certain conditions, $\mu_k \to \lambda_1$, the maximum eigenvalue of $A$; this algorithm is known as the power method and is particularly useful when $A$ is sparse.

(a) Suppose that $v_1, \dots, v_n$ are the eigenvectors of $A$ corresponding to the eigenvalues $\lambda_1, \dots, \lambda_n$. Show that $\mu_k \to \lambda_1$ if $x_0^T v_1 \ne 0$ and $\lambda_1 > \lambda_2$.

(b) What happens to the algorithm if the maximum eigenvalue is not unique, that is, $\lambda_1 = \lambda_2 = \cdots = \lambda_k$?
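The iteration can be sketched in R as follows. The function name `power_method`, the stopping rule on successive eigenvalue estimates, and the small test matrix are assumptions made for illustration.

```r
# Sketch: power method for the largest eigenvalue of a symmetric
# non-negative definite matrix A, starting from x0.
power_method <- function(A, x0, tol = 1e-12, max_iter = 10000) {
  x <- x0
  mu <- Inf
  for (k in seq_len(max_iter)) {
    Ax <- as.vector(A %*% x)
    x <- Ax / sqrt(sum(Ax^2))                   # x_{k+1} = A x_k / ||A x_k||_2
    mu_new <- sum(x * (A %*% x)) / sum(x^2)     # Rayleigh quotient at x_{k+1}
    converged <- abs(mu_new - mu) < tol
    mu <- mu_new
    if (converged) break
  }
  list(value = mu, vector = x, iterations = k)
}

# Tridiagonal example with eigenvalues 2 + sqrt(2), 2 and 2 - sqrt(2)
A <- matrix(c(2, 1, 0,
              1, 2, 1,
              0, 1, 2), nrow = 3)
pm <- power_method(A, x0 = c(1, 1, 1))
```

Only matrix-vector products with `A` are needed, which is why the method suits sparse matrices; the result can be checked against `eigen(A)$values[1]` on small examples.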

6. Consider the estimation procedure in problem 2 of Assignment #2 (where we used the Gauss-Seidel algorithm to estimate $\{\theta_i\}$). Use both gradient descent and accelerated gradient descent to estimate $\{\theta_i\}$. To find an appropriate value of the step-size parameter, it is useful to approximate the maximum eigenvalue of the Hessian matrix of the objective function; the algorithm in problem 5 is useful in this regard.
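The objective function from Assignment #2 is not reproduced here, so the sketch below uses a stand-in quadratic objective, f(x) = x'Ax/2 - b'x with a small symmetric positive definite `A`, purely as an assumption for illustration. The step size `eps` is set to one over the largest eigenvalue of the Hessian (computed here with `eigen`, though a power-method approximation would serve the same purpose), and the accelerated version uses the common Nesterov momentum weight (k - 1)/(k + 2), which is also an assumed choice.

```r
# Stand-in problem (assumption): minimize f(x) = 0.5 * x' A x - b' x
A <- matrix(c(2, 1, 0,
              1, 2, 1,
              0, 1, 2), nrow = 3)
b <- c(1, 2, 3)
grad <- function(x) as.vector(A %*% x) - b

# Step size from the maximum eigenvalue of the Hessian (here A itself)
L <- max(eigen(A, symmetric = TRUE, only.values = TRUE)$values)
eps <- 1 / L

# Plain gradient descent
gd <- function(x0, n_iter) {
  x <- x0
  for (k in seq_len(n_iter)) x <- x - eps * grad(x)
  x
}

# Accelerated (Nesterov-type) gradient descent
agd <- function(x0, n_iter) {
  x <- x_prev <- x0
  for (k in seq_len(n_iter)) {
    y <- x + ((k - 1) / (k + 2)) * (x - x_prev)  # momentum (extrapolation) step
    x_prev <- x
    x <- y - eps * grad(y)                        # gradient step taken at y
  }
  x
}

x_gd <- gd(rep(0, 3), 500)
x_agd <- agd(rep(0, 3), 500)
x_exact <- solve(A, b)   # exact minimizer, for comparison only
```

For the actual assignment problem, `grad` and the eigenvalue bound would be replaced by those of the real objective.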
