Instructions: Solutions to problems 1 and 2 are to be submitted on Blackboard (PDF files
strongly preferred).
1. Consider the model
$$Y_i = \theta_i + \varepsilon_i \qquad (i = 1, \ldots, n)$$
where $\{\varepsilon_i\}$ is a sequence of random variables with mean 0 and finite variance representing noise. As in Assignment #2, we will assume that $\theta_1, \ldots, \theta_n$ are dependent or "smooth" in the sense that the absolute differences $\{|\theta_i - \theta_{i-1}|\}$ are small for most values of $i$. Rather than penalizing a lack of smoothness by $\sum_i (\theta_i - \theta_{i-1})^2$ (as in Assignment #2), we will consider estimating $\{\theta_i\}$ given data $\{y_i\}$ by minimizing
$$\sum_{i=1}^n (y_i - \theta_i)^2 + \lambda \sum_{i=2}^n |\theta_i - \theta_{i-1}| \qquad (1)$$
where $\lambda > 0$ is a tuning parameter. The resulting estimates $\hat\theta_1, \ldots, \hat\theta_n$ are called fusion estimates.
The Gauss-Seidel algorithm used in Assignment #2 is effectively a coordinate descent algorithm. In computing the fusion estimates minimizing (1), however, application of the coordinate descent algorithm is frustrated to some extent by the fact that the non-differentiable penalty in (1) is not separable. This can be easily fixed by defining $\phi_i = \theta_i - \theta_{i-1}$ for $i = 2, \ldots, n$ and then minimizing
$$\sum_{i=1}^n (y_i - \theta_i)^2 + \lambda \sum_{i=2}^n |\phi_i| \qquad (2)$$
where now each $\theta_i$ (for $i = 2, \ldots, n$) will be a function of $\theta_1, \phi_2, \ldots, \phi_i$.
(a) Show that $\theta_k = \theta_1 + \sum_{i=2}^k \phi_i$ for $k \ge 2$.
(b) For fixed values of $\phi_2, \ldots, \phi_n$, show that the objective function is minimized at
$$\theta_1 = \frac{1}{n} \sum_{i=1}^n \left( y_i - \sum_{j=2}^i \phi_j \right).$$
(c) If we fix the values of $\theta_1$ and $\{\phi_i : 2 \le i \le n,\ i \ne j\}$, show that the subgradient of the objective function (2) with respect to $\phi_j$ is
$$-2 \sum_{i=j}^n \left( y_i - \theta_1 - \sum_{k=2}^i \phi_k \right) + \lambda\,\partial|\phi_j|
= 2(n - j + 1)\phi_j - 2 \sum_{i=j}^n \left( y_i - \theta_1 - \sum_{k=2,\,k \ne j}^i \phi_k \right) + \lambda\,\partial|\phi_j|$$
where
$$\partial|\phi_j| = \begin{cases} +1 & \text{if } \phi_j > 0 \\ [-1, 1] & \text{if } \phi_j = 0 \\ -1 & \text{if } \phi_j < 0 \end{cases}$$
and hence the objective function is minimized over $\phi_j$ for fixed values of $\theta_1$ and $\{\phi_i : 2 \le i \le n,\ i \ne j\}$ at
$$\phi_j = 0 \quad \text{if} \quad \left| \sum_{i=j}^n \left( y_i - \theta_1 - \sum_{k=2,\,k \ne j}^i \phi_k \right) \right| \le \frac{\lambda}{2}$$
$$\phi_j = \frac{1}{n - j + 1} \left\{ \sum_{i=j}^n \left( y_i - \theta_1 - \sum_{k=2,\,k \ne j}^i \phi_k \right) - \frac{\lambda}{2} \right\} \quad \text{if} \quad \sum_{i=j}^n \left( y_i - \theta_1 - \sum_{k=2,\,k \ne j}^i \phi_k \right) > \frac{\lambda}{2}$$
$$\phi_j = \frac{1}{n - j + 1} \left\{ \sum_{i=j}^n \left( y_i - \theta_1 - \sum_{k=2,\,k \ne j}^i \phi_k \right) + \frac{\lambda}{2} \right\} \quad \text{if} \quad \sum_{i=j}^n \left( y_i - \theta_1 - \sum_{k=2,\,k \ne j}^i \phi_k \right) < -\frac{\lambda}{2}$$
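The three cases above amount to soft-thresholding the partial residual sum. As a quick illustration (this helper and its names are ours, not part of the assignment):

```r
# Soft-thresholded update for phi_j from part (c), where
# r = sum_{i=j}^n (y_i - theta_1 - sum_{k != j} phi_k) and m = n - j + 1.
soft_update <- function(r, m, lambda) {
  if (abs(r) <= lambda / 2) 0 else (r - sign(r) * lambda / 2) / m
}
```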
(d) [BONUS – but highly recommended!] Write a function in R implementing the coordinate
descent algorithm suggested in parts (b) and (c) and test it on the data generated as in
Assignment #2. (Convergence of the coordinate descent algorithm here is very slow but you
should be able to produce good estimates of the underlying function.)
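A minimal sketch of such a function, assuming the updates from parts (b) and (c) (the function and variable names are illustrative, and the fixed iteration count is a crude stand-in for a real convergence check):

```r
# Coordinate descent for the fusion estimates minimizing (2).
fusion_cd <- function(y, lambda, niter = 200) {
  n <- length(y)
  phi <- numeric(n)              # phi[1] stays 0; phi[j] = theta_j - theta_{j-1}
  theta1 <- mean(y)
  for (it in 1:niter) {
    # part (b): update theta1 with phi fixed
    theta1 <- mean(y - cumsum(phi))
    # part (c): soft-threshold update for each phi_j in turn
    for (j in 2:n) {
      phi[j] <- 0
      r <- sum(y[j:n] - theta1 - cumsum(phi)[j:n])
      m <- n - j + 1
      phi[j] <- if (abs(r) <= lambda / 2) 0 else (r - sign(r) * lambda / 2) / m
    }
  }
  theta1 + cumsum(phi)           # theta_k = theta_1 + sum_{i=2}^k phi_i, part (a)
}
```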
2. The Hidalgo stamp data is a (semi-)famous dataset containing the thicknesses of 482 postage stamps from the 1872 Mexican "Hidalgo" issue. It is believed that these stamps were printed on different types of paper, so the data can be modeled as a "mixture" of several distributions. The data are available in the file stamp.txt on Blackboard.
There is some consensus that a good model for these data is a mixture of normals; that is, the density of the data is
$$f(x) = \sum_{k=1}^m \lambda_k f_k(x)$$
where $f_k$ is the density of a normal distribution with unknown mean $\mu_k$ and variance $\sigma_k^2$, and the $\{\lambda_k\}$ are non-negative with $\lambda_1 + \cdots + \lambda_m = 1$.
(a) The "complete" data likelihood here is
$$\mathcal{L}(\mu_1, \ldots, \mu_m, \sigma_1^2, \ldots, \sigma_m^2, \lambda_1, \ldots, \lambda_m) = \prod_{i=1}^n \prod_{k=1}^m \left\{ f_k(x_i)^{u_{ik}} \lambda_k^{u_{ik}} \right\}$$
where each $u_{ik} = 0$ or $1$ with $u_{i1} + \cdots + u_{im} = 1$ for $i = 1, \ldots, n$. Show that the complete data MLEs are
$$\hat\mu_k = \left( \sum_{i=1}^n u_{ik} \right)^{-1} \sum_{i=1}^n u_{ik}\, x_i$$
$$\hat\sigma_k^2 = \left( \sum_{i=1}^n u_{ik} \right)^{-1} \sum_{i=1}^n u_{ik} (x_i - \hat\mu_k)^2$$
$$\hat\lambda_k = \frac{1}{n} \sum_{i=1}^n u_{ik}.$$
For the EM algorithm, we need to estimate the unobserved $\{u_{ik}\}$; for given $\{\lambda_k\}$ and $\{f_k\}$ (which depend on the means and variances), these estimates are
$$\hat u_{ik} = \frac{\lambda_k f_k(x_i)}{\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)}.$$
(You do not need to prove this.)
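Combining the complete-data MLEs above (with $u_{ik}$ replaced by $\hat u_{ik}$) with this E-step gives one EM iteration. A minimal sketch, assuming equal-length parameter vectors (the function name em_step is ours, only for illustration):

```r
# One EM iteration for a normal mixture: E-step weights, then the
# weighted versions of the complete-data MLEs as the M-step.
em_step <- function(x, lambda, mu, sigma2) {
  m <- length(lambda)
  # E-step: u[i, k] proportional to lambda_k * f_k(x_i), rows sum to 1
  u <- sapply(1:m, function(k) lambda[k] * dnorm(x, mu[k], sqrt(sigma2[k])))
  u <- u / rowSums(u)
  # M-step: weighted complete-data MLEs
  nk <- colSums(u)
  mu.new <- colSums(u * x) / nk
  sigma2.new <- sapply(1:m, function(k) sum(u[, k] * (x - mu.new[k])^2) / nk[k])
  list(lambda = nk / length(x), mu = mu.new, sigma2 = sigma2.new)
}
```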
(b) Write an R function that uses the EM algorithm to compute the MLEs of $\lambda_1, \ldots, \lambda_m$, $\mu_1, \ldots, \mu_m$, and $\sigma_1^2, \ldots, \sigma_m^2$. Your function should take as inputs initial estimates of these unknown parameters. (A template function will be provided.)
(c) Using your function, estimate the parameters in the normal mixture model with m = 5, 6, and 7 components. The success of the EM algorithm for mixture models depends on having reasonable initial estimates of the parameters. One simple ad hoc approach is to use a kernel density estimate where the bandwidth parameter is varied so that the estimate has the appropriate number of modes (corresponding to the different components). This can be done using the R function density: use the default kernel (which is a Gaussian kernel) and change the bandwidth using the parameter bw; for example, if the data are in stamp then plot(density(stamp,bw=0.002)) will give a plot of the density estimate for a bandwidth of 0.002.
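One way to automate this step is to read the local maxima of the density estimate off its grid and use them as initial means (the helper find_modes is our own, not a standard R function):

```r
# Locate the modes of a kernel density estimate: a mode is a grid
# point where the slope of the estimate changes sign from + to -.
find_modes <- function(x, bw) {
  d <- density(x, bw = bw)      # Gaussian kernel by default
  d$x[which(diff(sign(diff(d$y))) == -2) + 1]
}
# e.g. find_modes(stamp, 0.002) for the stamp thickness data
```

Varying bw until find_modes returns m values then gives rough initial means for the m-component fit.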
(d) Which of the three models considered in part (c) do you prefer, and why?