Theory Assignment 4 Solution

$30.00

Description

Instructions

Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named ‘Theory Assignment 4’ set up in which you can submit your files. Please be sure to follow all directions outlined here.

You can submit multiple times, but only the last submission counts. That means if you finish some prob-lems and want to submit something first and update later when you finish, that’s fine. In fact you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (re-member Murphy’s Law), you still get credit for whatever you have turned in.

Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submis-sion must be a single PDF. It is strongly recommended that you typeset with LATEX. There are many free integrated LATEX editors that are convenient to use (e.g Overleaf, ShareLaTeX). Choose the one(s) you like the most. This tutorial Getting to Grips with LaTeX is a good start if you do not know how to use LATEX yet.

Please also follow the rules below:

  • The file should be named as firstname lastname USCID.pdf e.g., Don Quijote de la Mancha 8675309045.pdf).

  • Do not have any spaces in your file name when uploading it.

  • Please include your name and USCID in the header of your report as well.

Collaboration: You may discuss with your classmates. However, you need to write your own solutions and submit separately. Also in your report, you need to list with whom you have discussed for each problem. Please consult the syllabus for what is and is not acceptable collaboration. Review the rules on academic conduct in the syllabus: a single instance of plagiarism can adversely affect you significantly more than you could stand to gain.

Notes on notation:

  • Unless stated otherwise, scalars are denoted by small letter in normal font, vectors are denoted by small letters in bold font and matrices are denoted by capital letters in bold font.

  • k.k means L2-norm unless specified otherwise i.e. k.k = k.k2

  • Please always reduce your solution to the simplest form.

1

Problem 1 Optimization over the simplex (25 points)

Many machine learning problems, especially clustering problems, involve optimization over a simplex. Thus, in this exercise, you will prove two optimization results over the simplex that we used multiple times in the lectures. Recall that a K 1 dimensional simplex D is defined as

K

  • = fq 2 RK j qk 0, 8k and å qk = 1g,

k=1

which means that a K 1 dimensional simplex has K non-negative entries, and the sum of all K entries is

  1. The property that a simplex sums to one coincides with a probability distribution of a discrete random variable of K possible outcomes, and is thus frequently used in clustering problems.

1.1 Given a1, . . . , aK 2 R+, solve the following optimization over simplex (find the optimal value of qk.)

(12 points)

K

arg max å ak ln qk,

  • k=1

s.t. qk 0,

K

å qk = 1.

k=1

1.2 Given b1, . . . , bK 2 R, solve the following optimization problem

(13 points)

K

arg max å (qkbk qk ln qk) ,

q2D k=1

s.t. qk 0,

K

  • qk = 1.

k=1

2

Problem 2 Gaussian Mixture Model and EM (30 points)

2.1 In the lecture we applied EM to learn Gaussian Mixture Models (GMMs) and showed the M-Step without a proof. Now it is time that you prove it.

Consider a GMM with the following PDF of xn:

K

N

j

K

(

p2p)DjSkj2

2

k=1

k=1

p(xn) = å wk

(xn

mk, Sk) = å

wk

exp

1

(xn

mk)T Sk

1(xn mk)

1

where K 2 N is the number of Gaussian components, D 2 N is dimension of a data point xn. This GMM has K tuples of model parameters (mk, Sk, wk), which standards for the mean vector, covariance matrix, and component weight of the k-th Gaussian component. jSj denotes the determinant of matrix S.

For simplicity, we further assume that all components are isotropic Gaussian, i.e., Sk = sk2 I. Find the MLE of the expected complete log-likelihood. Equivalently, find the optimal solution to the following optimiza-tion problem.

arg max åågnk ln wk + åågnk ln N (xn j mk, Sk)

wk,mk,Sk n k n k

s.t. wk 0

K

  • wk = 1

k=1

where gnk is the posterior of latent variables computed from the E-Step. Hint: you can make use of the

result from Problem 1.1. (15 points)

2.2 In the lecture we derived EM through a lower bound of the log-likelihood function. Specifically, we find the tightest lower bound by solving

  • i

arg max Ezn qn ln p(xn, zn ; q(t)) Ezn qn [ln qn] .

qn 2D

Find the optimal solution to this optimization problem using the results from Problem 1.2. (The solution was already given in the lecture, but here we require you to derive it.)

(5 points)

2.3 The posterior probability of z in GMM can be seen as a soft assignment to the clusters; In contrast, K-means assign each data point to one cluster at each iteration (hard assignment). Describe how to set the model parameters such that the GMM in the previous question reduces to a K-means; please also derive p(zn = kjxn) in this case.

(10 points)

3

Problem 3 Hidden Markov Model (45 points)

Recall that an HMM is parameterized as follows:

  • initial state distribution P(X1 = s) = ps

  • transition distribution P(Xt+1 = s0 j Xt = s) = as,s0

  • emission distribution P(Ot = o j Xt = s) = bs,o

3.1 Suppose we observe a sequence of outcomes o1, . . . , oT and wish to predict the next state XT+1

P(XT+1 = s j O1:T = o1:T ).

Denote the forward message as

as(T) = P(XT = s, O1:T = o1:T ).

Please derive how P(XT+1 = s j O1:T = o1:T ) can be represented using as(T).

3.2 Given an HMM with the following probabilities:

Transition probabilities:

Emission probabilities:

Current

Next

Word

Start

A

B

End

Start

0

0.7

0.3

0

State:

‘’

f ight0

on0

A

0

0.2

0.7

0.1

Start

1

0

0

B

0

0.7

0.2

0.1

A

0

0.4

0.6

End

0

0

0

1

B

0

0.7

0.3

We assume that the process stops at state ‘End’.

  1. Suppose that the process starts from state ‘Start’ at t = 0 and that we observe o1:2 = f ight on , write down the forward messages as(2) and determine the most likely state at t = 3 by computing the probability for each state. Round your numbers to 4 decimal points.

(7 points)

  1. Suppose that the process starts from state ‘Start’ at t = 0 and that we observe the whole output se-quence as f ight on on , what is the most likely sequence of states that produce this? Please show your derivation step by step.

(8 points)

3.3 [Unrelated to the above.] Describe how to set the model parameters such that an HMM reduces to the GMM described in Problem 2. In addition, derive the posterior probability P(X2 = s j O1 = o1, O2 = o2) using the parameters you set to show that the HMM really reduces to GMM. Hint: compare your result

with the posterior p(zjx) of GMM in Problem 2.2. (20 points)

4