Instructions
Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named ‘Theory Assignment 4’ set up in which you can submit your files. Please be sure to follow all directions outlined here.
You can submit multiple times, but only the last submission counts. If you finish some problems early, you can submit them first and update your submission later. In fact, you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (remember Murphy's Law), you still get credit for whatever you have already turned in.
Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submission must be a single PDF. It is strongly recommended that you typeset with LaTeX. There are many free integrated LaTeX editors that are convenient to use (e.g., Overleaf, ShareLaTeX). Choose the one you like the most. The tutorial Getting to Grips with LaTeX is a good start if you do not know how to use LaTeX yet.
Please also follow the rules below:

The file should be named firstname_lastname_USCID.pdf (e.g., Don_Quijote_de_la_Mancha_8675309045.pdf).

Do not have any spaces in your file name when uploading it.

Please include your name and USCID in the header of your report as well.
Collaboration: You may discuss with your classmates. However, you must write your own solutions and submit them separately. In your report, you also need to list, for each problem, the classmates with whom you discussed it. Please consult the syllabus for what is and is not acceptable collaboration. Review the rules on academic conduct in the syllabus: a single instance of plagiarism can adversely affect you significantly more than you could stand to gain.
Notes on notation:

Unless stated otherwise, scalars are denoted by lowercase letters in normal font, vectors are denoted by lowercase letters in bold font, and matrices are denoted by capital letters in bold font.

$\|\cdot\|$ denotes the $L_2$ norm unless specified otherwise, i.e., $\|\cdot\| = \|\cdot\|_{2}$.

Please always reduce your solution to the simplest form.
Problem 1 Optimization over the simplex (25 points)
Many machine learning problems, especially clustering problems, involve optimization over a simplex. Thus, in this exercise, you will prove two optimization results over the simplex that we used multiple times in the lectures. Recall that a $(K-1)$-dimensional simplex $\Delta$ is defined as
$$\Delta = \left\{ q \in \mathbb{R}^{K} \;\middle|\; q_k \ge 0 \;\forall k \text{ and } \sum_{k=1}^{K} q_k = 1 \right\},$$
which means that a point on a $(K-1)$-dimensional simplex has $K$ nonnegative entries, and the sum of all $K$ entries is $1$.
The property that a simplex sums to one coincides with a probability distribution of a discrete random variable with $K$ possible outcomes, and is thus frequently used in clustering problems.
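The simplex constraint is easy to verify numerically. As an illustration (a minimal sketch with a helper name of our own, not part of the assignment), the following checks whether a vector lies on the simplex:

```python
import numpy as np

def on_simplex(q, tol=1e-9):
    """Check that q has nonnegative entries summing to 1 (up to tolerance)."""
    q = np.asarray(q, dtype=float)
    return bool(np.all(q >= -tol) and abs(q.sum() - 1.0) <= tol)

# A categorical distribution over K = 3 outcomes lies on the 2-dimensional simplex.
print(on_simplex([0.2, 0.3, 0.5]))    # True
print(on_simplex([0.6, 0.6, -0.2]))   # False: sums to 1 but has a negative entry
```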
1.1 Given $a_1, \dots, a_K \in \mathbb{R}^{+}$, solve the following optimization over the simplex (find the optimal value of $q_k$). (12 points)
$$\arg\max_{q} \; \sum_{k=1}^{K} a_k \ln q_k, \qquad \text{s.t. } q_k \ge 0, \;\; \sum_{k=1}^{K} q_k = 1.$$

1.2 Given $b_1, \dots, b_K \in \mathbb{R}$, solve the following optimization problem. (13 points)
$$\arg\max_{q \in \Delta} \; \sum_{k=1}^{K} \left( q_k b_k - q_k \ln q_k \right), \qquad \text{s.t. } q_k \ge 0, \;\; \sum_{k=1}^{K} q_k = 1.$$
Problem 2 Gaussian Mixture Model and EM (30 points)
2.1 In the lecture we applied EM to learn Gaussian Mixture Models (GMMs) and showed the M-Step without a proof. Now it is time that you prove it.
Consider a GMM with the following PDF of $x_n$:
$$p(x_n) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} \frac{w_k}{(\sqrt{2\pi})^{D}\, |\Sigma_k|^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (x_n - \mu_k)^{T} \Sigma_k^{-1} (x_n - \mu_k) \right),$$
where $K \in \mathbb{N}$ is the number of Gaussian components and $D \in \mathbb{N}$ is the dimension of a data point $x_n$. This GMM has $K$ tuples of model parameters $(\mu_k, \Sigma_k, w_k)$, which stand for the mean vector, covariance matrix, and component weight of the $k$-th Gaussian component. $|\Sigma|$ denotes the determinant of the matrix $\Sigma$.
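As a sanity check on the density above (a minimal sketch of our own, not required for the assignment), the mixture PDF can be evaluated directly with NumPy:

```python
import numpy as np

def gmm_pdf(x, weights, means, covs):
    """Evaluate p(x) = sum_k w_k N(x | mu_k, Sigma_k) for a single point x."""
    x = np.asarray(x, dtype=float)
    D = x.shape[0]
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        quad = diff @ np.linalg.solve(cov, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
        norm = (np.sqrt(2 * np.pi) ** D) * np.sqrt(np.linalg.det(cov))
        total += w * np.exp(-0.5 * quad) / norm
    return total

# Two isotropic components in D = 2 (covariances of the form sigma_k^2 I).
weights = [0.5, 0.5]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 4.0 * np.eye(2)]
print(gmm_pdf(np.zeros(2), weights, means, covs))
```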
For simplicity, we further assume that all components are isotropic Gaussians, i.e., $\Sigma_k = \sigma_k^2 I$. Find the MLE of the expected complete log-likelihood. Equivalently, find the optimal solution to the following optimization problem:
$$\arg\max_{w_k, \mu_k, \Sigma_k} \; \sum_{n} \sum_{k} \gamma_{nk} \ln w_k + \sum_{n} \sum_{k} \gamma_{nk} \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
$$\text{s.t. } w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1,$$
where $\gamma_{nk}$ is the posterior of the latent variables computed from the E-Step. Hint: you can make use of the result from Problem 1.1. (15 points)
2.2 In the lecture we derived EM through a lower bound of the log-likelihood function. Specifically, we find the tightest lower bound by solving
$$\arg\max_{q_n \in \Delta} \; \mathbb{E}_{z_n \sim q_n}\!\left[ \ln p(x_n, z_n ; \theta^{(t)}) \right] - \mathbb{E}_{z_n \sim q_n}\!\left[ \ln q_n \right].$$
Find the optimal solution to this optimization problem using the results from Problem 1.2. (The solution was already given in the lecture, but here we require you to derive it.) (5 points)
2.3 The posterior probability of $z$ in a GMM can be seen as a soft assignment to the clusters; in contrast, K-means assigns each data point to exactly one cluster at each iteration (hard assignment). Describe how to set the model parameters such that the GMM in the previous question reduces to K-means; please also derive $p(z_n = k \mid x_n)$ in this case. (10 points)
Problem 3 Hidden Markov Model (45 points)
Recall that an HMM is parameterized as follows:

initial state distribution $P(X_1 = s) = \pi_s$,

transition distribution $P(X_{t+1} = s' \mid X_t = s) = a_{s,s'}$,

emission distribution $P(O_t = o \mid X_t = s) = b_{s,o}$.
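As a reference point, the forward messages $\alpha_s(t)$ used below are commonly computed with the recursion $\alpha_s(1) = \pi_s\, b_{s,o_1}$ and $\alpha_s(t+1) = b_{s,o_{t+1}} \sum_{s'} \alpha_{s'}(t)\, a_{s',s}$. Here is a minimal sketch on a made-up toy HMM (numbers of our own, not the tables from Problem 3.2):

```python
import numpy as np

def forward_messages(pi, A, B, obs):
    """alpha[t-1, s] = P(X_t = s, O_{1:t} = o_{1:t}) for t = 1..T.

    pi:  (S,) initial state distribution
    A:   (S, S) transition matrix, A[s, s'] = P(next = s' | current = s)
    B:   (S, O) emission matrix, B[s, o] = P(observe o | state = s)
    obs: list of observation indices o_1, ..., o_T
    """
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# Toy 2-state, 2-symbol HMM with invented parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])
alpha = forward_messages(pi, A, B, obs=[0, 1])
print(alpha[-1])  # alpha_s(2) for each state s
```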
3.1 Suppose we observe a sequence of outcomes $o_1, \dots, o_T$ and wish to predict the next state $X_{T+1}$, i.e., compute
$$P(X_{T+1} = s \mid O_{1:T} = o_{1:T}).$$
Denote the forward message as
$$\alpha_s(T) = P(X_T = s, O_{1:T} = o_{1:T}).$$
Please derive how $P(X_{T+1} = s \mid O_{1:T} = o_{1:T})$ can be represented using $\alpha_s(T)$.
3.2 Given an HMM with the following probabilities:

Transition probabilities (row = current state, column = next state):

Current   Start     A       B     End
Start       0      0.7     0.3     0
A           0      0.2     0.7    0.1
B           0      0.7     0.2    0.1
End         0       0       0      1

Emission probabilities (row = state, column = emitted word):

State      ''    'fight'   'on'
Start       1       0        0
A           0      0.4      0.6
B           0      0.7      0.3
We assume that the process stops at state ‘End’.

Suppose that the process starts from state 'Start' at $t = 0$ and that we observe $o_{1:2} = $ 'fight on'. Write down the forward messages $\alpha_s(2)$ and determine the most likely state at $t = 3$ by computing the probability for each state. Round your numbers to 4 decimal places.
(7 points)

Suppose that the process starts from state 'Start' at $t = 0$ and that we observe the whole output sequence as 'fight on on'. What is the most likely sequence of states that produces this? Please show your derivation step by step.
(8 points)
3.3 [Unrelated to the above.] Describe how to set the model parameters such that an HMM reduces to the GMM described in Problem 2. In addition, derive the posterior probability $P(X_2 = s \mid O_1 = o_1, O_2 = o_2)$ using the parameters you set, to show that the HMM really reduces to a GMM. Hint: compare your result with the posterior $p(z \mid x)$ of the GMM in Problem 2.2. (20 points)