$30.00
Description
Objectives:
In this assignment, you will implement learning and inference procedures for some of the probabilistic models described in class, apply your solutions to some simulated datasets, and analyze the results.
General Note:

Full points are given for complete solutions, including justifying the choices or assumptions you made to solve the question. Both complete source code and program outputs should be included in the nal submission.

Homework assignments are to be solved in the assigned groups of two. You are encouraged to discuss the assignment with other students, but you must solve it within your own group. Make sure to be closely involved in all aspects of the assignment.

There are 3 starter les attached, helper.py, starter kmeans.py and starter gmm.py which will help you with your implementation.

Kmeans [9 pt.]
Kmeans clustering is one of the most widely used data analysis algorithms. It is used to summarize data by discovering a set of data prototypes that represent clusters of data. The data prototypes are usually referred to as cluster centers. Usually, Kmeans clustering proceeds by alternating between assigning data points to clusters and then updating the cluster centers. In this assignment, we will investigate a di erent learning algorithm that directly minimizes the Kmeans clustering loss function.
1
1.1 Learning Kmeans 2 MIXTURES OF GAUSSIANS [16 PT.]
1.1 Learning Kmeans
The K cluster centers can be thought of as K, Ddimensional parameter vectors and we can place them in a K D parameter matrix , where the k^{th} row of the matrix denotes the k^{th} cluster center _{k}. The goal of Kmeans clustering is to learn such that it minimizes the loss
function, L( ) = 
N 
_{k}k_{2}^{2}, where N is the number of the training observations. 

_{n=1} ^{min}_{k}^{K}_{=1} k^{x}n 

Even though the 
loss function is not smooth due to the \min” operation, one may still be able 

P 
to nd its solutions through iterative gradientbased optimization. The \min” operation leads to discontinuous derivatives, in a way that is similar to the e ect of the ReLU activation function, but nonetheless, a good gradientbased optimizer can work e ectively.

Implement the distanceFunc() function in the starter kmeans.py le to calculate the squared pairwise distance for all pair of N data points and K clusters.
For the dataset data2D.npy, set K = 3 and nd the Kmeans clusters by minimizing the L( ) using the gradient descent optimizer. The parameters should be initialized by sampling from the standard normal distribution. Include a plot of the loss vs the number of updates. Hints: you may want to use the Adam optimizer for this assignment with following hyperparameter tf.train.AdamOptimizer(learning rate=0.1, beta1=0.9, beta2=0.99, epsilon=1e5). The learning should converge within a few hundred updates.

Run the algorithm with K = 1; 2; 3; 4; 5 and for each of these values of K, compute and report the percentage of the data points belonging to each of the K clusters. Comment on how many clusters you think is \best” and why? (To answer this, it may be helpful discuss this value in the context of a 2D scatter plot of the data.) Include the 2D scatter plot of data points colored by their cluster assignments.

Hold 1/3 of the data out for validation. For each value of K above, cluster the training data and then compute and report the loss for the validation data. How many clusters do you think is best?

Mixtures of Gaussians [16 pt.]
Mixtures of Gaussians (MoG) can be interpreted as a probabilistic version of Kmeans clustering. For each data vector, MoG uses a latent variable z to represent the cluster assign
ment and uses a joint probability model of the cluster assignment variable and the data vector: P (x; z) = P (z)P (x j z). For N IID training cases, we have P (X; z) = ^{Q}^{N}_{n=1} P (x_{n}; z_{n}). The ExpectationMaximization (EM) algorithm is the most commonly used technique to learn a MoG.
Like the standard Kmeans clustering algorithm, the EM algorithm alternates between updating the cluster assignment variables and the cluster parameters. What makes it di erent is that instead of making hard assignments of data vectors to cluster centers (the \min” operation above),
the EM algorithm computes probabilities for di erent cluster centers, P (zjx). These are computed from P (z = kjx) = P (x; z = k)= ^{P}^{K}_{j=1} P (x; z = j).
2
2.1 The Gaussian cluster mode [7 pt.] 2 MIXTURES OF GAUSSIANS [16 PT.]
While the ExpectationMaximization (EM) algorithm is typically the goto learning algorithm to train MoG and is guaranteed to converge to a local optimum, it su ers from slow convergence. In this assignment, we will explore a di erent learning algorithm that makes use of gradient descent.
2.1 The Gaussian cluster mode [7 pt.]
Each of the K mixture components in the MoG model occurs with probability ^{k} = P (z = k). The data model is a multivariate Gaussian distribution centered at the cluster mean (data center) ^{k} 2 R^{D}. We will consider a MoG model where it is assumed that for the multivariate Gaussian for cluster k, di erent data dimensions are independent and have the same standard deviation, ^{k}.

Modify the Kmeans distance function implemented in 1.1 to compute the log probability density function for cluster k: log N (x ; ^{k}; ^{k2}) for all pair of N data points and K clusters. Include the snippets of the Python code.

Write a vectorized Tensor ow Python function that computes the log probability of the cluster variable z given the data vector x: log P (zjx). The log Gaussian pdf function implemented above should come in handy. The implementation should use the function reduce logsumexp() provided in the helper functions le. Include the snippets of the Python code and comment on why it is important to use the logsumexp function instead of using tf.reduce sum.
2.2 Learning the MoG [9 pt.]
The marginal data likelihood for the MoG model is as follows (here \marginal” refers to summing over the cluster assignment variables):

N K

Y X

P(X) =
P (x_{n}) =
P (z_{n} = k)P (x_{n} j z_{n} = k)
n=1
n=1 k=1
= ^{Y}_{n}
X_{k}
^{k}N (x_{n} ; ^{k}; ^{k2})
The loss function we will minimize is the negative log likelihood L( ; ; ) = log P (X). The maximum likelihood estimate (MLE) is a set of the model parameters ; ; that maximize the log likelihood or, equivalently, minimize the negative log likelihood.
1. Implement the loss function using logsumexp function and perform MLE by directly optimizing the log likelihood function using gradient descent in Tensor ow. Note that the standard deviation has the constraint of 2 [0; 1). One way to deal with this constraint is to replace ^{2} with exp( ) in the math and the software, where is an unconstrained parameter. In addition, has a simplex constraint, that is ^{P}_{k} ^{k} = 1. We can again replace this constrain with unconstrained parameter through a softmax function ^{k} = exp( ^{k})= ^{P}_{k}_{0} exp( ^{k}^{0} ). A logsoftmax function, logsoftmax, is provided for convenience in the helper functions le. For the dataset data2D.npy, set K = 3 and report the best model parameters it has learnt. Include a plot of the loss vs the number of updates.
3
2.2 Learning the MoG [9 pt.] 2 MIXTURES OF GAUSSIANS [16 PT.]

Hold out 1/3 of the data for validation and for each value of K = f1; 2; 3; 4; 5g, train a MoG model. For each K, compute and report the loss function for the validation data and explain which value of K is best. Include a 2D scatter plot of data points colored by their cluster assignments.

Run both the Kmeans and the MoG learning algorithms on data100D.npy for K = f5; 10; 15; 20; 30g (Hold out 1/3 of the data for validation). Comment on how many clusters you think are within the dataset by looking at the MoG validation loss and Kmeans validation loss. Compare the learnt results of Kmeans and MoG.
4