Assignment 3 Unsupervised Learning and Probabilistic Models Solution




In this assignment, you will implement learning and inference procedures for some of the proba-bilistic models described in class, apply your solutions to some simulated datasets, and analyze the results.

General Note:

  • Full points are given for complete solutions, including justifying the choices or assumptions you made to solve the question. Both complete source code and program outputs should be included in the nal submission.

  • Homework assignments are to be solved in the assigned groups of two. You are encouraged to discuss the assignment with other students, but you must solve it within your own group. Make sure to be closely involved in all aspects of the assignment.

  • There are 3 starter les attached,, starter and starter which will help you with your implementation.

  • K-means [9 pt.]

K-means clustering is one of the most widely used data analysis algorithms. It is used to summarize data by discovering a set of data prototypes that represent clusters of data. The data prototypes are usually referred to as cluster centers. Usually, K-means clustering proceeds by alternating between assigning data points to clusters and then updating the cluster centers. In this assignment, we will investigate a di erent learning algorithm that directly minimizes the K-means clustering loss function.


1.1 Learning K-means 2 MIXTURES OF GAUSSIANS [16 PT.]

1.1 Learning K-means

The K cluster centers can be thought of as K, D-dimensional parameter vectors and we can place them in a K D parameter matrix , where the kth row of the matrix denotes the kth cluster center k. The goal of K-means clustering is to learn such that it minimizes the loss

function, L( ) =


kk22, where N is the number of the training observations.

n=1 minkK=1 kxn

Even though the

loss function is not smooth due to the \min” operation, one may still be able


to nd its solutions through iterative gradient-based optimization. The \min” operation leads to discontinuous derivatives, in a way that is similar to the e ect of the ReLU activation function, but nonetheless, a good gradient-based optimizer can work e ectively.

  1. Implement the distanceFunc() function in the starter le to calculate the squared pairwise distance for all pair of N data points and K clusters.

For the dataset data2D.npy, set K = 3 and nd the K-means clusters by minimizing the L( ) using the gradient descent optimizer. The parameters should be initialized by sampling from the standard normal distribution. Include a plot of the loss vs the number of updates. Hints: you may want to use the Adam optimizer for this assignment with fol-lowing hyper-parameter tf.train.AdamOptimizer(learning rate=0.1, beta1=0.9, beta2=0.99, epsilon=1e-5). The learning should converge within a few hundred updates.

  1. Run the algorithm with K = 1; 2; 3; 4; 5 and for each of these values of K, compute and report the percentage of the data points belonging to each of the K clusters. Comment on how many clusters you think is \best” and why? (To answer this, it may be helpful discuss this value in the context of a 2D scatter plot of the data.) Include the 2D scatter plot of data points colored by their cluster assignments.

  1. Hold 1/3 of the data out for validation. For each value of K above, cluster the training data and then compute and report the loss for the validation data. How many clusters do you think is best?

  • Mixtures of Gaussians [16 pt.]

Mixtures of Gaussians (MoG) can be interpreted as a probabilistic version of K-means clus-tering. For each data vector, MoG uses a latent variable z to represent the cluster assign-

ment and uses a joint probability model of the cluster assignment variable and the data vec-tor: P (x; z) = P (z)P (x j z). For N IID training cases, we have P (X; z) = QNn=1 P (xn; zn). The Expectation-Maximization (EM) algorithm is the most commonly used technique to learn a MoG.

Like the standard K-means clustering algorithm, the EM algorithm alternates between updating the cluster assignment variables and the cluster parameters. What makes it di erent is that in-stead of making hard assignments of data vectors to cluster centers (the \min” operation above),

the EM algorithm computes probabilities for di erent cluster centers, P (zjx). These are computed from P (z = kjx) = P (x; z = k)= PKj=1 P (x; z = j).


2.1 The Gaussian cluster mode [7 pt.] 2 MIXTURES OF GAUSSIANS [16 PT.]

While the Expectation-Maximization (EM) algorithm is typically the go-to learning algorithm to train MoG and is guaranteed to converge to a local optimum, it su ers from slow convergence. In this assignment, we will explore a di erent learning algorithm that makes use of gradient descent.

2.1 The Gaussian cluster mode [7 pt.]

Each of the K mixture components in the MoG model occurs with probability k = P (z = k). The data model is a multivariate Gaussian distribution centered at the cluster mean (data center) k 2 RD. We will consider a MoG model where it is assumed that for the multivariate Gaussian for cluster k, di erent data dimensions are independent and have the same standard deviation, k.

  1. Modify the K-means distance function implemented in 1.1 to compute the log probability density function for cluster k: log N (x ; k; k2) for all pair of N data points and K clusters. Include the snippets of the Python code.

  1. Write a vectorized Tensor ow Python function that computes the log probability of the clus-ter variable z given the data vector x: log P (zjx). The log Gaussian pdf function implemented above should come in handy. The implementation should use the function reduce logsumexp() provided in the helper functions le. Include the snippets of the Python code and comment on why it is important to use the log-sum-exp function instead of using tf.reduce sum.

2.2 Learning the MoG [9 pt.]

The marginal data likelihood for the MoG model is as follows (here \marginal” refers to summing over the cluster assignment variables):

  • N K

  • Y X

P(X) =

P (xn) =

P (zn = k)P (xn j zn = k)


n=1 k=1

= Yn


kN (xn ; k; k2)

The loss function we will minimize is the negative log likelihood L( ; ; ) = log P (X). The maximum likelihood estimate (MLE) is a set of the model parameters ; ; that maximize the log likelihood or, equivalently, minimize the negative log likelihood.

1. Implement the loss function using log-sum-exp function and perform MLE by directly opti-mizing the log likelihood function using gradient descent in Tensor ow. Note that the stan-dard deviation has the constraint of 2 [0; 1). One way to deal with this constraint is to re-place 2 with exp( ) in the math and the software, where is an unconstrained parameter. In addition, has a simplex constraint, that is Pk k = 1. We can again replace this constrain with unconstrained parameter through a softmax function k = exp( k)= Pk0 exp( k0 ). A log-softmax function, logsoftmax, is provided for convenience in the helper functions le. For the dataset data2D.npy, set K = 3 and report the best model parameters it has learnt. Include a plot of the loss vs the number of updates.


2.2 Learning the MoG [9 pt.] 2 MIXTURES OF GAUSSIANS [16 PT.]

  1. Hold out 1/3 of the data for validation and for each value of K = f1; 2; 3; 4; 5g, train a MoG model. For each K, compute and report the loss function for the validation data and explain which value of K is best. Include a 2D scatter plot of data points colored by their cluster assignments.

  1. Run both the K-means and the MoG learning algorithms on data100D.npy for K = f5; 10; 15; 20; 30g (Hold out 1/3 of the data for validation). Comment on how many clusters you think are within the dataset by looking at the MoG validation loss and K-means validation loss. Compare the learnt results of K-means and MoG.