HOMEWORK 2 – V1 Solution


In this assignment, we'll fit both generative and discriminative models to the MNIST dataset of handwritten numbers. Each datapoint in the MNIST [http://yann.lecun.com/exdb/mnist/] dataset is a 28×28 black-and-white image of a number in {0, ..., 9}, and a label indicating which number.

MNIST is the 'fruit fly' of machine learning – a simple, standard problem useful for comparing the properties of different algorithms. Python and R code for loading and plotting MNIST is attached.

You can use whichever programming language you like, and libraries for loading and plotting data. You'll need to write your own initialization, fitting, and prediction code. You can use automatic differentiation in your code, but you must still answer the gradient questions.

For this assignment, we'll binarize the dataset, converting the grey pixel values to either black or white (0 or 1), with > 0.5 as the cutoff. When comparing models, we'll need a training and a test set. Use the first 10000 samples for training, and another 10000 for testing. This is all done for you in the starter code. Hint: Also build a dataset of only 100 training samples to use when debugging, to make loading and training faster.
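For concreteness, here is a minimal Python sketch of the binarization and split described above. The array shapes and names (an (N, 784) float array `images` with grey values in [0, 1] and an (N,) integer array `labels`) are assumptions about what the starter code loads, not its actual interface.

```python
import numpy as np

def binarize_and_split(images, labels, n_train=10000, n_test=10000, n_debug=100):
    """Binarize grey values at 0.5 and carve out train / test / debug subsets."""
    x = (images > 0.5).astype(np.int8)                         # grey value > 0.5 -> 1, else 0
    x_train, t_train = x[:n_train], labels[:n_train]
    x_test, t_test = x[n_train:n_train + n_test], labels[n_train:n_train + n_test]
    x_debug, t_debug = x_train[:n_debug], t_train[:n_debug]    # tiny subset for fast debugging
    return (x_train, t_train), (x_test, t_test), (x_debug, t_debug)
```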

1. Fitting a Naïve Bayes Model, 25 pts. In this question, we'll fit a naïve Bayes model to the MNIST digits using maximum likelihood. Naïve Bayes defines the joint probability of each datapoint x and its class label c as follows:

$$p(x, c \mid \theta, \pi) = p(c \mid \pi)\, p(x \mid c, \theta_c) = p(c \mid \pi) \prod_{d=1}^{784} p(x_d \mid c, \theta_{cd})$$

Here, $\theta$ is a matrix of probabilities for each pixel and each class, so its total size is 784 × 10.

For binary data, we can use the Bernoulli likelihood:

$$p(x_d \mid c, \theta_{cd}) = \mathrm{Ber}(x_d \mid \theta_{cd}) = \theta_{cd}^{x_d} (1 - \theta_{cd})^{(1 - x_d)}$$

which is just a way of expressing that $p(x_d = 1 \mid c, \theta_{cd}) = \theta_{cd}$.

For $p(c \mid \pi)$, we can just use a categorical distribution:

$$p(c \mid \pi) = \mathrm{Cat}(c \mid \pi) = \pi_c$$

Note that we need $\sum_{i=0}^{9} \pi_i = 1$.


(a) Derive the maximum likelihood estimate (MLE) for the class-conditional pixel means $\theta$. Hint: We saw in lecture that the MLE can be thought of as 'counts' of the data, so what should $\hat{\theta}_{cd}$ be counting?

(b) Derive the maximum a posteriori (MAP) estimate for the class-conditional pixel means $\theta$, using a Beta(2, 2) prior on each $\theta_{cd}$. Hint: it has a simple final form, and you can ignore the Beta normalizing constant.

(c) Fit $\theta$ to the training set using the MAP estimator. Plot $\theta$ as 10 separate greyscale images, one for each class.

(d) Derive the log-likelihood $\log p(c \mid x, \theta, \pi)$ for a single training image.

(e) Given parameters $\hat{\theta}$ fit to the training set, and $\pi_c = \frac{1}{10}$, report both the average log-likelihood per datapoint, $\frac{1}{N} \sum_{i=1}^{N} \log p(c_i \mid x_i, \theta, \pi)$, and the accuracy on both the training and the test set. The accuracy is defined as the fraction of examples where the true class $t = \operatorname{argmax}_c p(c \mid x, \theta, \pi)$. (A short fit-and-evaluate sketch follows this question.)
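As a rough guide for parts (c) and (e), here is a hedged NumPy sketch of fitting $\theta$ with the MAP estimator and scoring a dataset. The +1/+2 pseudo-counts reflect one reading of the Beta(2, 2) prior, and you should still derive the estimator yourself in part (b); `x` is assumed to be an (N, 784) binary array, `t` an (N,) label array, and `pi` a length-10 probability vector (e.g. uniform).

```python
import numpy as np
from scipy.special import logsumexp

def fit_theta_map(x_train, t_train, n_classes=10):
    """MAP estimate of theta (n_classes x 784) under a Beta(2, 2) prior on each theta_cd."""
    theta = np.zeros((n_classes, x_train.shape[1]))
    for c in range(n_classes):
        xc = x_train[t_train == c]
        theta[c] = (xc.sum(axis=0) + 1.0) / (xc.shape[0] + 2.0)   # pseudo-counts from the prior
    return theta

def avg_log_likelihood_and_accuracy(x, t, theta, pi):
    """Average log p(c_i | x_i, theta, pi) and classification accuracy."""
    log_lik = x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T   # (N, 10): log p(x | c)
    log_joint = log_lik + np.log(pi)                                # + log p(c)
    log_post = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
    avg_ll = np.mean(log_post[np.arange(len(t)), t])
    accuracy = np.mean(np.argmax(log_post, axis=1) == t)
    return avg_ll, accuracy
```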

2. Generating from a Naïve Bayes Model, 25 pts. Defining a joint probabilistic model over the data lets us generate new data, and also lets us answer all sorts of queries about the data.

(a) True or false: Given this model's assumptions, any two pixels $x_i$ and $x_j$ where $i \neq j$ are independent given c.

(b) True or false: Given this model's assumptions, any two pixels $x_i$ and $x_j$ where $i \neq j$ are independent when marginalizing over c.

(c) Using the parameters $\hat{\theta}$ fit in question 1, produce random image samples from the model. That is, randomly sample and plot 10 binary images from the marginal distribution $p(x \mid \theta, \pi)$. Hint: first sample c. (See the sampling sketch after this question.)

(d) One of the advantages of generative models is that they can handle missing data, or be used to answer different sorts of questions about the model. Derive $p(x_{i \in \mathrm{bottom}} \mid x_{\mathrm{top}}, \theta, \pi)$, the marginal distribution of a single pixel in the bottom half of an image given the top half, conditioned on your fit parameters. Hint: you'll have to marginalize over c.

(e) For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half, i.e. the bottom half of the image should use greyscale to represent the marginal probability of each pixel being on.
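A possible sketch for parts (c) and (d), assuming the 28×28 images are flattened row-major so that the first 392 entries are the top half; it marginalizes over c numerically rather than giving the closed-form derivation asked for in (d).

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def sample_images(theta, pi, n_samples=10):
    """Ancestral sampling from p(x | theta, pi): first draw c, then each pixel independently."""
    cs = rng.choice(len(pi), size=n_samples, p=pi)
    return (rng.random((n_samples, theta.shape[1])) < theta[cs]).astype(np.int8)

def bottom_half_marginals(x_top, theta, pi):
    """p(x_d = 1 | x_top, theta, pi) for every bottom-half pixel d, marginalizing over c."""
    top, bottom = theta[:, :392], theta[:, 392:]                   # (10, 392) each
    log_post_c = (x_top @ np.log(top).T
                  + (1 - x_top) @ np.log(1 - top).T
                  + np.log(pi))                                    # log p(x_top | c) + log p(c)
    log_post_c -= logsumexp(log_post_c, axis=1, keepdims=True)     # log p(c | x_top)
    return np.exp(log_post_c) @ bottom                             # (N, 392) marginal probabilities
```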

3. Logistic Regression, 25 pts. Now we'll fit a simple predictive model using gradient descent. Our model will be multiclass logistic regression:

$$p(c \mid x, w) = \frac{\exp(w_c^T x)}{\sum_{c'=0}^{9} \exp(w_{c'}^T x)}$$

Omit bias parameters for this question.

(a) How many parameters does this model have?

(b) Derive the gradient of the predictive log-likelihood w.r.t. w: $\nabla_w \log p(c \mid x, w)$.

(c) Code up a gradient-based optimizer of your choosing (it can be just vanilla gradient descent), and use it to optimize w to maximize the log-likelihood of the training set; plot the resulting parameters using one image per class. Since this objective is concave, you can initialize at all zeros. You can use the gradient ascent updates (since it is a maximization) as covered in class, or use automatic differentiation (autograd). However, you are not permitted to use optimizers which come built in to packages! Hint: Use scipy.logsumexp or its equivalent to make your code more numerically stable. Also, try to avoid nested for loops, and instead use matrix operations to keep your code fast. (A minimal gradient-ascent sketch follows this question.)

(d) Given parameters fit to the training set, report both the average log-likelihood per datapoint and the accuracy on both the training and test set. How does it compare to naïve Bayes?
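One possible shape of a solution for parts (c) and (d), using plain batch gradient ascent (no built-in optimizers) and logsumexp for numerical stability. The learning rate and iteration count are arbitrary placeholders, and the gradient used here is the standard softmax-regression gradient that part (b) asks you to derive yourself.

```python
import numpy as np
from scipy.special import logsumexp

def log_softmax(logits):
    return logits - logsumexp(logits, axis=1, keepdims=True)

def avg_log_likelihood(w, x, t):
    """Mean log p(t_i | x_i, w) for weights w of shape (10, 784)."""
    return np.mean(log_softmax(x @ w.T)[np.arange(len(t)), t])

def gradient(w, x, t):
    """Gradient of the mean predictive log-likelihood with respect to w."""
    p = np.exp(log_softmax(x @ w.T))          # (N, 10) predicted class probabilities
    one_hot = np.eye(w.shape[0])[t]           # (N, 10) true labels
    return (one_hot - p).T @ x / len(t)       # (10, 784)

def fit_logistic(x, t, lr=0.1, n_steps=500):
    w = np.zeros((10, x.shape[1]))            # concave objective, so zeros are a fine start
    for _ in range(n_steps):
        w += lr * gradient(w, x, t)           # vanilla gradient ascent
    return w
```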

4. Unsupervised Learning, 25 pts. In this problem, we will experiment with two clustering algorithms: (i) k-means, and (ii) the EM algorithm for Gaussian mixtures. In what follows, N denotes the number of data points, D denotes the dimension of each data point, $x_i \in \mathbb{R}^D$ denotes a single data point, K denotes the number of clusters, and $x_i \sim p(x \mid C_k)$ for $k = 1, \ldots, K$ denotes the data-generating distribution.

(a) First, we will generate some data for this problem. Set the number of points N = 400, their dimension D = 2, and the number of clusters K = 2, and generate data from the distribution $p(x \mid C_k) = \mathcal{N}(\mu_k, \Sigma_k)$. Sample 200 data points for $C_1$ and 200 for $C_2$, with

$$\mu_1 = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} 6.0 \\ 0.1 \end{bmatrix} \quad \text{and} \quad \Sigma_1 = \Sigma_2 = \begin{bmatrix} 10 & 7 \\ 7 & 10 \end{bmatrix}$$

Here, N = 400. Since you generated the data, you already know which sample comes from which class. Make a scatter plot of the data points showing the true cluster assignment of each point, either using different color codes or shapes or both.

(b) Now, we assume that the true class labels are not known. Implement the k-means algorithm for this problem. Write three functions: cost, km_e_step, and km_m_step as given in the lecture (here, km_ means k-means). Identify the correct arguments, and the order in which to run them. Initialize the algorithm with

$$\hat{\mu}_1 = \begin{bmatrix} 0.0 \\ 0.0 \end{bmatrix}, \qquad \hat{\mu}_2 = \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix} \tag{4.1}$$

and run it until convergence. Show the resulting cluster assignments on a scatter plot, either using different color codes or shapes or both. Also plot the cost vs. the number of iterations. Report your misclassification error. (A sketch of the data generation and these k-means functions follows this question.)

(c) Next, implement the EM algorithm for Gaussian mixtures. Write four functions: normal_density, log_likelihood, em_e_step, and em_m_step as given in the lecture. Identify the correct arguments, and the order in which to run them. Initialize the algorithm with means as in (4.1), covariances with $\hat{\Sigma}_1 = \hat{\Sigma}_2 = I$, and mixing proportions $\hat{\pi}_1 = \hat{\pi}_2$. Run the algorithm until convergence and show the resulting cluster assignments on a scatter plot, either using different color codes or shapes or both. Also plot the log-likelihood vs. the number of iterations. Report your misclassification error.

(d) Comment on your findings, i.e., the resulting cluster assignments, the number of iterations until convergence, etc. Experiment with different data realizations (generate new data), run your algorithms, and summarize your findings. Does the algorithms' performance depend on different realizations of the data? Hint: In most cases, k-means should fail to identify the correct cluster labels due to the covariance structure. There may be realizations where EM also fails to find the clusters correctly, but in general it should work better than k-means.
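To make the setup in parts (a) and (b) concrete, here is a hedged NumPy sketch of the data generation and the three k-means functions. The means and covariance simply mirror the values stated in part (a); the EM functions in part (c) follow the same E-step/M-step pattern, just with soft responsibilities and Gaussian densities.

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) synthetic data: 200 points per class from the stated Gaussians
mu1, mu2 = np.array([0.1, 0.1]), np.array([6.0, 0.1])
cov = np.array([[10.0, 7.0], [7.0, 10.0]])
x = np.vstack([rng.multivariate_normal(mu1, cov, 200),
               rng.multivariate_normal(mu2, cov, 200)])
true_labels = np.repeat([0, 1], 200)

# (b) k-means pieces, matching the function names in the question
def cost(x, r, mu):
    """Sum of squared distances of each point to its assigned centre."""
    return np.sum(r * np.sum((x[:, None, :] - mu[None, :, :]) ** 2, axis=2))

def km_e_step(x, mu):
    """Hard-assign each point to the nearest centre."""
    d = np.sum((x[:, None, :] - mu[None, :, :]) ** 2, axis=2)   # (N, K) squared distances
    r = np.zeros_like(d)
    r[np.arange(len(x)), np.argmin(d, axis=1)] = 1.0
    return r

def km_m_step(x, r):
    """Move each centre to the mean of its assigned points."""
    return (r.T @ x) / r.sum(axis=0)[:, None]

mu = np.array([[0.0, 0.0], [1.0, 1.0]])                         # initialization (4.1)
for _ in range(100):
    r = km_e_step(x, mu)
    new_mu = km_m_step(x, r)
    if np.allclose(new_mu, mu):                                 # converged
        break
    mu = new_mu
```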
