In this assignment, we'll fit both generative and discriminative models to the MNIST dataset of handwritten numbers. Each datapoint in the MNIST [http://yann.lecun.com/exdb/mnist/] dataset is a 28×28 black-and-white image of a number in {0, ..., 9}, and a label indicating which number.
MNIST is the 'fruit fly' of machine learning, a simple standard problem useful for comparing the properties of different algorithms. Python and R code for loading and plotting MNIST is attached.
You can use whichever programming language you like, and libraries for loading and plotting data. You'll need to write your own initialization, fitting, and prediction code. You can use automatic differentiation in your code, but must still answer the gradient questions.
For this assignment, we'll binarize the dataset, converting the grey pixel values to either black or white (0 or 1), with > 0.5 being the cutoff. When comparing models, we'll need a training and test set. Use the first 10000 samples for training, and another 10000 for testing. This is all done for you in the starter code. Hint: also build a dataset of only 100 training samples to use when debugging, to make loading and training faster.
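The binarization and train/test split described above might be sketched as follows. This is a minimal stand-in, not the provided starter code: the MNIST loader is replaced with random grey values, and the array names (`images`, `x_train`, etc.) are placeholders.

```python
import numpy as np

# Hypothetical stand-in for the provided loader: random grey values in [0, 1].
# With the real starter code you would load MNIST here instead.
rng = np.random.default_rng(0)
images = rng.random((20000, 784))        # 20000 images, 28x28 flattened
labels = rng.integers(0, 10, 20000)      # class labels in {0, ..., 9}

# Binarize: a pixel is 1 (white) iff its grey value exceeds 0.5.
x = (images > 0.5).astype(np.int64)

# First 10000 samples for training, the next 10000 for testing.
x_train, y_train = x[:10000], labels[:10000]
x_test, y_test = x[10000:20000], labels[10000:20000]

# Small debugging subset of 100 training samples, as the hint suggests.
x_debug, y_debug = x_train[:100], y_train[:100]
```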

Fitting a Naïve Bayes Model, 25 pts. In this question, we'll fit a naïve Bayes model to the MNIST digits using maximum likelihood. Naïve Bayes defines the joint probability of each datapoint x and its class label c as follows:

p(x, c | θ, π) = p(c | π) p(x | c, θ_c) = p(c | π) ∏_{d=1}^{784} p(x_d | c, θ_{cd})
Here, θ is a matrix of probabilities for each pixel and each class, so its total size is 784 × 10.
For binary data, we can use the Bernoulli likelihood:

p(x_d | c, θ_{cd}) = Ber(x_d | θ_{cd}) = θ_{cd}^{x_d} (1 − θ_{cd})^{(1 − x_d)}

which is just a way of expressing that p(x_d = 1 | c, θ_{cd}) = θ_{cd}.
For p(c | π), we can just use a categorical distribution:

p(c | π) = Cat(c | π) = π_c

Note that we need ∑_{i=0}^{9} π_i = 1.
(a) Derive the maximum likelihood estimate (MLE) for the class-conditional pixel means θ. Hint: we saw in lecture that the MLE can be thought of as 'counts' for the data, so what should θ̂_{cd} be counting?


(b) Derive the maximum a posteriori (MAP) estimate for the class-conditional pixel means θ, using a Beta(2, 2) prior on each θ_{cd}. Hint: it has a simple final form, and you can ignore the Beta normalizing constant.



(c) Fit θ to the training set using the MAP estimator. Plot θ̂ as 10 separate greyscale images, one for each class.
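As a sanity check for the MLE and MAP estimates above, here is a sketch of the MAP fit. It assumes the standard result that a Beta(2, 2) prior adds one pseudo-count of 'pixel on' and one of 'pixel off' per class, i.e. θ̂_{cd} = (N_{cd} + 1) / (N_c + 2), with the MLE recovered by dropping the +1 and +2; verify this against your own derivation. The function name is invented.

```python
import numpy as np

def fit_theta_map(x, y, n_classes=10):
    """MAP estimate of per-class Bernoulli pixel means under a Beta(2, 2) prior.

    Beta(2, 2) contributes one pseudo-count of 'on' and one of 'off' per pixel,
    so theta_hat[c, d] = (N_cd + 1) / (N_c + 2), where N_cd counts training
    images of class c with pixel d on, and N_c counts images of class c.
    (The MLE is the same expression without the +1 / +2.)
    """
    n_pixels = x.shape[1]
    theta = np.empty((n_classes, n_pixels))
    for c in range(n_classes):
        xc = x[y == c]
        theta[c] = (xc.sum(axis=0) + 1.0) / (xc.shape[0] + 2.0)
    return theta

# Toy check: four one-pixel 'images', two per class.
x_toy = np.array([[1], [1], [0], [1]])
y_toy = np.array([0, 0, 1, 1])
theta_hat = fit_theta_map(x_toy, y_toy, n_classes=2)
# class 0: (2 + 1) / (2 + 2) = 0.75; class 1: (1 + 1) / (2 + 2) = 0.5
```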



(d) Derive the log-likelihood log p(c | x, θ, π) for a single training image.



(e) Given parameters θ̂ fit to the training set, and π̂_c = 1/10, report both the average log-likelihood per datapoint, (1/N) ∑_{i=1}^{N} log p(c_i | x_i, θ, π), and the accuracy on both the training and test set. The accuracy is defined as the fraction of examples where the true class c_i = argmax_c p(c | x_i, θ, π).
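One way to compute these reported quantities, assuming a fitted θ̂ and the uniform class prior. This is a hedged sketch: the function names are invented, and `scipy.special.logsumexp` is used for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp

def class_log_posterior(x, theta, log_pi):
    """log p(c | x) for each row of x under the naive Bayes model.

    x: (N, D) binary, theta: (C, D) pixel means, log_pi: (C,) log class priors.
    """
    # log p(x, c) = log pi_c + sum_d [x_d log theta_cd + (1 - x_d) log(1 - theta_cd)]
    log_joint = log_pi + x @ np.log(theta).T + (1 - x) @ np.log1p(-theta).T
    # Normalize over classes to get log p(c | x).
    return log_joint - logsumexp(log_joint, axis=1, keepdims=True)

def avg_loglik_and_accuracy(x, y, theta, log_pi):
    """Average log p(c_i | x_i) and the fraction of correct argmax predictions."""
    log_post = class_log_posterior(x, theta, log_pi)
    avg_ll = log_post[np.arange(len(y)), y].mean()
    acc = (log_post.argmax(axis=1) == y).mean()
    return avg_ll, acc
```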


Generating from a Naïve Bayes Model, 25 pts. Defining a joint probabilistic model over the data lets us generate new data, and also lets us answer all sorts of queries about the data.


True or false: Given this model's assumptions, any two pixels x_{i} and x_{j} where i ≠ j are independent given c.

True or false: Given this model's assumptions, any two pixels x_{i} and x_{j} where i ≠ j are independent when marginalizing over c.



Using the parameters θ̂ fit in question 1, produce random image samples from the model. That is, randomly sample and plot 10 binary images from the marginal distribution p(x | θ̂, π̂). Hint: first sample c.
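The ancestral-sampling hint (first sample c, then each pixel) could look like the following sketch; `sample_images` is an invented name.

```python
import numpy as np

def sample_images(theta, pi, n_samples, rng=None):
    """Ancestral sampling from p(x | theta, pi): first draw c ~ Cat(pi),
    then each pixel x_d ~ Bernoulli(theta[c, d]) independently."""
    rng = np.random.default_rng(rng)
    c = rng.choice(len(pi), size=n_samples, p=pi)          # class per sample
    # One uniform draw per pixel, thresholded against the class's pixel means.
    return (rng.random((n_samples, theta.shape[1])) < theta[c]).astype(int)
```

Reshaping each returned row to 28×28 gives an image you can plot.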


One of the advantages of generative models is that they can handle missing data, or be used to answer different sorts of questions about the model. Derive p(x_{i∈bottom} | x_{top}, θ, π), the marginal distribution of a single pixel in the bottom half of an image given the top half, conditioned on your fit parameters. Hint: you'll have to marginalize over c.
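After marginalizing over c, the computation should reduce to weighting each class's pixel means by the posterior p(c | x_top). A sketch under that assumption (function and argument names are invented):

```python
import numpy as np
from scipy.special import logsumexp

def bottom_half_marginals(x_top, theta, log_pi, top_idx, bottom_idx):
    """p(x_d = 1 | x_top) for each bottom-half pixel d, marginalizing over c.

    x_top: (D_top,) observed binary top half of one image.
    theta: (C, D) pixel means, log_pi: (C,) log class priors.
    top_idx / bottom_idx: index arrays selecting the two halves of D.
    """
    th_top = theta[:, top_idx]
    # log p(c, x_top) for each class c, using only the observed top pixels.
    log_joint = log_pi + (x_top * np.log(th_top)
                          + (1 - x_top) * np.log1p(-th_top)).sum(axis=1)
    # Posterior weights p(c | x_top).
    post = np.exp(log_joint - logsumexp(log_joint))
    # Mixture of class-conditional means: sum_c p(c | x_top) * theta_cd.
    return post @ theta[:, bottom_idx]
```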



For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half. I.e., the bottom half of the image should use greyscale to represent the marginal probability of each pixel being on.


Logistic Regression, 25 pts. Now, we'll fit a simple predictive model using gradient descent. Our model will be multiclass logistic regression:

p(c | x, w) = exp(w_c^T x) / ∑_{c'=0}^{9} exp(w_{c'}^T x)
Omit bias parameters for this question.

How many parameters does this model have?

Derive the gradient of the predictive log-likelihood w.r.t. w: ∇_w log p(c | x, w).

Code up a gradient-based optimizer of your choosing (it can be just vanilla gradient descent), and use it to optimize w to maximize the log-likelihood of the training set, and plot the resulting parameters using one image per class. Since this objective is concave, you can initialize at all zeros. You can use the gradient ascent (since it is a maximization) updates as covered in class, or use automatic differentiation (autograd). However, you are not permitted to use optimizers which come built in to packages! Hint: use scipy.special.logsumexp or its equivalent to make your code more numerically stable. Also, try to avoid nested for loops, and instead use matrix operations to keep your code fast.
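A minimal sketch of such an optimizer. It assumes the usual softmax gradient (1[y_i = c'] − p(c' | x_i, w)) x_i, which you should confirm against your own derivation in the previous part; all function names are invented.

```python
import numpy as np
from scipy.special import logsumexp

def avg_log_likelihood(w, x, y):
    """Mean log p(y_i | x_i, w); w is (C, D), x is (N, D)."""
    logits = x @ w.T                          # (N, C)
    log_z = logsumexp(logits, axis=1)         # stable log normalizer per example
    return (logits[np.arange(len(y)), y] - log_z).mean()

def gradient(w, x, y):
    """Gradient of the mean log-likelihood w.r.t. w: for class c',
    (1/N) * sum_i (1[y_i = c'] - p(c' | x_i, w)) x_i."""
    logits = x @ w.T
    probs = np.exp(logits - logsumexp(logits, axis=1, keepdims=True))
    onehot = np.eye(w.shape[0])[y]
    return (onehot - probs).T @ x / len(y)

def fit(x, y, n_classes, lr=0.1, n_steps=500):
    # Concave objective, so an all-zeros initialization is fine.
    w = np.zeros((n_classes, x.shape[1]))
    for _ in range(n_steps):
        w += lr * gradient(w, x, y)           # gradient *ascent*
    return w
```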

Given parameters fit to the training set, report both the average log-likelihood per datapoint, and the accuracy on both the training and test set. How does it compare to Naïve Bayes?

Unsupervised Learning, 25 pts. In this problem, we will experiment with two clustering algorithms: (i) k-means, and (ii) the EM algorithm for Gaussian mixtures. In what follows, N denotes the number of data points, D denotes the dimension of each data point, x_i ∈ R^D denotes a single data point, K denotes the number of clusters, and x_i ~ p(x | C_k) for k = 1, ..., K denotes the data generating distribution.

First, we will generate some data for this problem. Set the number of points N = 400, their dimension D = 2, and the number of clusters K = 2, and generate data from the distribution p(x | C_k) = N(μ_k, Σ_k). Sample 200 data points for C_1 and 200 for C_2, with

μ_1 = [0.1, 0.1]^T,  μ_2 = [6.0, 0.1]^T,  and  Σ_1 = Σ_2 = [[10, 7], [7, 10]]
Here, N = 400. Since you generated the data, you already know which sample comes from which class. Make a scatter plot of the data points showing the true cluster assignment of each point, either using different colour codes or shapes or both.
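The data generation might be sketched as below, assuming the parameters μ_1 = (0.1, 0.1), μ_2 = (6.0, 0.1), and Σ_1 = Σ_2 = [[10, 7], [7, 10]] specified above; the plotting calls are left commented out so the script runs headless.

```python
import numpy as np

rng = np.random.default_rng(0)

mu1, mu2 = np.array([0.1, 0.1]), np.array([6.0, 0.1])
sigma = np.array([[10.0, 7.0], [7.0, 10.0]])          # shared covariance

x1 = rng.multivariate_normal(mu1, sigma, size=200)    # samples from C_1
x2 = rng.multivariate_normal(mu2, sigma, size=200)    # samples from C_2
data = np.vstack([x1, x2])                            # (400, 2)
true_labels = np.repeat([0, 1], 200)

# Scatter plot with true assignments (colours distinguish the classes):
# import matplotlib.pyplot as plt
# plt.scatter(data[:, 0], data[:, 1], c=true_labels, s=8)
# plt.show()
```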

Now, we assume that the true class labels are not known. Implement the k-means algorithm for this problem. Write three functions: cost, km_e_step, and km_m_step as given in the lecture (here, km_ means k-means). Identify the correct arguments, and the order to run them. Initialize the algorithm with

μ̂_1 = [0.0, 0.0]^T,  μ̂_2 = [1.0, 1.0]^T    (4.1)

and run it until convergence. Show the resulting cluster assignments on a scatter plot, either using different colour codes or shapes or both. Also plot the cost vs. the number of iterations. Report your misclassification error.
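A sketch of the three k-means functions and a loop that runs them until the cost stops changing. The signatures are one reasonable choice, not necessarily the lecture's exact ones, and empty clusters are not handled.

```python
import numpy as np

def cost(data, assignments, centers):
    """Sum of squared distances from each point to its assigned center."""
    return ((data - centers[assignments]) ** 2).sum()

def km_e_step(data, centers):
    """Assign each point to its nearest center."""
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def km_m_step(data, assignments, k):
    """Move each center to the mean of its assigned points."""
    return np.array([data[assignments == j].mean(axis=0) for j in range(k)])

def kmeans(data, centers, n_iters=100):
    costs = []
    for _ in range(n_iters):
        assignments = km_e_step(data, centers)
        centers = km_m_step(data, assignments, len(centers))
        costs.append(cost(data, assignments, centers))
        if len(costs) > 1 and costs[-1] == costs[-2]:   # converged
            break
    return assignments, centers, costs
```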

Next, implement the EM algorithm for Gaussian mixtures. Write four functions: normal_density, loglikelihood, em_e_step, and em_m_step as given in the lecture. Identify the correct arguments, and the order to run them. Initialize the algorithm with means as in (4.1), covariances with Σ̂_1 = Σ̂_2 = I, and π̂_1 = π̂_2. Run the algorithm until convergence and show the resulting cluster assignments on a scatter plot, either using different colour codes or shapes or both. Also plot the log-likelihood vs. the number of iterations. Report your misclassification error.
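The four EM functions might be sketched as follows. The signatures are one plausible reading of the lecture's names, not the definitive ones.

```python
import numpy as np

def normal_density(x, mu, sigma):
    """Multivariate normal pdf evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)   # Mahalanobis terms
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def loglikelihood(x, pis, mus, sigmas):
    """log p(data) = sum_n log sum_k pi_k N(x_n; mu_k, sigma_k)."""
    dens = np.stack([p * normal_density(x, m, s)
                     for p, m, s in zip(pis, mus, sigmas)], axis=1)
    return np.log(dens.sum(axis=1)).sum()

def em_e_step(x, pis, mus, sigmas):
    """Responsibilities r[n, k] = p(z_n = k | x_n)."""
    dens = np.stack([p * normal_density(x, m, s)
                     for p, m, s in zip(pis, mus, sigmas)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

def em_m_step(x, r):
    """Update mixing weights, means, and covariances from responsibilities."""
    nk = r.sum(axis=0)                      # effective counts per cluster
    pis = nk / len(x)
    mus = (r.T @ x) / nk[:, None]
    sigmas = []
    for k in range(r.shape[1]):
        diff = x - mus[k]
        sigmas.append((r[:, k, None] * diff).T @ diff / nk[k])
    return pis, mus, np.array(sigmas)
```

Running e-step then m-step in a loop, tracking `loglikelihood` each iteration, gives the requested convergence plot.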

Comment on your findings, i.e., the resulting cluster assignments, number of iterations until convergence, etc. Experiment with different data realizations (generate new data), run your algorithms, and summarize your findings. Does the algorithm performance depend on different realizations of the data? Hint: in most cases, k-means should fail to identify the correct cluster labels due to the covariance structure. There may be realizations for which EM also fails to find the clusters correctly, but in general it should work better than k-means.