HOMEWORK 3 V3 Solution



  1. Fitting a Na ve Bayes Model, 40 pts. In this question, we’ll t a Na ve Bayes model to the MNIST digits using maximum likelihood. The starter code will download the dataset and parse it for you: Each training sample (t(i); x(i)) is composed of a vectorized binary image x(i) 2 f0; 1g784, and 1-of-10 encoded class label t(i), i.e., t(ci) = 1 means image i belongs to class c.

For p(c j ) = c and p(xj = 1 j c; ; ) = jc, Na ve Bayes de nes the joint probability of the each data point x and its class label c as follows:



p(x; c j ; ) = p(c j ; )p(x j c; ; ) = p(c j ) p(xj j c; jc):


Here, is a matrix of probabilities for each pixel and each class, so its dimensions are 784 10 (Note that in the lecture, we simpli ed notation and didn’t write the probabilities conditioned on the parameters, i.e. p(cj ) is written as p(c) in lecture slides).

For binary data, we can write the Bernoulli likelihood as

(1.1) p(xj j c; jc) = jcxj (1 jc)(1 xj);

which is just a way of expressing p(xj = 1jc; jc) = jc and p(xj = 0jc; jc) = 1 jc in a compact form. For the prior p(t j ), we use a categorical distribution (generalization of Bernoulli distribution to multi-class case),




p(tc = 1 j ) = p(c j ) = c or equivalently p(t j ) = j=0 j where

i = 1;


where p(c j ) and p(t j ) can be used interchangeably. You will t the parameters and using MLE and MAP techniques, and both cases below, your tting procedure can be written as a few simple matrix multiplication operations.

  1. First, derive the maximum likelihood estimator (MLE) for the class-conditional pixel prob-abilities and the prior . Hint-1: We saw in lecture that MLE can be thought of as ‘ratio



of counts’ for the data, so what should jc be counting? Derivations should be rigorous. Hint-2: Similar to the binary case, when calculating the MLE for j for j = 0; 1; :::; 8, write


p(t(i) j ) = 9j=0 jj and in the log-likelihood replace 9 = 1 8j=0 j, and then take derivatives w.r.t. j. This will give you the ratio ^j=^9 for j = 0; 1; ::; 8. You know that ^j’s

sum up to 1.

  1. Derive the log-likelihood log p(tjx; ; ) for a single training image.

  1. Fit the parameters and using the training set with MLE, and try to report the average







log-likelihood per data point

i=1 log p(t

; ; ^), using Equation (1.1). What goes


wrong? (it’s okay if you can’t compute the average log-likelihood here).



Plot the MLE estimator as 10 separate greyscale images, one for each class.


Derive the Maximum A posteriori Probability (MAP) estimator for the class-conditional

pixel probabilities , using a Beta(3, 3) prior on each jc. Hint: it has a simple nal form,

and you can ignore the Beta normalizing constant.


Fit the parameters and using the training set with MAP estimators from previous part,







and report both the average log-likelihood per data point,

i=1 log p(t

; ; ^), and


the accuracy on both the training and test set. The accuracy is de ned as the fraction of


examples where the true class is correctly predicted using c^ = argmaxc log p(tc = 1jx; ; ^).



Plot the MAP estimator as 10 separate greyscale images, one for each class.

  1. Generating from a Na ve Bayes Model, 30 pts. De ning a joint probability distri-bution over the data lets us generate new data, and also lets us answer all sorts of queries about the data. This is why these models are called Generative Models. We will use the Na ve Bayes model trained in previous question to generate data.

  1. True or false: Given this model’s assumptions, any two pixels xi and xj where i 6= j are independent given c.

  2. True or false: Given this model’s assumptions, any two pixels xi and xj where i 6= j are independent after marginalizing over c.

  1. Using the parameters t using MAP in Question 1, produce random image samples from the model. That is, randomly sample and plot 10 binary images from the marginal distribution

^ ^

p(xj ; ^). Hint: To sample from p(x j ; ^), rst sample random variable c from p(c j ^) using j ^

np.random.choice, then depending on the value of c, sample xj from p(xj c; jc) for j =

1; :::; 784 using np.random.binomial(1,..). These functions can take matrix probabilities as input, so your solution to this part should be a few lines of code.

  1. (Optional – 0 pts) One of the advantages of generative models is that they can handle missing data, or be used to answer di erent sorts of questions about the model. Derive p(xbottomjxtop; ; ), the marginal distribution of a single pixel in the bottom half of an image given the top half, conditioned on your t parameters. Hint: you’ll have to marginalize over c.

  1. (Optional – 0 pts) For 20 images from the training set, plot the top half the image concate-nated with the marginal distribution over each pixel in the bottom half. i.e. the bottom half of the image should use grayscale to represent the marginal probability of each pixel being 1 (darker for values close to 1).


  1. Principal Component Analysis, 30 pts. Using the numpy data le digits.npy and the utils.py dataloading helper code, you will nd 6 sets of 16 16 greyscale images in vector format (the pixel intensities are between 0 and 1 and were read into the vectors in a raster-scan manner). The images contain handwritten 2’s and 3’s, scanned from postal envelopes. train2 and train3 contain examples of 2’s and 3’s respectively to be used for training. There are 300 examples of each digit, stored as 300 256 matrices. Note that each data vector is a row of data matrices returned by load_data function. valid2 and valid3 contain data to be used for validation (100 examples of each digit) and test2 and test3 contain test data to be used for nal evaluation only (200 examples of each digit).

Apply the PCA algorithm to the 600 x 256 digit images (computing all 256 of the eigenvalues and eigenvectors, don’t forget to center the data). Then you should plot the (sorted) eigenvalues as a descending curve. This plot shows the spectrum of the data, and roughly tells you how much variance is contained along each eigenvector direction. Then view the rst 3 eigen-images (reshape each of the rst 3 eigenvectors and use imagesc to see these as images) as well as the mean of data. This part is for you to gain some intuition about how PCA works. You do not need to write this part up!

  1. For each image in the validation set, subtract the mean of training data and project it into the low-dimensional space spanned by the rst K principal components of training data. After projection, use a 1-NN classi er on K dimensional features (the code vectors) to classify the digit in the low-dimensional space. You need to implement the classi er yourself. You will do the classi cation under di erent K values to see the e ect of K. Here, let K = 2, 5, 10, 20, 30, and under each K, classify the validation digits using 1-NN. Plot the results, where the plot should show the curve of validation set classi cation error rates versus number of eigenvectors you keep, i.e., K.

  1. If you wanted to choose a particular model from your experiments as the best, which model (number of eigenvectors) would you select? Why?

  1. Report the performance of your nal classi er over the test data.