Na•ve Bayes, Hidden Markov Models Solution



  • Na•ve Bayes over Multinomial Distribution [30 pts]

In this question, we will look into training a na•ve Bayes classi er with a model that uses a multino-mial distribution to represent documents. Assume that all the documents are written in a language which has only three words a, b, and c. All the documents have exactly n words (each word can be either a, b, or c). We are given a labeled document collection fD1; D2; : : : ; Dmg. The label yi of document Di is 1 or 0, indicating whether Di is \good” or \bad”.

This model uses the multinominal distribution in the following way: Given the ith document Di, we denote by ai (respectively, bi, ci) the number of times that word a (respectively, b, c) appears in Di. Therefore, ai + bi + ci = jDij = n. We de ne

Pr(Dijy = 1) =


1ai 1bi 1ci


where 1 (respectively, 1, 1) is the probability that word a (respectively, b, c) appears in a \good” document. Therefore, 1 + 1 + 1 = 1. Similarly,

Pr(Dijy = 0) =


0ai 0bi 0ci


where 0 (respectively, 0, 0) is the probability that word a (respectively, b, c) appears in a \bad” document. Therefore, 0 + 0 + 0 = 1.

  1. (3 pts) What information do we lose when we represent documents using the aforementioned model?

  1. (7 pts) Write down the expression for the log likelihood of the document Di, log Pr(Di; yi). Assume that the prior probability, P r(yi = 1) is .

  1. (20 pts) Derive the expression for the maximum likelihood estimates for parameters 1, 1, 1, 0, 0, and 0.

Submission note: You need not show the derivation of all six parameters separately. Some parameters are symmetric to others, and so, once you derive the expression for one, you can directly write down the expression for others.

Grading note: 5 points for the derivation of one of the parameters, 3 points each for the remaining ve parameter expressions.

  • Hidden Markov Models [15 pts]

Consider a Hidden Markov Model with two hidden states, f1; 2g, and two possible output symbols, fA; Bg. The initial state probabilities are

1 = P (q1 = 1) = 0:49 and 2 = P (q1 = 2) = 0:51;

the state transition probabilities are

q11 = P (qt+1 = 1jqt = 1) = 1 and q12 = P (qt+1 = 1jqt = 2) = 1;

and the output probabilities are

e1(A) = P (Ot = Ajqt = 1) = 0:99 and e2(B) = P (Ot = Bjqt = 2) = 0:51:

Throughout this problem, make sure to show your work to receive full credit.

  1. (5 pts) There are two unspeci ed transition probabilities and two unspeci ed output prob-abilities. What are the missing probabilities, and what are their values?

  1. (5 pts) What is the most frequent output symbol (A or B) to appear in the rst position of sequences generated from this HMM?

  1. (5 pts) What is the sequence of three output symbols that has the highest probability of being generated from this HMM model?

  • Facial Recognition by using K-Means and K-Medoids [55 pts]

Machine learning techniques have been applied to a variety of image interpretation problems. In this problem, you will investigate facial recognition, which can be treated as a clustering problem (\separate these pictures of Joe and Mary”).

For this problem, we will use a small part of a huge database of faces of famous people (Labeled Faces in the Wild [LFW] people dataset1). The images have already been cropped out of the original image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally, we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of 1850 \raw” features). Our dataset has a total of 1867 images of 19 di erent people. You will explore clustering methods such as k-means and k-medoids to the problem of facial recognition on this dataset.

Download the starter les from the course website. It contains the following source les: { Utility methods for manipulating data. { Code for the Point, Cluster, and ClusterSet classes. { Main code

Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage you to include your own additional methods and functions. However, you are not allowed to use any scikit-learn classes or functions other than those already imported in the skeleton code.

We will explore clustering algorithms in detail by applying them to a toy dataset. In particular, we will investigate k-means and k-medoids (a slight variation on k-means).

  1. (5 pts) In k-means, we attempt to nd k cluster centers j 2 Rd, j 2 f1; : : : ; kg and n

cluster assignments c(i) 2 f1; : : : ; kg, i 2 f1; : : : ; ng, such that the total distance between each data point and the nearest cluster center is minimized. In other words, we attempt to nd 1; : : : ; k and c(1); : : : ; c(n) that minimizes



J(c; ) = jjx(i) c(i) jj2:


To do so, we iterate between assigning x(i) to the nearest cluster center c(i) and updating each cluster center j to the average of all points assigned to the jth cluster.

Instead of holding the number of clusters k xed, one can think of minimizing the objective function over , c, and k. Show that this is a bad idea. Speci cally, what is the minimum possible value of J(c; ; k)? What values of c, , and k result in this value?

  1. (10 pts) To implement our clustering algorithms, we will use Python classes to help us de ne three abstract data types: Point, Cluster, and ClusterSet (available in Read through the documentation for these classes. (You will be using these classes later, so make sure you know what functionality each class provides!) Some of the class methods are already implemented, and other methods are described in comments. Implement all of the methods marked TODO in the Cluster and ClusterSet classes.

  1. (20 pts) Next, implement random_init(…) and kMeans(…) based on the provided spec-i cations.

  1. (5 pts) Now test the performance of k-means on a toy dataset.

Use generate_points_2d(…) to generate three clusters each containing 20 points. (You can modify generate_points_2d(…) to test di erent inputs while debugging your code, but be sure to return to the initial implementation before creating any plots for submission.) You can plot the clusters for each iteration using the plot_clusters(…) function.

In your writeup, include plots for the k-means cluster assignments and corresponding cluster \centers” for each iteration when using random initialization.

  1. (10 pts) Implement kMedoids(…) based on the provided speci cation.

Hint: Since k-means and k-medoids are so similar, you may nd it useful to refactor your code to use a helper function kAverages(points, k, average, init= random , plot=True),

where average is a method that determines how to calculate the average of points in a cluster (so it can take on values ClusterSet.centroids or ClusterSet.medoids).2

As before, include plots for k-medoids clustering for each iteration when using random ini-tialization.

  1. (5 pts) Finally, we will explore the e ect of initialization. Implement cheat_init(…).

Now compare clustering by initializing using cheat_init(…). Include plots for k-means and k-medoids for each iteration.

  • In Python, if you have a function stored to the variable func, you can apply it to parameters arg by callling func(arg). This works even if func is a class method and arg is an object that is an instance of the class.