
Naïve Bayes over Multinomial Distribution [30 pts]
In this question, we will look into training a naïve Bayes classifier with a model that uses a multinomial distribution to represent documents. Assume that all the documents are written in a language which has only three words: a, b, and c. All the documents have exactly n words (each word can be either a, b, or c). We are given a labeled document collection {D_{1}, D_{2}, ..., D_{m}}. The label y_{i} of document D_{i} is 1 or 0, indicating whether D_{i} is "good" or "bad".
This model uses the multinomial distribution in the following way: given the i^{th} document D_{i}, we denote by a_{i} (respectively, b_{i}, c_{i}) the number of times that word a (respectively, b, c) appears in D_{i}. Therefore, a_{i} + b_{i} + c_{i} = |D_{i}| = n. We define

Pr(D_{i} | y = 1) = (n! / (a_{i}! b_{i}! c_{i}!)) α_{1}^{a_{i}} β_{1}^{b_{i}} γ_{1}^{c_{i}},

where α_{1} (respectively, β_{1}, γ_{1}) is the probability that word a (respectively, b, c) appears in a "good" document. Therefore, α_{1} + β_{1} + γ_{1} = 1. Similarly,

Pr(D_{i} | y = 0) = (n! / (a_{i}! b_{i}! c_{i}!)) α_{0}^{a_{i}} β_{0}^{b_{i}} γ_{0}^{c_{i}},

where α_{0} (respectively, β_{0}, γ_{0}) is the probability that word a (respectively, b, c) appears in a "bad" document. Therefore, α_{0} + β_{0} + γ_{0} = 1.
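As a quick numerical sanity check (not part of the assignment), the class-conditional likelihood above can be evaluated directly. The sketch below uses hypothetical word counts and parameter values, since the problem specifies none:

```python
from math import factorial

def multinomial_likelihood(a_i, b_i, c_i, alpha, beta, gamma):
    """Pr(D_i | y) for a document with word counts (a_i, b_i, c_i),
    where (alpha, beta, gamma) are that class's word probabilities
    and must sum to 1."""
    n = a_i + b_i + c_i
    coeff = factorial(n) // (factorial(a_i) * factorial(b_i) * factorial(c_i))
    return coeff * alpha**a_i * beta**b_i * gamma**c_i

# Hypothetical "good"-class parameters; any triple summing to 1 works.
p = multinomial_likelihood(2, 1, 1, alpha=0.5, beta=0.3, gamma=0.2)
```

Summing this likelihood over all count vectors with a_{i} + b_{i} + c_{i} = n should give 1, which is a useful check on any derivation.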

(3 pts) What information do we lose when we represent documents using the aforementioned model?

(7 pts) Write down the expression for the log-likelihood of the document D_{i}, log Pr(D_{i}, y_{i}). Assume that the prior probability Pr(y_{i} = 1) is θ.

(20 pts) Derive the expressions for the maximum likelihood estimates of the parameters α_{1}, β_{1}, γ_{1}, α_{0}, β_{0}, and γ_{0}.
Submission note: You need not show the derivation of all six parameters separately. Some parameters are symmetric to others, and so, once you derive the expression for one, you can directly write down the expression for others.
Grading note: 5 points for the derivation of one of the parameters, 3 points each for the remaining five parameter expressions.
Hidden Markov Models [15 pts]
Consider a Hidden Markov Model with two hidden states, {1, 2}, and two possible output symbols, {A, B}. The initial state probabilities are
π_{1} = P(q_{1} = 1) = 0.49 and π_{2} = P(q_{1} = 2) = 0.51;
the state transition probabilities are
q_{11} = P(q_{t+1} = 1 | q_{t} = 1) = 1 and q_{12} = P(q_{t+1} = 1 | q_{t} = 2) = 1;
and the output probabilities are
e_{1}(A) = P(O_{t} = A | q_{t} = 1) = 0.99 and e_{2}(B) = P(O_{t} = B | q_{t} = 2) = 0.51.
Throughout this problem, make sure to show your work to receive full credit.
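Once the missing entries from part (a) are filled in, the model is fully specified, so one way to sanity-check answers to the later parts is simulation. A minimal sketch (the None entries are the unspecified probabilities; filling them in only uses the fact that each probability row sums to 1):

```python
import random

# Given parameters; None marks the entries part (a) asks for.
pi = {1: 0.49, 2: 0.51}                                        # P(q_1 = state)
trans = {1: {1: 1.0, 2: None}, 2: {1: 1.0, 2: None}}           # trans[i][j] = P(q_{t+1}=j | q_t=i)
emit = {1: {'A': 0.99, 'B': None}, 2: {'A': None, 'B': 0.51}}  # emit[i][s] = P(O_t=s | q_t=i)

def fill_missing(row):
    """Solve for the single missing entry using sum(row) = 1."""
    missing = [k for k, v in row.items() if v is None]
    row[missing[0]] = 1.0 - sum(v for v in row.values() if v is not None)

for row in list(trans.values()) + list(emit.values()):
    fill_missing(row)

def sample_sequence(length, rng=random):
    """Draw one output sequence of the given length from the HMM."""
    state = 1 if rng.random() < pi[1] else 2
    out = []
    for _ in range(length):
        symbols = list(emit[state])
        weights = [emit[state][s] for s in symbols]
        out.append(rng.choices(symbols, weights)[0])
        state = rng.choices([1, 2], [trans[state][1], trans[state][2]])[0]
    return ''.join(out)
```

Tallying the first symbol over many sampled sequences, for example, should agree with whatever you derive analytically in part (b).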

(5 pts) There are two unspecified transition probabilities and two unspecified output probabilities. What are the missing probabilities, and what are their values?

(5 pts) What is the most frequent output symbol (A or B) to appear in the first position of sequences generated from this HMM?

(5 pts) What is the sequence of three output symbols that has the highest probability of being generated by this HMM?

Facial Recognition using k-Means and k-Medoids [55 pts]
Machine learning techniques have been applied to a variety of image interpretation problems. In this problem, you will investigate facial recognition, which can be treated as a clustering problem ("separate these pictures of Joe and Mary").
For this problem, we will use a small part of a huge database of faces of famous people (Labeled Faces in the Wild [LFW] people dataset^{1}). The images have already been cropped out of the original image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally, we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of 1850 "raw" features). Our dataset has a total of 1867 images of 19 different people. You will apply clustering methods such as k-means and k-medoids to the problem of facial recognition on this dataset.
Download the starter files from the course website. They contain the following source files:
util.py – Utility methods for manipulating data.
cluster.py – Code for the Point, Cluster, and ClusterSet classes.
faces.py – Main code.
Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage you to include your own additional methods and functions. However, you are not allowed to use any scikit-learn classes or functions other than those already imported in the skeleton code.
We will explore clustering algorithms in detail by applying them to a toy dataset. In particular, we will investigate k-means and k-medoids (a slight variation on k-means).

(5 pts) In k-means, we attempt to find k cluster centers μ_{j} ∈ R^{d}, j ∈ {1, ..., k}, and n cluster assignments c^{(i)} ∈ {1, ..., k}, i ∈ {1, ..., n}, such that the total distance between each data point and its nearest cluster center is minimized. In other words, we attempt to find μ_{1}, ..., μ_{k} and c^{(1)}, ..., c^{(n)} that minimize

J(c, μ) = Σ_{i=1}^{n} ||x^{(i)} − μ_{c^{(i)}}||^{2}.

To do so, we iterate between assigning each x^{(i)} to the nearest cluster center c^{(i)} and updating each cluster center μ_{j} to the average of all points assigned to the j^{th} cluster.
Instead of holding the number of clusters k fixed, one can think of minimizing the objective function over μ, c, and k. Show that this is a bad idea. Specifically, what is the minimum possible value of J(c, μ, k)? What values of c, μ, and k result in this value?
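The assignment/update iteration described above can be sketched independently of the skeleton's Point/Cluster classes; the snippet below is a plain-NumPy illustration, not a replacement for the required implementation:

```python
import numpy as np

def kmeans_step(X, mu):
    """One k-means iteration on data X (n, d) with centers mu (k, d).
    Returns (assignments c, updated centers, objective J(c, mu))."""
    # Assignment step: c^(i) = argmin_j ||x^(i) - mu_j||^2
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    c = sq_dists.argmin(axis=1)
    # Objective J(c, mu) evaluated at the current centers
    J = sq_dists[np.arange(len(X)), c].sum()
    # Update step: move each center to the mean of its assigned points
    # (an empty cluster keeps its old center in this sketch)
    new_mu = np.array([X[c == j].mean(axis=0) if (c == j).any() else mu[j]
                       for j in range(len(mu))])
    return c, new_mu, J
```

Iterating kmeans_step until c stops changing is the full algorithm; J never increases between iterations, which makes it a handy debugging invariant.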

(10 pts) To implement our clustering algorithms, we will use Python classes to help us define three abstract data types: Point, Cluster, and ClusterSet (available in cluster.py). Read through the documentation for these classes. (You will be using these classes later, so make sure you know what functionality each class provides!) Some of the class methods are already implemented, and other methods are described in comments. Implement all of the methods marked TODO in the Cluster and ClusterSet classes.

(20 pts) Next, implement random_init(…) and kMeans(…) based on the provided specifications.

(5 pts) Now test the performance of k-means on a toy dataset.
Use generate_points_2d(…) to generate three clusters, each containing 20 points. (You can modify generate_points_2d(…) to test different inputs while debugging your code, but be sure to return to the initial implementation before creating any plots for submission.) You can plot the clusters for each iteration using the plot_clusters(…) function.
In your writeup, include plots for the k-means cluster assignments and corresponding cluster "centers" for each iteration when using random initialization.

(10 pts) Implement kMedoids(…) based on the provided specification.
Hint: Since k-means and k-medoids are so similar, you may find it useful to refactor your code to use a helper function kAverages(points, k, average, init='random', plot=True),
where average is a method that determines how to calculate the average of points in a cluster (so it can take on values ClusterSet.centroids or ClusterSet.medoids).^{2}
As before, include plots for k-medoids clustering for each iteration when using random initialization.

(5 pts) Finally, we will explore the effect of initialization. Implement cheat_init(…).
Now compare clustering by initializing using cheat_init(…). Include plots for k-means and k-medoids for each iteration.

In Python, if you have a function stored in the variable func, you can apply it to an argument arg by calling func(arg). This works even if func is a class method and arg is an object that is an instance of the class.
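For instance (all names here are made up for illustration; nothing comes from the skeleton):

```python
class Greeter:
    def hello(self):
        return "hi"

def shout(s):
    return s.upper()

func = shout
result1 = func("ok")        # same as shout("ok")

func = Greeter.hello        # a class method stored in a variable
result2 = func(Greeter())   # same as Greeter().hello()
```

This is exactly what lets kAverages(…) accept ClusterSet.centroids or ClusterSet.medoids as its average argument.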