Solved–Homework 1– Solution

$35.00 $24.00

Instructions Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by 11:59 PM of the due date. On Gradescope, be sure to select the pages containing your answer for each problem. More details can be found on the Gradescope Student Workflow help page: https://gradescope.com/help#help-center-section-student-workflow (If you don’t select the…

You’ll get a: . zip file solution

 

 
Categorys:
Tags:

Description

5/5 – (2 votes)

Instructions

Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by 11:59 PM of the due date.

On Gradescope, be sure to select the pages containing your answer for each problem. More details can be found on the Gradescope Student Workflow help page:

(If you don’t select the pages containing your answer to a problem, you’ll receive a zero for that problem.)

Make sure your name and your UNI appears prominently on the first page of your write-up.

Source code

Please combine all requested source code files into a single ZIP file1, along with a plain text file called README that contains your name and briefly describes all of the other files in the ZIP file. Do not include the data files. Submit this ZIP file on Courseworks.

Clarity and precision

One of the goals in this class is for you to learn to reason about machine learning problems and algorithms.

To reason about these things, you must be able to make clear and precise claims and arguments about them.

A clear and precise argument is not the same as a long, excessively detailed argument. Unnecessary details and irrelevant side-remarks often make an argument less clear. And non-factual statements also detract from the clarity of an argument.

Points may be deducted for answers and arguments that lack sufficient clarity or precision. Moreover, a time-economical attempt will be made to interpret such answers/arguments, and the grade you receive will be based on this interpretation.

Problem 1 (30 points)

In this problem, you will implement and evaluate the nearest neighbor rule.

MNIST data set

Download the MNIST data set of handwritten digit images ocr.mat from the course website, and load it into Python:

from scipy.io import loadmat

ocr = loadmat(‘ocr.mat’)

The 60000 unlabeled training data (i.e., feature vectors) are contained in a matrix called data (one data point per row), and the corresponding labels are in a vector called labels. The 10000 test feature vectors and labels are in, respectively, testdata and testlabels.

In Python, you can view an image (say, the first one) with the following commands:

import matplotlib.pyplot as plt

from matplotlib import cm

plt.imshow(ocr[‘data’][0].reshape((28,28)), cmap=cm.gray_r)

plt.show()

Nearest neighbor classifier

Let the training examples be denoted by (x1, y1), . . . , (xn, yn) (for n = 60000), where each xi Rd and each yi {0, . . . , 9}.

The nearest neighbor classifier fˆ: Rd → {0, . . . , 9} is a function defined in terms of the 60000 training examples as follows. On input x Rd, let i(x) := arg mini{1,…,n} kxxi k2, breaking ties arbitrarily (i.e., i(x) is any i {1, . . . , n} for which kx xi(x)k2 mini{1,…,n} kx xik2); and return yi(x).2

Write a program in Python that implements the nearest neighbor classifier. Your program should take as input a matrix of training feature vectors X and a vector of corresponding labels Y, as well as a matrix of test feature vectors test. The output of the program should be a vector of the predicted labels preds for all test points, as provided by the nearest neighbor classifier fˆ based on the examples in X and Y.

You should not use (or look at the source code for) any library functions for computing all pairwise distances between entire sets of vectors, or for computing nearest neighbor queries—that would defeat the purpose of this assignment. However, you can use functions for computing inner products between vectors and norms of vectors, as well as other basic matrix and vector operations. If in doubt what is okay to use, just ask.

In order to make your code efficient and not take a long time to run, you should use matrix/vector operations (rather than, say, a bunch of explicit for-loops). Think of the n training feature vectors as being stacked as the rows of a matrix X Rn×d, and the m test feature vectors as the rows of another matrix T Rm×d. If the i-th row of T is t>i and the j-th row of X is x>j, then the squared Euclidean distance between ti and xj is

ktixjk22 = (tixj)>(tixj) = t>iti − 2t>ixj + x>jxj.

Can you leverage this identity to speed-up computation of the nearest neighbor classifier?3

  • The expression “arg minaA f(a)” denotes the set of minimizers of the function f among all elements in the set A. On occasion, as we do here, we’ll be a little sloppy and write expressions of the form “aˆ := arg minaA f(a)”. This will always means that aˆ is some minimizer of f within A, rather than the set of minimizers. For this to really make sense, though, it must be made clear (either explicitly or implicitly) what we mean in the case that there are multiple minimizers (or if there are none!).

3In order to take advantage of matrix/vector operations, your Python installation needs to be using optimized linear algebra libraries, such as ATLAS, MKL, or the Accelerate/vecLib framework on OS X. If you installed Python using Anaconda, then you should already have MKL.

2Evaluating the nearest neighbor classifier

Instead of using your nearest neighbor program with data and labels as the training data, do the following.

For each value n {1000, 2000, 4000, 8000}:

  1. Draw n points from data, together with their corresponding labels, uniformly at random. Use sel = random.sample(range(60000,n) (after import random), ocr[‘data’][sel].astype(‘float’), and ocr[‘labels’][sel] to select the examples.4

  1. Use these n points as the training data and testdata as the test points; compute the fraction of test examples on which the nearest neighbor classifier predicts the label incorrectly (i.e., the test error rate).

A plot of the test error rate (on the y-axis) as a function of n (on the x-axis) is called a learning curve.

Carry out the (random) process described above ten times, independently. Produce a learning curve plot using the average of these test error rates (that is, averaging over ten repetitions). Add error bars to your plot that extend to one standard deviation above and below the means. Ensure the plot axes are properly labeled.

What to submit in your write-up:

  1. A brief description of how you implemented the nearest neighbor classifier (especially the “arg min”). Note that it is fine to use numpy.argmin, but you should explain precisely how you use it in your code to efficiently implement the nearest neighbor classification of new (test) points.

  1. Learning curve plot. (Please include the plot directly in the PDF write-up.)

  1. What would the learning curve look like if, instead of test error rate, you computed training error rate?

Please submit your source code on Courseworks.

Problem 2 (35 points)

In this problem, you will practice deriving formulae for maximum likelihood estimators.

Consider a statistical model for iid random variables X1, . . . , Xn, parameterized by θ Θ. The distribution Pθ of X1 is specified in each part below.

In each of the first two cases, derive a simple formula for the MLE for θ (given data (X1, . . . , Xn) = (x1, . . . , xn)), and briefly justify each step of the derivation.

(a) X1 Pθ has probability density function fθ satisfying

fθ(x) x2ex/θ for all x > 0,

and fθ(x) = 0 for all x ≤ 0.5 The parameter space is the positive reals Θ = {θ R : θ > 0}.

(b) X1 Pθ has probability density function fθ satisfying

fθ(x) 1 for all 0 ≤ xθ,

and fθ(x) = 0 for all x < 0 and all x > θ. The parameter space is the positive reals Θ = {θ R : θ > 0}.

In each of the next two cases, specific values of the data are given as follows:

(X1, . . . , Xn) = (0, 1, 1, 0, 2, 2, 1, 2, 1, 0, 2, 0, 2, 2, 0, 3, 0, 0, 0, 2).

Derive the actual MLE value of θ given this data (as opposed to an abstract formula), and briefly justify each step of the derivation.

(c) X1 Pθ has probability density function fθ satisfying

fθ(x) e−(xθ)2/(2θ2) for all x R.

The parameter space is Θ = {θ R : θ 6= 0} (i.e., all real numbers θ such that θ 6= 0).

(d) X1 Pθ has probability mass function pθ satisfying

pθ(0) =

1

, pθ(1) = eθ, pθ(2) = 2eθ,

pθ(3) =

1

3eθ.

2

2

The parameter space is Θ = {θ R : θ < − ln(6)}.

  • The symbol “” means “proportional to”. Remember, probability density functions should integrate to one.

Problem 3 (35 points)

In this problem, you will implement a learning algorithm based on generative models for classification, apply the learning algorithm to a simple data set, and quantitatively and qualitatively study the learned classifiers.

Download the “20 Newsgroups data set” news.mat from Courseworks. The training feature vectors/labels and test feature vectors/labels are stored as data/labels and testdata/testlabels. Each data point

corresponds to a message posted to one of 20 different newsgroups (i.e., message boards). The representation of a message is a (sparse) binary vector in X := {0, 1}d (for d := 61188) that indicates the words that are present in the message. If the j-th entry in the vector is 1, it means the message contains the word that is given on the j-th line of the text file news.vocab. The class labels are Y := {1, 2, . . . , 20}, where the mapping from classes to newsgroups is in the file news.groups (which we won’t actually need).

In this problem, you’ll develop a classifier based on a Naive Bayes generative model. Here, we use class

conditional distributions with probability mass functions of the form pµ(x) = Qd µxj (1 µj)1−xj for

j=1 j

  • = (x1, x2, . . . , xd) X . The parameter vector is µ = (µ1, µ2, . . . , µd) must come from the parameter space [0, 1]d. Since there are 20 classes, the generative model is actually parameterized by 20 such vectors,

µy = (µy,1, µy,2, . . . , µy,d) for each y Y, as well as the class prior parameters, πy for each y Y. The class

P

prior parameters, of course, must satisfy πy [0, 1] for each y Y and y Y πy = 1.

  1. Give the formula for the MLE of the parameter µy,j based on training data {(xi, yi)}ni=1. (Remember, each unlabeled point is a vector: xi = (xi,1, xi,2, . . . , xi,d) {0, 1}d.)

  2. MLE is not a good estimator for the class conditional parameters if the estimate turns out to be zero or one. An alternative is the following estimator based on a technique called Laplace smoothing:

µˆy,j :=

n

1{yi = y}xi,j /

2 +

n

1{yi = y}

(0, 1).

1 + i=1

i=1

X

X

Write code for training and testing a classifier based on the Naive Bayes generative model described above. Use Laplace smoothing to estimate class conditional distribution parameters, and MLE for class

prior parameters. You should not use or look at any existing implementation (e.g., those that may be provided as library functions). Using your code, train and test a classifier with the data from news.mat.

What to submit: training and test error rates.

(c) Consider the binary classification problem, where newsgroups {1, 16, 20} comprise the “negative class” (class 0), and newsgroups {17, 18, 19} comprise the “positive class” (class 1). Newsgroups {1, 16, 20} are “religious” topics, and newsgroups {17, 18, 19} are “political” topics. Modify the data in news.mat to create the training and test data sets for this problem. Using these data and your codes from part (b), train and test a Naive Bayes classifier. Report the training and test error rates. Save the learned classifier for part (d).

What to submit: training and test error rates.

(d) The classifier you learn is ultimately an affine classifier, which means it has the following form:

(1

if α0 +

Pj=1 αjxj > 0

0

if α0 +

d

αjxj 0

x

j=1

7→

d

P

for some real numbers α0, α1, . . . , αd. Determine the values of these αj’s for your learned classifier from part (c). Then, report the vocabulary words whose indices j {1, 2, . . . , d} correspond to the 20 largest (i.e., most positive) αj value, and also the vocabulary words whose indices j {1, 2, . . . , d} correspond to the 20 smallest (i.e., most negative) αj value. Don’t report the indices j’s, but rather the actual vocabulary words (from news.vocab).

What to submit: two ordered lists (clearly labeled) of 20 words each.

Please also submit your source code on Courseworks.

5