Theory Assignment 1 Solution

$30.00

Description

Instructions

Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named ‘Theory Assignment 1’ set up in which you can submit your files. Please be sure to follow all directions outlined here.

You can submit multiple times, but only the last submission counts. That means if you finish some prob-lems and want to submit something first and update later when you finish, that’s fine. In fact you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (re-member Murphy’s Law), you still get credit for whatever you have turned in.

Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submis-sion must be a single PDF. It is strongly recommended that you typeset with LATEX. There are many free integrated LATEX editors that are convenient to use (e.g Overleaf, ShareLaTeX). Choose the one(s) you like the most. This tutorial Getting to Grips with LaTeX is a good start if you do not know how to use LATEX yet.

Please also follow the rules below:

  • The file should be named as firstname lastname USCID.pdf e.g., Don Quijote de la Mancha 8675309045.pdf).

  • Do not have any spaces in your file name when uploading it.

  • Please include your name and USCID in the header of your report as well.

Collaboration: You may discuss with your classmates. However, you need to write your own solutions and submit separately. Also in your report, you need to list with whom you have discussed for each problem. Please consult the syllabus for what is and is not acceptable collaboration. Review the rules on academic conduct in the syllabus: a single instance of plagiarism can adversely affect you significantly more than you could stand to gain.

Notes on notation:

  • Unless stated otherwise, scalars are denoted by small letter in normal font, vectors are denoted by small letters in bold font and matrices are denoted by capital letters in bold font.

  • k.k means L2-norm unless specified otherwise i.e. k.k = k.k2

1

Problem 1 Nearest Neighbor Classification (8 points)

In the class, we use the Euclidean distance as the distance metric for the nearest neighbor classification.

Given data x 2 RD, the Euclidean distance between xi and xj is defined as following:

2

D

E(xi, xj) = xi

xj

= å(xid xjd)2

(1)

2

d=1

In some applications such as information retrieval and neural language processing, the cosine distance is widely applied. It is defined as:

C(xi, xj) = 1

where the norm of x is defined as

xiT xj

= 1

ådD=1(xid xjd)

,

kxik2kxjk2

kxik2kxjk2

v

u D

kxk = u å x2.

2 t d

d=1

(2)

(3)

Now you are asked to prove that for any xi and xj normalized to the unit norm, i.e. kxik2 = kxj k2 = 1, changing the distance metric from the Euclidean distance to the cosine distance will NOT affect the nearest neighbor classification results. Specifically, for any xi, xj and x o, show that, if C(x i, xj) C(xi, xo), then E(xi, xj) E(xi, xo), where kxik2 = kxjk2 = kxok2 = 1.

2

Problem 2 Nearest Neighbor Classification and Decision Trees (4 points)

Assume we have a dataset, each data x is a 100 dimensional binary vector, i.e. x 2 f0, 1g 100. Can we have a decision tree to classify the dataset with zero classification error? Can we specify a 1-NN to result in exactly the same classification as our decision tree? For both questions explain why or why not with (counter) examples. You can assume that all data points are distinct, i.e. 8xi, xj in the dataset, xi 6= xj.

3

Problem 3 Nearest Neighbor Classification and Decision Trees (6 points)

Assume we have a decision tree in Fig. 1 which classifies x 2 C2, and C = Z n fA, Bg. In other words, A and B are integers and each dimension of x is an integer excluding A and B. Can this decision tree be implemented as a 1-NN? If so, explicitly write down each of the values you use for the 1-NN (you should use the minimal number possible). If not, either explain why or provide a counter example.

Figure 1: A decision tree example.

4

Problem 4 Decision Tree (8 points)

In this problem, you are given four 2-dimensional data points as shown in Table 1:

x1

x2

label y

0

0

0

0

1

1

1

0

1

1

1

0

Table 1: Four 2-dimensional data points and their labels.

Figure 2: The plus sign means label y = 0 and the circle means have y = 1.

4.1 Fig. 3 is a decision tree of the given data with zero training error.

Figure 3: The decision tree with zero training error.

Suppose now you have two test data points:

x1

x2

label y

0.2

0.2

0

0.2

0.8

1

What would be your test error based on decision tree in Fig. 3? (Define the test error as the fraction of

mis-classifications made on the testing set.) (2 points)

5

4.2 Now consider a new decision tree in Fig. 4. Note that the depth of the new decision tree is 1, and it does not have zero training error for the given data anymore.

Figure 4: The decision tree with depth = 1.

Given the two test data points from Question 4.1, what would be the test error using the new decision tree? (2 points)

4.3 Is the decision tree in Fig. 4 a linear or non-linear classifier in terms of (x1, x2) (Yes/No)? Can you classify the given data in Table 1 and get zero classification error by drawing a depth-1 decision tree similar to Fig. 4 (Yes/No)? Note that the decision rule should be based on a single variable (x1 or x2) in the rectangle

(e.g., x1 > 1 or x2 < 2). (2 points)

4.4 If you can put linear combination of variables in the rectangle (e.g.: ax1 + bx2 c or ax1 + bx2 < c , where a, b, c should be numbers), can you classify the given data in Table ?? and get zero classification error by drawing a depth-1 decision tree similar to Fig. 4 (Yes/No)? Please also briefly justify your answer. (2 points)

6

Problem 5 Decision Trees (12 points)

Consider a binary dataset with 400 examples, where half of them belongs to class A and the other half belongs to class B.

Next consider two decision stumps (i.e. trees with depth 1) T1 and T2, each with two children. For T1, its left child has 150 examples in class A and 50 examples in class B; for T2, its left child has 0 example in class A and 100 examples in class B. (You should infer what are in the right child.)

5.1 For each leaf of T1 and T2, compute the corresponding classification error, entropy (base e) and Gini impurity, rounding off your answer to 2 decimal places. (Note: the value/prediction of each leaf is the

majority class among all examples that belong to that leaf.) (6 points)

7

5.2 Compare the quality of T1 and T2 (that is, the two different splits of the root) based on classification error, conditional entropy (base e), and weighted Gini impurity respectively, rounding off your answer to 2

decimal places. (6 points)

8

Problem 6 Naive Bayes (12 points)

In this problem, we will try to predict whether it is suitable for playing tennis or not based on the weather condition, the emotion and the amount of homework, using Naive Bayes Classifier. You can think ‘PlayTen-nis’ is a label and ‘PlayTennis = Yes’ means it is suitable for playing tennis. We assume the probability P(Weather, Emotion, HomeworkjPlayTennis) can be factorized into the product form such that

P(Weather, Emotion, HomeworkjPlayTennis) =

P(WeatherjPlayTennis) P(EmotionjPlayTennis) P(HomeworkjPlayTennis).

The training data is as following. Each data point has three attributes (Weather, Emotion, Homework),

where Weather 2

Sunny, Cloudy , Emotion 2

Happy, Normal, Unhappy , Homework 2

Much, Little .

Weather

Emotion

Homework

PlayTennis

Sunny

Happy

Little

Yes

Sunny

Normal

Little

Yes

Cloudy

Happy

Much

Yes

Cloudy

Unhappy

Little

Yes

Sunny

Unhappy

Little

No

Cloudy

Normal

Much

No

6.1 What are the probabilities of P(PlayTennis = Yes) and P(PlayTennis = No)? Each of your answer

should be an irreducible fraction. (2 points)

6.2 Write down the following conditional probabilities. Each of your answer should be an irreducible

fraction. (6 points)

P(Weather = SunnyjPlayTennis = Yes) =?

P(Emotion = NormaljPlayTennis = Yes) =?

P(Homework = MuchjPlayTennis = Yes) =?

9

6.3 Given the new data instance x = (Wheather = Sunny, Emotion = Normal, Homework = Much), which of the following has larger value: P(PlayTennis = Yesjx) or P(PlayTennis = Nojx)? Each of your

answer should be an irreducible fraction. (4 points)

10