Instructions
Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named ‘Theory Assignment 1’ set up in which you can submit your files. Please be sure to follow all directions outlined here.
You can submit multiple times, but only the last submission counts. That means if you finish some problems and want to submit something first and update later when you finish, that's fine. In fact, you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (remember Murphy's Law), you still get credit for whatever you have turned in.
Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submission must be a single PDF. It is strongly recommended that you typeset with LaTeX. There are many free integrated LaTeX editors that are convenient to use (e.g., Overleaf, ShareLaTeX). Choose the one(s) you like the most. The tutorial Getting to Grips with LaTeX is a good start if you do not know how to use LaTeX yet.
Please also follow the rules below:

The file should be named firstname_lastname_USCID.pdf (e.g., Don_Quijote_de_la_Mancha_8675309045.pdf).

Do not have any spaces in your file name when uploading it.

Please include your name and USCID in the header of your report as well.
Collaboration: You may discuss with your classmates. However, you need to write your own solutions and submit separately. Also, in your report, you need to list, for each problem, the classmates with whom you discussed. Please consult the syllabus for what is and is not acceptable collaboration. Review the rules on academic conduct in the syllabus: a single instance of plagiarism can adversely affect you significantly more than you could stand to gain.
Notes on notation:

Unless stated otherwise, scalars are denoted by lowercase letters in normal font, vectors by lowercase letters in bold font, and matrices by capital letters in bold font.

‖·‖ denotes the L2 norm unless specified otherwise, i.e., ‖·‖ = ‖·‖_{2}.
Problem 1 Nearest Neighbor Classification (8 points)
In class, we used the Euclidean distance as the distance metric for nearest neighbor classification. Given data x ∈ ℝ^{D}, the Euclidean distance between x_{i} and x_{j} is defined as follows:

E(x_{i}, x_{j}) = ‖x_{i} − x_{j}‖_{2}^{2} = ∑_{d=1}^{D} (x_{id} − x_{jd})^{2}.    (1)
In some applications, such as information retrieval and natural language processing, the cosine distance is widely applied. It is defined as:

C(x_{i}, x_{j}) = 1 − x_{i}^{T} x_{j} / (‖x_{i}‖_{2} ‖x_{j}‖_{2}) = 1 − (∑_{d=1}^{D} x_{id} x_{jd}) / (‖x_{i}‖_{2} ‖x_{j}‖_{2}),    (2)

where the norm of x is defined as

‖x‖_{2} = √(∑_{d=1}^{D} x_{d}^{2}).    (3)
Now you are asked to prove that for any x_{i} and x_{j} normalized to the unit norm, i.e. ‖x_{i}‖_{2} = ‖x_{j}‖_{2} = 1, changing the distance metric from the Euclidean distance to the cosine distance will NOT affect the nearest neighbor classification results. Specifically, for any x_{i}, x_{j} and x_{o}, show that if C(x_{i}, x_{j}) ≤ C(x_{i}, x_{o}), then E(x_{i}, x_{j}) ≤ E(x_{i}, x_{o}), where ‖x_{i}‖_{2} = ‖x_{j}‖_{2} = ‖x_{o}‖_{2} = 1.
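(This does not replace the proof, but you can check the claim numerically before proving it. A minimal sketch in plain Python; the helper names `euclid_sq`, `cosine_dist`, and `unit` are our own, not part of the assignment.)

```python
import math
import random

def euclid_sq(a, b):
    # E(x_i, x_j) = sum_d (a_d - b_d)^2, the squared Euclidean distance of Eq. (1)
    return sum((ad - bd) ** 2 for ad, bd in zip(a, b))

def cosine_dist(a, b):
    # C(x_i, x_j) = 1 - a.b / (||a|| ||b||), the cosine distance of Eq. (2)
    dot = sum(ad * bd for ad, bd in zip(a, b))
    na = math.sqrt(sum(ad ** 2 for ad in a))
    nb = math.sqrt(sum(bd ** 2 for bd in b))
    return 1.0 - dot / (na * nb)

def unit(v):
    # Normalize v to unit L2 norm
    n = math.sqrt(sum(x ** 2 for x in v))
    return [x / n for x in v]

random.seed(0)
for _ in range(1000):
    x = unit([random.gauss(0, 1) for _ in range(5)])
    y = unit([random.gauss(0, 1) for _ in range(5)])
    # Empirically, on unit vectors E = 2C, so the two distances order
    # neighbors identically:
    assert abs(euclid_sq(x, y) - 2 * cosine_dist(x, y)) < 1e-9
```

If the assertion holds on many random unit vectors, that suggests the algebraic relationship you should establish in the proof.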
Problem 2 Nearest Neighbor Classification and Decision Trees (4 points)
Assume we have a dataset in which each data point x is a 100-dimensional binary vector, i.e. x ∈ {0, 1}^{100}. Can we have a decision tree that classifies the dataset with zero classification error? Can we specify a 1NN that results in exactly the same classification as our decision tree? For both questions, explain why or why not with (counter)examples. You can assume that all data points are distinct, i.e. ∀ x_{i}, x_{j} in the dataset, x_{i} ≠ x_{j}.
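(To build intuition for how a 1NN behaves on distinct binary vectors, here is a small sketch using Hamming distance, which orders {0, 1} vectors the same way as Euclidean distance since ‖a − b‖_{2}^{2} equals the Hamming distance on binary vectors. The 4-bit toy data, labels, and helper names are hypothetical, chosen only for illustration.)

```python
def hamming(a, b):
    # Number of coordinates in which two binary vectors differ
    return sum(ai != bi for ai, bi in zip(a, b))

def predict_1nn(train, query):
    # train: list of (vector, label) pairs; return the label of the
    # stored point closest to query
    return min(train, key=lambda vl: hamming(vl[0], query))[1]

# Hypothetical toy dataset of distinct 4-bit vectors (labels arbitrary):
data = [((0, 0, 1, 1), 0), ((1, 0, 1, 0), 1), ((1, 1, 1, 1), 0)]

# Each stored point is at distance 0 from itself and > 0 from every other
# distinct point, so 1NN reproduces the training labels exactly:
for x, y in data:
    assert predict_1nn(data, x) == y
```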
Problem 3 Nearest Neighbor Classification and Decision Trees (6 points)
Assume we have the decision tree in Fig. 1, which classifies x ∈ C^{2}, where C = ℤ \ {A, B}. In other words, A and B are integers and each dimension of x is an integer excluding A and B. Can this decision tree be implemented as a 1NN? If so, explicitly write down each of the values you use for the 1NN (you should use the minimal number possible). If not, either explain why or provide a counterexample.
Figure 1: A decision tree example.
Problem 4 Decision Tree (8 points)
In this problem, you are given four 2-dimensional data points as shown in Table 1:

x_{1}   x_{2}   label y
0       0       0
0       1       1
1       0       1
1       1       0

Table 1: Four 2-dimensional data points and their labels.
Figure 2: The plus sign means label y = 0 and the circle means y = 1.
4.1 Fig. 3 is a decision tree of the given data with zero training error.
Figure 3: The decision tree with zero training error.
Suppose now you have two test data points:

x_{1}   x_{2}   label y
0.2     0.2     0
0.2     0.8     1

What would be your test error based on the decision tree in Fig. 3? (Define the test error as the fraction of misclassifications made on the testing set.) (2 points)
4.2 Now consider a new decision tree in Fig. 4. Note that the depth of the new decision tree is 1, and it does not have zero training error for the given data anymore.
Figure 4: The decision tree with depth = 1.
Given the two test data points from Question 4.1, what would be the test error using the new decision tree? (2 points)
4.3 Is the decision tree in Fig. 4 a linear or nonlinear classifier in terms of (x_{1}, x_{2})? Can you classify the given data in Table 1 with zero classification error by drawing a depth-1 decision tree similar to Fig. 4 (Yes/No)? Note that the decision rule should be based on a single variable (x_{1} or x_{2}) in the rectangle (e.g., x_{1} > 1 or x_{2} < 2). (2 points)
4.4 If you can put a linear combination of variables in the rectangle (e.g., ax_{1} + bx_{2} ≥ c or ax_{1} + bx_{2} < c, where a, b, c are numbers), can you classify the given data in Table 1 with zero classification error by drawing a depth-1 decision tree similar to Fig. 4 (Yes/No)? Please also briefly justify your answer. (2 points)
Problem 5 Decision Trees (12 points)
Consider a binary dataset with 400 examples, where half of them belong to class A and the other half belong to class B.
Next consider two decision stumps (i.e. trees of depth 1), T_{1} and T_{2}, each with two children. For T_{1}, its left child has 150 examples in class A and 50 examples in class B; for T_{2}, its left child has 0 examples in class A and 100 examples in class B. (You should infer what is in the right child of each.)
5.1 For each leaf of T_{1} and T_{2}, compute the corresponding classification error, entropy (base e) and Gini impurity, rounding off your answer to 2 decimal places. (Note: the value/prediction of each leaf is the
majority class among all examples that belong to that leaf.) (6 points)
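(The three measures asked for here are mechanical functions of the class counts at a leaf. A small reference sketch; the function name and the example counts [3, 1] are our own, not taken from the problem.)

```python
import math

def impurities(counts):
    # counts: class counts at a leaf, e.g. [n_A, n_B]
    n = sum(counts)
    probs = [c / n for c in counts]
    err = 1.0 - max(probs)                                    # classification error
    entropy = -sum(p * math.log(p) for p in probs if p > 0)   # entropy, base e
    gini = 1.0 - sum(p ** 2 for p in probs)                   # Gini impurity
    return err, entropy, gini

# Hypothetical leaf with 3 examples of one class and 1 of the other:
err, ent, gini = impurities([3, 1])
assert abs(err - 0.25) < 1e-12
assert abs(gini - 0.375) < 1e-12
```

To compare two splits of the root (as in 5.2), weight each leaf's impurity by the fraction of examples reaching that leaf and sum.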
5.2 Compare the quality of T_{1} and T_{2} (that is, the two different splits of the root) based on classification error, conditional entropy (base e), and weighted Gini impurity respectively, rounding off your answer to 2
decimal places. (6 points)
Problem 6 Naive Bayes (12 points)
In this problem, we will try to predict whether or not it is suitable for playing tennis, based on the weather condition, the emotion, and the amount of homework, using a Naive Bayes classifier. You can think of 'PlayTennis' as the label, where 'PlayTennis = Yes' means it is suitable for playing tennis. We assume the probability P(Weather, Emotion, Homework | PlayTennis) can be factorized into the product form
P(Weather, Emotion, Homework | PlayTennis) =
P(Weather | PlayTennis) P(Emotion | PlayTennis) P(Homework | PlayTennis).
The training data is as follows. Each data point has three attributes (Weather, Emotion, Homework), where Weather ∈ {Sunny, Cloudy}, Emotion ∈ {Happy, Normal, Unhappy}, Homework ∈ {Much, Little}.
Weather   Emotion   Homework   PlayTennis
Sunny     Happy     Little     Yes
Sunny     Normal    Little     Yes
Cloudy    Happy     Much       Yes
Cloudy    Unhappy   Little     Yes
Sunny     Unhappy   Little     No
Cloudy    Normal    Much       No
6.1 What are the probabilities P(PlayTennis = Yes) and P(PlayTennis = No)? Each of your answers should be an irreducible fraction. (2 points)
6.2 Write down the following conditional probabilities. Each of your answers should be an irreducible fraction. (6 points)
• P(Weather = Sunny | PlayTennis = Yes) = ?
• P(Emotion = Normal | PlayTennis = Yes) = ?
• P(Homework = Much | PlayTennis = Yes) = ?
6.3 Given the new data instance x = (Weather = Sunny, Emotion = Normal, Homework = Much), which of the following has the larger value: P(PlayTennis = Yes | x) or P(PlayTennis = No | x)? Each of your answers should be an irreducible fraction. (4 points)
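(All quantities in this problem come from simple counting. As an illustration only, run on a hypothetical two-attribute toy dataset rather than the table above, here is a sketch of maximum-likelihood Naive Bayes estimation with exact fractions; every name in it is our own.)

```python
from collections import Counter
from fractions import Fraction

def nb_estimates(data):
    # data: list of (attribute_tuple, label) pairs.
    # Returns maximum-likelihood priors P(y) and conditionals P(x_i = v | y)
    # as exact Fractions, so every estimate is an irreducible fraction.
    label_counts = Counter(y for _, y in data)
    n = len(data)
    prior = {y: Fraction(c, n) for y, c in label_counts.items()}
    joint = Counter((i, v, y) for x, y in data for i, v in enumerate(x))
    cond = {k: Fraction(c, label_counts[k[2]]) for k, c in joint.items()}
    return prior, cond

def unnormalized_posterior(prior, cond, x, y):
    # P(y) * prod_i P(x_i | y); comparing this across labels y gives the
    # Naive Bayes prediction (the shared normalizer P(x) cancels).
    p = prior[y]
    for i, v in enumerate(x):
        p *= cond.get((i, v, y), Fraction(0))
    return p

# Hypothetical two-attribute toy dataset (NOT the table above):
data = [(("S", "H"), "Y"), (("S", "N"), "Y"), (("C", "H"), "N"), (("C", "N"), "N")]
prior, cond = nb_estimates(data)
assert prior["Y"] == Fraction(1, 2)
assert cond[(0, "S", "Y")] == Fraction(1, 1)  # P(attr0 = S | Y) = 1
```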