
Submission: You need to submit one file through MarkUs[1]:

• Your answers to Questions 1, 2, and 3 as a PDF file titled hw2_writeup.pdf. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

Neatness Point: One of the 10 points will be given for neatness. You will receive this point as long as we don’t have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Weekly homeworks are individual work. See the Course Information handout[2] for detailed policies.

1. [4pts] Information Theory. The goal of this question is to help you become more familiar with the basic equalities and inequalities of information theory. They appear in many contexts in machine learning and elsewhere, so having some experience with them is quite helpful. We review some concepts from information theory, and ask you a few questions.

Recall the definition of the entropy of a discrete random variable $X$ with probability mass function $p$:

$$H(X) = \sum_x p(x) \log_2 \frac{1}{p(x)}.$$

Here the summation is over all possible values of $x \in \mathcal{X}$, which (for simplicity) we assume is finite. For example, $\mathcal{X}$ might be $\{1, 2, \ldots, N\}$.
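As a quick numerical sanity check of the entropy definition above (not part of the required proof), the sketch below computes $H(X)$ for two example distributions; the distributions themselves are arbitrary choices for illustration:

```python
import math

def entropy(p):
    """H(X) = sum_x p(x) * log2(1/p(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximal uncertainty over 4 outcomes
skewed  = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic

print(entropy(uniform))  # 2.0 bits
print(entropy(skewed))   # much smaller, but still non-negative
```

Note that each term $p(x) \log_2 \frac{1}{p(x)}$ is individually non-negative since $0 < p(x) \le 1$, which is the observation behind part (a).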

(a) [1pt] Prove that the entropy $H(X)$ is non-negative.

An important concept in information theory is the relative entropy or KL-divergence of two distributions $p$ and $q$. It is defined as

$$\mathrm{KL}(p\|q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}.$$

The KL-divergence is one of the most commonly used measures of difference (or divergence) between two distributions, and it regularly appears in information theory, machine learning, and statistics.

If two distributions are close to each other, their KL divergence is small. If they are exactly the same, their KL divergence is zero. KL divergence is not a true distance metric (since it isn’t symmetric and doesn’t satisfy the triangle inequality), but we often use it as a measure of dissimilarity between two probability distributions.
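These properties are easy to observe numerically. The sketch below (the particular distributions are arbitrary examples) checks that the KL-divergence is zero for identical distributions, small for nearby ones, and asymmetric:

```python
import math

def kl(p, q):
    """KL(p||q) = sum_x p(x) * log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q_near = [0.4, 0.35, 0.25]
q_far = [0.05, 0.05, 0.9]

print(kl(p, p))       # 0.0: identical distributions
print(kl(p, q_near))  # small positive value
print(kl(p, q_far))   # larger value
print(kl(p, q_near) == kl(q_near, p))  # False: not symmetric
```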

(b) [2pt] Prove that $\mathrm{KL}(p\|q)$ is non-negative. Hint: you may want to use Jensen's Inequality, which is described in the Appendix.

(c) [1pt] The Information Gain or Mutual Information between $X$ and $Y$ is $I(Y; X) = H(Y) - H(Y|X)$. Show that

$$I(Y; X) = \mathrm{KL}(p(x, y)\|p(x)p(y)),$$

where $p(x) = \sum_y p(x, y)$ is the marginal distribution of $X$.
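The identity in part (c) can be verified numerically for a small joint distribution (the joint table below is an arbitrary example):

```python
import math

# Joint distribution p(x, y) over x in {0, 1}, y in {0, 1} (rows: x, cols: y).
joint = [[0.3, 0.1],
         [0.2, 0.4]]

px = [sum(row) for row in joint]                              # marginal of X
py = [sum(joint[x][y] for x in range(2)) for y in range(2)]   # marginal of Y

def H(dist):
    return sum(-p * math.log2(p) for p in dist if p > 0)

# H(Y|X) = sum_x p(x) * H(Y | X = x)
HY_given_X = sum(px[x] * H([joint[x][y] / px[x] for y in range(2)])
                 for x in range(2))
I_def = H(py) - HY_given_X            # I(Y;X) = H(Y) - H(Y|X)

# KL(p(x,y) || p(x)p(y))
I_kl = sum(joint[x][y] * math.log2(joint[x][y] / (px[x] * py[y]))
           for x in range(2) for y in range(2) if joint[x][y] > 0)

print(abs(I_def - I_kl) < 1e-12)  # True: the two expressions agree
```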

[1] https://markus.teach.cs.toronto.edu/csc411-2018-09

[2] http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/syllabus.pdf


2. [2pts] Benefit of Averaging. Consider $m$ estimators $h_1, \ldots, h_m$, each of which accepts an input $x$ and produces an output $y$, i.e., $y_i = h_i(x)$. These estimators might be generated through a Bagging procedure, but that is not necessary for the result we want to prove. Consider the squared error loss function $L(y, t) = \frac{1}{2}(y - t)^2$. Show that the loss of the average estimator

$$\bar{h}(x) = \frac{1}{m} \sum_{i=1}^{m} h_i(x)$$

is smaller than the average loss of the estimators. That is, for any x and t, we have

$$L(\bar{h}(x), t) \le \frac{1}{m} \sum_{i=1}^{m} L(h_i(x), t).$$

Hint: you may want to use Jensen’s Inequality, which is described in the Appendix.
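A quick numerical check of this inequality (the predictions and target below are arbitrary example values, not part of the question):

```python
m = 5
t = 1.0                                   # target value
preds = [0.2, 0.9, 1.5, 2.0, 0.4]         # outputs h_i(x) of the m estimators

def L(y):
    return 0.5 * (y - t) ** 2             # squared error loss

avg_pred = sum(preds) / m                 # the average estimator h_bar(x)
loss_of_avg = L(avg_pred)
avg_of_losses = sum(L(y) for y in preds) / m

print(loss_of_avg <= avg_of_losses)  # True: Jensen's inequality with convex (y - t)^2
```

The gap between the two sides equals $\frac{1}{2}$ times the variance of the predictions, which is one way to see why the inequality is strict whenever the estimators disagree.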

3. [3pts] AdaBoost. The goal of this question is to show that the AdaBoost algorithm changes the weights in order to force the weak learner to focus on difficult data points. Here we consider the case where the target labels are from the set $\{-1, +1\}$ and the weak learner also returns a classifier whose outputs belong to $\{-1, +1\}$ (instead of $\{0, 1\}$). Consider the $t$-th iteration of AdaBoost, where the weak learner is

$$h_t \leftarrow \operatorname{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{N} w_i \mathbb{I}\{h(x^{(i)}) \neq t^{(i)}\},$$

the $w$-weighted classification error is

$$\mathrm{err}_t = \frac{\sum_{i=1}^{N} w_i \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w_i},$$

and the classifier coefficient is $\alpha_t = \frac{1}{2} \log \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}$. (Here, $\log$ denotes the natural logarithm.)

AdaBoost changes the weight of each sample depending on whether the weak learner $h_t$ classifies it correctly or incorrectly. The updated weight for sample $i$ is denoted by $w'_i$ and is

$$w'_i \leftarrow w_i \exp\left(-\alpha_t\, t^{(i)} h_t(x^{(i)})\right).$$

Show that the error w.r.t. $(w'_1, \ldots, w'_N)$ is exactly $\frac{1}{2}$. That is, show that

$$\mathrm{err}'_t = \frac{\sum_{i=1}^{N} w'_i \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w'_i} = \frac{1}{2}.$$

Note that here we use the weak learner of iteration $t$ and evaluate it according to the new weights, which will be used to learn the $(t+1)$-st weak learner. What is the interpretation of this result?
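The weight update can be simulated on a toy dataset to see the claimed result numerically (the labels and weak-learner outputs below are made up for illustration):

```python
import math

# Toy setup: labels t_i and weak-learner outputs h_t(x_i), both in {-1, +1}.
t_lbl = [+1, +1, -1, -1, +1, -1]
h_out = [+1, -1, -1, +1, +1, -1]   # misclassifies samples 2 and 4
w = [1.0 / 6] * 6                  # current weights

miss = [int(h != t) for h, t in zip(h_out, t_lbl)]
err = sum(wi * mi for wi, mi in zip(w, miss)) / sum(w)
alpha = 0.5 * math.log((1 - err) / err)

# Weight update: w'_i = w_i * exp(-alpha * t_i * h_t(x_i))
w_new = [wi * math.exp(-alpha * ti * hi)
         for wi, ti, hi in zip(w, t_lbl, h_out)]

err_new = sum(wi * mi for wi, mi in zip(w_new, miss)) / sum(w_new)
print(err_new)  # 0.5: under the new weights, h_t is no better than chance
```

Misclassified samples have $t^{(i)} h_t(x^{(i)}) = -1$, so their weights are multiplied by $e^{\alpha_t}$, while correctly classified samples are scaled by $e^{-\alpha_t}$; the two groups end up with equal total weight.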

Tips:

• Start from $\mathrm{err}'_t$ and divide the summation into two sets: $E = \{i : h_t(x^{(i)}) \neq t^{(i)}\}$ and its complement $E^c = \{i : h_t(x^{(i)}) = t^{(i)}\}$.

• Note that $\frac{\sum_{i \in E} w_i}{\sum_{i=1}^{N} w_i} = \mathrm{err}_t$.

Appendix: Convexity and Jensen’s Inequality. Here, we give some background on con- vexity which you may find useful for some of the questions in this assignment. You may assume anything given here.

Convexity is an important concept in mathematics with many uses in machine learning. We briefly define convex sets and functions and state some of their properties here. These properties are useful in solving some of the questions in the rest of this homework. If you are interested in learning more about convexity, refer to Boyd and Vandenberghe, Convex Optimization, 2004.

A set $C$ is convex if the line segment between any two points in $C$ lies within $C$, i.e., if for any $x_1, x_2 \in C$ and for any $0 \le \lambda \le 1$, we have

$$\lambda x_1 + (1 - \lambda) x_2 \in C.$$

For example, a cube or a sphere in $\mathbb{R}^d$ is a convex set, but a cross (a shape like X) is not.

A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if its domain is a convex set and if for all $x_1, x_2$ in its domain, and for any $0 \le \lambda \le 1$, we have

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$

This inequality means that the line segment between $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the graph of $f$. A convex function looks like ∪. We say that $f$ is concave if $-f$ is convex. A concave function looks like ∩.

Some examples of convex and concave functions are (you do not need to use most of them in your homework, but knowing them is useful):

• Powers: $x^p$ is convex on the set of positive real numbers when $p \ge 1$ or $p \le 0$. It is concave for $0 \le p \le 1$.

• Exponential: $e^{ax}$ is convex on $\mathbb{R}$, for any $a \in \mathbb{R}$.

• Logarithm: $\log(x)$ is concave on the set of positive real numbers.

• Norms: Every norm on $\mathbb{R}^d$ is convex.

• Max function: $f(x) = \max\{x_1, x_2, \ldots, x_d\}$ is convex on $\mathbb{R}^d$.

• Log-sum-exp: The function $f(x) = \log(e^{x_1} + \ldots + e^{x_d})$ is convex on $\mathbb{R}^d$.

An important property of convex and concave functions, which you may need to use in your homework, is Jensen’s inequality. Jensen’s inequality states that if φ(x) is a convex function of x, we have

$$\phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)].$$

In words, if we apply a convex function to the expectation of a random variable, it is less than or equal to the expected value of that convex function when its argument is the random variable. If the function is concave, the direction of the inequality is reversed.
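A quick empirical illustration of Jensen's inequality, using the convex function $e^x$ and Gaussian samples (both arbitrary choices for the demonstration):

```python
import math
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # samples of X

phi = math.exp                            # a convex function
lhs = phi(sum(xs) / len(xs))              # phi(E[X]), estimated from samples
rhs = sum(phi(x) for x in xs) / len(xs)   # E[phi(X)], estimated from samples

print(lhs <= rhs)  # True for convex phi
```

For this choice, $\mathbb{E}[e^X] = e^{1/2} \approx 1.65$ for a standard normal $X$, while $e^{\mathbb{E}[X]} = e^0 = 1$, so the gap is substantial.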

Jensen's inequality has a physical interpretation: Consider a set $X = \{x_1, \ldots, x_N\}$ of points on $\mathbb{R}$. Corresponding to each point, we have a probability $p(x_i)$. If we interpret the probability as mass, and we put an object with mass $p(x_i)$ at location $(x_i, \phi(x_i))$, then the centre of gravity of these objects, which is in $\mathbb{R}^2$, is located at the point $(\mathbb{E}[X], \mathbb{E}[\phi(X)])$. If $\phi$ is convex (∪), the centre of gravity lies above the curve $x \mapsto \phi(x)$, and vice versa for a concave function (∩).