Submission
Submit your solutions electronically on the course Gradescope site as PDF files.
If you plan to typeset your solutions, please use the LaTeX solution template. If you must submit scanned handwritten solutions, please use a black pen on blank white paper and a high-quality scanner app.
Design (specify the weights and offset for) a two-input perceptron (with an additional bias or offset term) that computes the following boolean functions. Assume T = 1 and F = −1. If a valid perceptron exists, show that it is not unique by designing another valid perceptron (with a different hyperplane, not simply through normalization). If no perceptron exists, state why.
(a) AND (b) XOR
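As a sanity check (not part of the required written solution), a candidate perceptron can be verified mechanically against the full truth table. The sketch below uses illustrative weights and bias for AND over {−1, 1} inputs; these are one of many valid choices, not the unique answer the problem asks you to justify.

```python
import numpy as np

# Hypothetical perceptron for AND with T = 1, F = -1:
# weights w = [1, 1] and bias b = -1 (illustrative; many choices work).
w = np.array([1.0, 1.0])
b = -1.0

def perceptron(x, w, b):
    """Return 1 if w.x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Check the full truth table for AND over {-1, 1} inputs.
for x1 in (-1, 1):
    for x2 in (-1, 1):
        expected = 1 if (x1 == 1 and x2 == 1) else -1
        assert perceptron(np.array([x1, x2]), w, b) == expected
```

Running the same exhaustive check on XOR is a quick way to convince yourself about part (b) before writing a formal argument.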

Logistic Regression [10 pts]
Consider the objective function that we minimize in logistic regression:

\[
J(\theta) = -\sum_{n=1}^{N} \bigl[\, y_n \log h_\theta(x_n) + (1 - y_n) \log\bigl(1 - h_\theta(x_n)\bigr) \,\bigr]
\]

Find the partial derivatives \(\partial J / \partial \theta_j\).
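Once you have derived the partial derivatives by hand, a centered finite-difference check is a standard way to confirm them numerically. The sketch below is self-contained and hypothetical: it assumes a sigmoid hypothesis h_θ(x) = σ(θᵀx) and toy data, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    """The logistic regression objective from the problem statement."""
    h = sigmoid(X.dot(theta))
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def numerical_grad(theta, X, y, eps=1e-6):
    """Centered finite differences: compare against your hand-derived gradient."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (J(theta + e, X, y) - J(theta - e, X, y)) / (2 * eps)
    return g

# Toy data (hypothetical, just to exercise the check).
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2])
print(numerical_grad(theta, X, y))
```

If your derived expression disagrees with the finite-difference estimate by more than roundoff error, recheck your algebra.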

Locally Weighted Linear Regression [10 pts]
Consider a linear regression problem in which we want to "weight" different training instances differently because some of the instances are more important than others. Specifically, suppose we want to minimize
\[
J(\theta_0, \theta_1) = \sum_{n=1}^{N} w_n \bigl(\theta_0 + \theta_1 x_{n,1} - y_n\bigr)^2.
\]

Here w_n > 0. In class, we worked out what happens for the case where all the weights (the w_n's) are the same. In this problem, we will generalize some of those ideas to the weighted setting.

Calculate the gradient by computing the partial derivatives of J with respect to each of the parameters (θ_0, θ_1).

Set each partial derivative to 0 and solve for θ_0 and θ_1 to obtain values of (θ_0, θ_1) that minimize J.
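A numerical sanity check for your closed-form answer: the minimizer of the weighted objective also solves the weighted normal equations, which can be solved directly with numpy. The data and weights below are hypothetical, chosen only to exercise the computation.

```python
import numpy as np

# Hypothetical weighted data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.2, 2.8, 4.1])
w = np.array([1.0, 2.0, 1.0, 0.5])   # per-instance weights w_n > 0

# Design matrix with a column of ones for theta_0.
X = np.column_stack([np.ones_like(x), x])

# Weighted normal equations: (X^T W X) theta = X^T W y.
W = np.diag(w)
theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(theta)   # [theta_0, theta_1]
```

Your hand-derived formulas for θ_0 and θ_1, evaluated on the same data, should match this solution up to roundoff.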
Definition 1 (Linear Program) A linear program can be stated as follows: Let A be an m × n real-valued matrix, ~b ∈ R^m, and ~c ∈ R^n. Find a ~t ∈ R^n that minimizes the linear function

\[
z(\vec{t}\,) = \vec{c}^{\,T} \vec{t}
\]

subject to

\[
A\vec{t} \geq \vec{b}.
\]
In linear programming terminology, ~c is often referred to as the cost vector and z(~t) is referred to as the objective function. We can use this framework to define the problem of learning a linear discriminant function.
The Learning Problem: Let x~_1, x~_2, ..., x~_m represent m samples, where each sample x~_i ∈ R^n is an n-dimensional vector, and ~y ∈ {−1, 1}^m is an m × 1 vector representing the respective labels of each of the m samples. Let w~ ∈ R^n be an n × 1 vector representing the weights of the linear discriminant function, and let θ be the threshold value.
We predict x~_i to be a positive example if w~^T x~_i + θ ≥ 0. On the other hand, we predict x~_i to be a negative example if w~^T x~_i + θ < 0.
We hope that the learned linear function can separate the data set. That is,

\[
y_i = \begin{cases} 1 & \text{if } \vec{w}^{\,T} \vec{x}_i + \theta \geq 0 \\ -1 & \text{if } \vec{w}^{\,T} \vec{x}_i + \theta < 0 \end{cases} \tag{1}
\]
In order to find a good linear separator, we propose the following linear program:

\[
\begin{aligned}
\min \quad & \delta \\
\text{subject to} \quad & y_i(\vec{w}^{\,T} \vec{x}_i + \theta) \geq 1 - \delta, \quad \forall (\vec{x}_i, y_i) \in D \\
& \delta \geq 0
\end{aligned} \tag{2}
\]

where D is the data set of all training examples.
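For intuition about formulation (2), it can be solved directly with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog on a tiny, hypothetical, linearly separable data set; the variable layout [w, θ, δ] and the data are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny linearly separable data set (hypothetical).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
m, n = X.shape

# Decision variables: [w (n entries), theta, delta]; minimize delta.
c = np.zeros(n + 2)
c[-1] = 1.0

# y_i (w.x_i + theta) >= 1 - delta  <=>  -y_i x_i.w - y_i theta - delta <= -1
A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.ones((m, 1))])
b_ub = -np.ones(m)

# w and theta are free; only delta is constrained to be nonnegative.
bounds = [(None, None)] * (n + 1) + [(0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[-1])  # optimal delta; near 0 for this separable data set
```

Experimenting with non-separable data here is a useful warm-up for parts (a)-(c) below.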

(a) A data set D = {(x~_i, y_i)}^m_{i=1} that satisfies condition (1) above is called linearly separable. Show that if D is linearly separable, there is an optimal solution to the linear program (2) with δ = 0.

(b) Now show the converse: if there is an optimal solution with δ = 0, then D is linearly separable.

(c) What can we say about the linear separability of the data set if there exists a hyperplane that satisfies condition (2) with δ > 0?

(d) An alternative LP formulation to (2) may be

\[
\begin{aligned}
\min \quad & \delta \\
\text{subject to} \quad & y_i(\vec{w}^{\,T} \vec{x}_i + \theta) \geq \delta, \quad \forall (\vec{x}_i, y_i) \in D \\
& \delta \geq 0
\end{aligned}
\]

Find the optimal solution to this formulation (independent of D) to illustrate the issue with such a formulation.
(e) Let x~_1 ∈ R^n with x~_1^T = (1, 1, ..., 1) and y_1 = 1. Let x~_2 ∈ R^n with x~_2^T = (−1, −1, ..., −1) and y_2 = −1.

The data set D is defined as

D = {(x~_1, y_1), (x~_2, y_2)}.

Consider the formulation in (2) applied to D. What are possible optimal solutions?
In this exercise, you will work through linear and polynomial regression. Our data consists of inputs x_n ∈ R and outputs y_n ∈ R, n ∈ {1, ..., N}, which are related through a target function y = f(x). Your goal is to learn a linear predictor h_θ(x) that best approximates f(x). But this time, rather than using scikit-learn, we will further open the "black box", and you will implement the regression model!
code and data
code : regression.py
data : regression_train.csv, regression_test.csv
This is likely the first time that many of you are working with numpy and matrix operations within a programming environment. For the uninitiated, you may find it useful to work through a numpy tutorial first. Here are some things to keep in mind as you complete this problem:
If you are seeing many errors at runtime, inspect your matrix operations to make sure that you are adding and multiplying matrices of compatible dimensions. Printing the dimensions of variables with the X.shape attribute will help you debug.
When working with numpy arrays, remember that numpy interprets the * operator as elementwise multiplication. This is a common source of size incompatibility errors. If you want matrix multiplication, you need to use the dot function. For example, A*B does elementwise multiplication while np.dot(A, B) (or A.dot(B)) does a matrix multiply.
Be careful when handling numpy vectors (rank-1 arrays): the vector shapes 1 × N, N × 1, and N are all different things. For these dimensions, we follow the conventions of scikit-learn's LinearRegression class. Most importantly, unless otherwise indicated (in the code documentation), both column and row vectors are rank-1 arrays of shape N, not rank-2 arrays of shape N × 1 or shape 1 × N.
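A short demonstration of the distinctions above (the arrays here are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])         # rank-1 array, shape (3,)
col = a.reshape(-1, 1)                # rank-2 column, shape (3, 1)
row = a.reshape(1, -1)                # rank-2 row, shape (1, 3)
print(a.shape, col.shape, row.shape)  # (3,) (3, 1) (1, 3)

B = np.arange(6.0).reshape(3, 2)      # shape (3, 2)

# * is elementwise (with broadcasting); dot is matrix multiplication.
print(a * a)           # elementwise product, still shape (3,)
print(np.dot(a, B))    # (3,) dotted with (3, 2) gives shape (2,)
```

Note that a rank-1 array behaves like a row or a column depending on context, which is exactly why the skeleton code standardizes on shape N.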
Visualization [5 pts]
As we learned last week, it is often useful to understand the data through visualizations. For this data set, you can use a scatter plot to visualize the data since it has only two properties to plot (x and y).

Visualize the training and test data using the plot_data(…) function. What do you observe? For example, can you make an educated guess on the effectiveness of linear regression in predicting the data?
Linear Regression [25 pts]
Recall that linear regression attempts to minimize the objective function

\[
J(\theta) = \sum_{n=1}^{N} \bigl(h_\theta(x_n) - y_n\bigr)^2.
\]
In this problem, we will use the matrix-vector form, where

\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad
X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad
\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_D \end{pmatrix},
\]

and each instance x_n = (1, x_{n,1}, ..., x_{n,D})^T.
In this instance, the number of input features is D = 1.
Rather than working with this fully generalized, multivariate case, let us start by considering a simple linear regression model:
h_θ(x) = θ^T x = θ_0 + θ_1 x_1
regression.py contains the skeleton code for the class PolynomialRegression. Objects of this class can be instantiated as model = PolynomialRegression(m), where m is the degree of the polynomial, so that the feature vector for instance n is (1, x_{n,1}, x_{n,1}^2, ..., x_{n,1}^m)^T. Setting m = 1 instantiates an object in which the feature vector for instance n is (1, x_{n,1})^T.

Note that to take into account the intercept term (θ_0), we can add an additional "feature" to each instance and set it to one, e.g., x_{i,0} = 1. This is equivalent to adding an additional first column to X and setting it to all ones.
Modify PolynomialRegression.generate_polynomial_features(…) to create the matrix X for a simple linear model.
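A minimal standalone sketch of the idea (the function name here is illustrative and is not the skeleton's actual method, which lives inside the class):

```python
import numpy as np

def generate_linear_features(x):
    """Prepend a column of ones so that x_{n,0} = 1 handles the intercept.

    x is a rank-1 array of shape (N,); returns X of shape (N, 2).
    """
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x])

X = generate_linear_features([2.0, 5.0, -1.0])
print(X)
# [[ 1.  2.]
#  [ 1.  5.]
#  [ 1. -1.]]
```

Inside the class, the same logic would apply to the stored degree m = 1.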

Before tackling the harder problem of training the regression model, complete PolynomialRegression.predict(…) to predict y from X and θ.

One way to solve linear regression is through gradient descent (GD).
Recall that the parameters of our model are the θ_j values. These are the values we will adjust to minimize J(θ). In gradient descent, each iteration performs the update

\[
\theta_j \leftarrow \theta_j - 2\eta \sum_{n=1}^{N} \bigl(h_\theta(x_n) - y_n\bigr)\, x_{n,j} \qquad \text{(simultaneously update } \theta_j \text{ for all } j\text{)},
\]

where η is the learning rate. With each step of gradient descent, we expect our updated parameters θ_j to come closer to the parameters that will achieve the lowest value of J(θ).
As we perform gradient descent, it is helpful to monitor the convergence by computing the cost, i.e., the value of the objective function J. Complete PolynomialRegression.cost(…) to calculate J(θ).
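The computation itself is one line of numpy. The sketch below is a standalone stand-in for the class method, checked on toy numbers (not the course data):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = sum of squared residuals (no 1/N factor, matching the objective)."""
    residuals = X.dot(theta) - y
    return np.sum(residuals ** 2)

# Toy check (hypothetical numbers, not regression_train.csv).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
print(cost(X, y, np.zeros(2)))   # 0^2 + 1^2 + 2^2 = 5.0
```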
If you have implemented everything correctly, then the following code snippet should return 40.234.
train_data = load_data('regression_train.csv')
model = PolynomialRegression()
model.coef_ = np.zeros(2)
model.cost(train_data.X, train_data.y)
Next, implement the gradient descent step in PolynomialRegression.fit_GD(…). The loop structure has been written for you, and you only need to supply the updates to θ and the new predictions ŷ = h_θ(x) within each iteration.
We will use the following specifications for the gradient descent algorithm:
- We run the algorithm for 10,000 iterations.
- We terminate the algorithm earlier if the value of the objective function is unchanged across consecutive iterations.
- We will use a fixed step size.
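The specifications above can be sketched as a standalone loop. This is not the skeleton's fit_GD (that loop is already written for you); it is a hypothetical, self-contained version of the same logic on toy data:

```python
import numpy as np

def fit_gd(X, y, eta=0.01, max_iters=10000):
    """Gradient descent on J(theta) = sum (X theta - y)^2 with a fixed step size."""
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for it in range(max_iters):
        y_hat = X.dot(theta)
        theta = theta - 2 * eta * X.T.dot(y_hat - y)   # simultaneous update of all theta_j
        c = np.sum((X.dot(theta) - y) ** 2)
        if c == prev_cost:          # objective unchanged across iterations: stop early
            break
        prev_cost = c
    return theta, it + 1

# Toy data where y = 1 + 2x exactly (hypothetical).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta, iters = fit_gd(X, y)
print(theta)   # close to [1, 2]
```

Note the update touches all components of θ at once via the vectorized X.T.dot(...), which is what "simultaneously update θ_j for all j" requires.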
So far, you have used a default learning rate (or step size) of η = 0.01. Try different values η = 10^{−4}, 10^{−3}, 10^{−2}, 0.0407, and make a table of the coefficients, the number of iterations until convergence (this number will be 10,000 if the algorithm did not converge in a smaller number of iterations), and the final value of the objective function. How do the coefficients compare? How quickly does each algorithm converge?

In class, we learned that the closed-form solution to linear regression is

\[
\theta = (X^T X)^{-1} X^T y.
\]

Using this formula, you will get an exact solution in one calculation: there is no "loop until convergence" as in gradient descent.
Implement the closed-form solution in PolynomialRegression.fit(…).
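One practical note: numerically, solving the normal equations directly is preferable to forming the matrix inverse. A standalone sketch (toy data, illustrative function name):

```python
import numpy as np

def fit_closed_form(X, y):
    """Solve the normal equations X^T X theta = X^T y.

    np.linalg.solve is more stable and faster than computing the inverse explicitly.
    """
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))

# Toy data (hypothetical).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 2.9, 5.2, 6.8])
theta = fit_closed_form(X, y)
print(theta)
```

At the least-squares optimum the residual X θ − y is orthogonal to the columns of X, which makes a convenient correctness check.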
What is the closedform solution? How do the coe cients and the cost compare to those obtained by GD? How quickly does the algorithm run compared to GD?

Finally, set a learning rate for GD that is a function of k (the number of iterations), using η_k = 1/(1 + k), and converges to the same solution yielded by the closed-form optimization (minus possible rounding errors). Update PolynomialRegression.fit_GD(…) with your proposed learning rate. How long does it take the algorithm to converge with your proposed learning rate?
Polynomial Regression [15 pts]
Now let us consider the more complicated case of polynomial regression, where our hypothesis is
h_θ(x) = θ^T φ(x) = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_m x^m.

Recall that polynomial regression can be considered an extension of linear regression in which we replace our input matrix X with

\[
\Phi = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix},
\]

where φ(x) is a function such that φ_j(x) = x^j for j = 0, ..., m.
Update PolynomialRegression.generate_polynomial_features(…) to create an (m + 1)-dimensional feature vector for each instance.
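numpy's Vandermonde helper builds exactly this feature map. A standalone sketch (illustrative function name, not the class method):

```python
import numpy as np

def generate_polynomial_features(x, m):
    """Map each scalar x_n to (1, x_n, x_n^2, ..., x_n^m)."""
    x = np.asarray(x, dtype=float)
    # np.vander with increasing=True puts the constant column first.
    return np.vander(x, N=m + 1, increasing=True)

Phi = generate_polynomial_features([1.0, 2.0, 3.0], m=2)
print(Phi)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```

With m = 1 this reduces to the ones-plus-x matrix from the linear case, so the same predict and fit code works unchanged.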

Given N training instances, it is always possible to obtain a "perfect fit" (a fit in which all the data points are exactly predicted) by setting the degree of the regression to N − 1. Of course, we would expect such a fit to generalize poorly. In the remainder of this problem, you will investigate the problem of overfitting as a function of the degree of the polynomial, m. To measure overfitting, we will use the root-mean-square (RMS) error, defined as

\[
E_{RMS} = \sqrt{J(\theta)/N},
\]

where N is the number of instances.
Why do you think we might prefer RMSE as a metric over J(θ)?
Implement PolynomialRegression.rms_error(…).
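The implementation follows directly from the definition above. A standalone stand-in for the class method, checked on toy numbers:

```python
import numpy as np

def rms_error(X, y, theta):
    """E_RMS = sqrt(J(theta) / N), with J the sum of squared residuals."""
    residuals = X.dot(theta) - y
    return np.sqrt(np.sum(residuals ** 2) / len(y))

# Toy check: residuals (1, -1) give J = 2, N = 2, so E_RMS = 1.
X = np.array([[1.0], [1.0]])
y = np.array([-1.0, 1.0])
print(rms_error(X, y, np.zeros(1)))   # 1.0
```

Because of the 1/N normalization, E_RMS is comparable across training and test sets of different sizes, which matters for the plot in the next part.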

For m = 0, ..., 10, use the closed-form solver to determine the best-fit polynomial regression model on the training data, and with this model, calculate the RMSE on both the training data and the test data. Generate a plot depicting how RMSE varies with model complexity (polynomial degree); you should generate a single plot with both training and test error, and include this plot in your writeup. Which degree polynomial would you say best fits the data? Was there evidence of under- or overfitting the data? Use your plot to justify your answer.