Problem Set 3 (Least Squares, Ridge Regression, and Overfitting) Solution


Goals. The goal of this exercise is to:

• Implement and debug least squares.

• Implement, debug and visualize basis function models. Understand overfitting.

• Implement ridge regression.

 

Setup, data and sample code. Obtain the folder labs/ex03 of the course GitHub repository

 

github.com/epfml/ML_course

 

We will continue to use the dataset height_weight_genders.csv as well as a new dataset dataEx3.csv in this exercise. We have provided sample code that already contains useful snippets of code required for this exercise.

 

You will be working in the notebook ex03.ipynb for the exercises of this week, by filling in the corresponding functions in the provided template code.

 

 

1 Least Squares and Linear Basis Function Models

 

1.1      Least squares

 

Exercise 1:

 

Fill in the notebook function least_squares(y, tx), which implements the solution of the normal equations as discussed in class. This function should return the optimal weights and the mean squared error.
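A minimal sketch of what such a function could look like is given below. It assumes the $\frac{1}{2N}$ MSE convention (consistent with equation (2) further down); the helper name compute_mse is ours, not necessarily the template's.

```python
import numpy as np

def compute_mse(y, tx, w):
    """MSE(w) = 1/(2N) * sum_n (y_n - x_n^T w)^2 (the 1/(2N) convention)."""
    e = y - tx.dot(w)
    return e.dot(e) / (2 * len(y))

def least_squares(y, tx):
    """Solve the normal equations (tx^T tx) w = tx^T y; return (w, mse)."""
    a = tx.T.dot(tx)
    b = tx.T.dot(y)
    w = np.linalg.solve(a, b)       # avoids explicitly inverting tx^T tx
    return w, compute_mse(y, tx, w)
```

Using np.linalg.solve instead of computing the inverse of tx^T tx is both cheaper and numerically more stable.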

 

To debug your code, you can use the output of the previous exercise: run gradient descent or grid search on the height-weight data and make sure you get a similar resulting w vector from all three methods.

 

This is a useful way to debug your code, i.e. first implementing a simple method and then using it to check more complicated methods. If you have not finished Exercise 2, please first finish implementing the grid search method. If you are lagging behind, do not worry. You will get the opportunity to catch up later, but it is important that you eventually take time to finish previous exercises.

 

 

 

1.2    Least squares with a linear basis function model

 

We will now implement and visualize a basis function model for the data dataEx3.csv.

 

As explained in class, linear regression might not be directly suitable for nonlinear data. We will use polynomial basis functions to fit nonlinear data:

 

$\phi_j(x) := x^j \qquad (1)$

 

 

As we have seen in the lecture notes, the technique of feature expansion by the linear basis function model allows us to still use linear regression techniques to fit nonlinear data (recall that in our first simple setting, we assume that each input point is just one real value). As a result, we will be able to fit the data using different degrees of polynomials, e.g. a degree-two polynomial (a linear combination of $1$, $x$ and $x^2$), or a degree-three polynomial (a linear combination of $1$, $x$, $x^2$ and $x^3$), etc. Higher-degree polynomials are more expensive to compute and to fit, but can capture finer details in the data, which results in more expressive models. Think about the pros and cons of choosing a very high or very low degree.

 

To measure the fit of our model, we will use a cost function called the Root Mean Square Error (RMSE). It is related to the MSE as follows:

$\text{RMSE}(w) := \sqrt{2\,\text{MSE}(w)} \qquad (2)$

 

The magnitude of the MSE can be difficult to interpret since it involves a square, while the RMSE provides a more interpretable measure on the same scale as the error of one point. There are better measures in terms of statistical properties, like $R^2$, but we don't need these for now. See the book "Introduction to Statistical Learning" if you're interested in more details.
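As a tiny illustration of equation (2), converting an MSE value into an RMSE is a one-liner; the helper name compute_rmse below is hypothetical, not part of the template.

```python
import numpy as np

def compute_rmse(mse):
    """Convert MSE to RMSE, cf. equation (2): RMSE(w) = sqrt(2 * MSE(w))."""
    return np.sqrt(2 * mse)
```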

 

Let us now implement polynomial regression, using the technique of linear basis functions, and visualize the predictions.

 

Exercise 2:

 

The goal of this exercise is to plot the data along with predictions obtained by polynomial regression: find a good w using polynomials of degrees 1, 3, 7, and 12 respectively. You might want to reuse the function from the previous exercise to calculate the RMSE.

 

Fill in the notebook function build_poly(x, degree). The input of this function is the vector of data examples $x_n \in \mathbb{R}$ for $1 \le n \le N$. As output, the function must return the extended feature matrix

$$\tilde{\Phi} := \begin{bmatrix} \phi(x_1) \\ \vdots \\ \phi(x_N) \end{bmatrix}, \qquad \text{where } \phi(x_n) := [1,\ x_n,\ x_n^2,\ x_n^3,\ \ldots,\ x_n^{\text{degree}}],$$

that is, the matrix formed by applying the polynomial basis functions to all input data, for degrees $j = 0$ up to $j = \text{degree}$.

 

When finished, you must COPY your implementation to the separate file build_polynomial.py for the plot function to work.
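One possible sketch of build_poly (not necessarily the template's intended structure) builds the matrix column by column, one power of x per degree:

```python
import numpy as np

def build_poly(x, degree):
    """Polynomial basis expansion: N x (degree+1) matrix whose n-th row is
    [1, x_n, x_n^2, ..., x_n^degree]. Equivalent to
    np.vander(x, degree + 1, increasing=True)."""
    x = np.asarray(x).ravel()
    poly = np.ones((len(x), 1))
    for d in range(1, degree + 1):
        poly = np.c_[poly, x ** d]
    return poly
```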

 

If the code runs successfully, you will see the data and the fit. You will clearly see why linear regression is not a good fit, while polynomial regression produces a better fit.

 

Filling in the notebook function polynomial_regression(), you can see that the RMSE decreases as we increase the degree of the polynomial. Does that mean the fit gets better as we increase the degree? Which fit is the best in your view?
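A hypothetical driver loop in the spirit of polynomial_regression(), reusing the build_poly and least_squares sketches above and assuming x, y hold the loaded dataEx3.csv data, could look like this:

```python
import numpy as np

# Fit each degree on the polynomial expansion and report the training RMSE.
for degree in [1, 3, 7, 12]:
    tx = build_poly(x, degree)
    w, mse = least_squares(y, tx)
    print(f"degree={degree:2d}  RMSE={np.sqrt(2 * mse):.4f}")
```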

 

 

 

 

2 Evaluating Model Prediction Performance

 

The answer to the last question should be clear if you followed the lecture. If not, discuss with others and clarify.

 

In practice, it matters that predictions are good for unseen examples, not only for training examples. To simulate this, we will now split our dataset into two parts: training and testing. We will fit the model on the training data and compute the RMSE on both the test and training data.

 

Exercise 3:

 

The notebook function train_test_split_demo() is supposed to show the train and test errors for various polynomial degrees.

 

 


 

 

To split the data, please fill in the notebook function split_data(x, y, ratio, …). Do you think that the order of samples is important when doing the split?
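One plausible way to implement split_data is to shuffle the indices before splitting, which also answers the question about sample order when the file happens to be sorted. The seed argument below stands in for the elided parameters and is our assumption, not the template's signature.

```python
import numpy as np

def split_data(x, y, ratio, seed=1):
    """Split (x, y) into ratio*N training samples and the rest for testing.
    Shuffling first matters if the data happen to be stored in some order."""
    np.random.seed(seed)
    indices = np.random.permutation(len(y))
    split = int(np.floor(ratio * len(y)))
    idx_tr, idx_te = indices[:split], indices[split:]
    return x[idx_tr], y[idx_tr], x[idx_te], y[idx_te]
```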

 

Fill in the notebook function train_test_split_demo(). If the code runs successfully, you will see RMSE values printed for degrees 3, 7 and 12. For each degree, there are again three RMSE values, which correspond to the following three splits of the data:

 

• 90% training, 10% testing
• 50% training, 50% testing
• 10% training, 90% testing

 

Look at the training and test RMSE for degree 3. Does this make sense? Why? Discuss with others if you are unclear.

 

Now look at the RMSE for the other two degrees. Do these make sense? Why? Discuss with others if you are unclear.

 

Which split is better? Why? Refer to the lecture notes if unclear.

 

The test RMSE for degree 12 is ridiculously high for the 10%-90% split. Why do you think this is the case? The answer lies in numerical inaccuracies. Make sure you understand this.
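To see the numerical issue concretely, you can inspect the condition number of the Gram matrix used by the normal equations. The snippet below reuses the build_poly sketch above; the 5 evenly spaced points are an illustrative stand-in for the 10% training split, not the actual lab data.

```python
import numpy as np

x_small = np.linspace(0, 1, 5)          # stand-in for a ~10% training split
tx = build_poly(x_small, 12)            # 5 x 13 feature matrix
print(np.linalg.cond(tx.T @ tx))        # enormous: the normal equations are
                                        # numerically unreliable here
```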

 

BONUS: Imagine you have 5000 samples instead of 50. Which split might be better in that situation?

 

 

 

 

3 Ridge Regression

 

The previous exercise shows overfitting when using complex models. Let us now correct it using ridge regression, as discussed in class.

 

Fill in the notebook function ridge_regression(). You can debug your code by setting $\lambda = 0$: this should give essentially the same answer as the least-squares code. You can also check that for large values of $\lambda$, the RMSE should be really bad.
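A minimal sketch of ridge regression via the regularized normal equations is shown below. The exact scaling of the regularizer (here $2N\lambda$, matching the $\frac{1}{2N}$ MSE convention assumed earlier) is an assumption and may differ from the template.

```python
import numpy as np

def ridge_regression(y, tx, lambda_):
    """Ridge regression via (tx^T tx + 2 N lambda I) w = tx^T y.
    With lambda_ = 0 this reduces to the least-squares solution."""
    n, d = tx.shape
    a = tx.T.dot(tx) + 2 * n * lambda_ * np.eye(d)
    b = tx.T.dot(y)
    return np.linalg.solve(a, b)
```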

 

Play with the demo ridge_regression_demo() by choosing a 50%-50% split, and plot the train and test errors vs. $\lambda$ for polynomial degree 7. You should get a plot similar to Figure 1.
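A hypothetical version of that demo loop, reusing the sketches above (split_data, build_poly, ridge_regression, compute_mse) and assuming x, y hold the dataEx3.csv data, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

lambdas = np.logspace(-5, 0, 15)                       # lambda on a log scale
x_tr, y_tr, x_te, y_te = split_data(x, y, ratio=0.5, seed=1)
tx_tr, tx_te = build_poly(x_tr, 7), build_poly(x_te, 7)

rmse_tr, rmse_te = [], []
for lambda_ in lambdas:
    w = ridge_regression(y_tr, tx_tr, lambda_)
    rmse_tr.append(np.sqrt(2 * compute_mse(y_tr, tx_tr, w)))
    rmse_te.append(np.sqrt(2 * compute_mse(y_te, tx_te, w)))

plt.semilogx(lambdas, rmse_tr, marker="o", label="train RMSE")
plt.semilogx(lambdas, rmse_te, marker="o", label="test RMSE")
plt.xlabel("lambda"); plt.ylabel("RMSE"); plt.legend(); plt.show()
```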

 

 

 

 

 

Figure 1: Ridge Regression Demo.

 

 

 

 

 

Theory Exercises

 

  1. Warm-Up

 

• Show that the sum of two convex functions is convex. Hint: use the definition of convexity,

 

$f : \mathcal{X} \to \mathbb{R}$ is convex $\iff \forall x, y \in \mathcal{X},\ \forall \lambda \in [0, 1]:\ f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$

 

• How do you solve the linear system $Ax = b$? When is it not possible, and why? Hint: invertible matrix.

 

  • What is the computational complexity of

 

– Grid search?

– (one step of) Gradient Descent for linear regression with MSE cost?

– (one step of) Stochastic Gradient Descent for linear regression with MSE cost?

 

If needed, refresh your memory of the complexity of algebraic operations.

 

• Consider a problem with two input variables, $x = (x_1, x_2)$, and one output variable $y$. Given the two samples below, find the coefficients $w = (w_1, w_2)$ of the linear relationship $x^\top w = w_1 x_1 + w_2 x_2 = y$.

            x1     x2      y
Sample 1   400   -201    200
Sample 2  -800    401   -200

 

Do the exercise again, but with a slight change in the inputs: $x_1$ for sample 1 is now 401 instead of 400.

 

            x1     x2      y
Sample 1   401   -201    200
Sample 2  -800    401   -200

 

Compare the resulting $w = (w_1, w_2)$ for both cases. Familiarize yourself with the concept of condition number as a way to diagnose ill-conditioning. You can find condition number calculators online or use numpy.linalg.cond.
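If you want to check your hand computation and the conditioning numerically, a small numpy snippet like the following (our own illustration, not part of the template) does the job:

```python
import numpy as np

A1 = np.array([[400., -201.],
               [-800., 401.]])     # original inputs
A2 = np.array([[401., -201.],
               [-800., 401.]])     # x1 of sample 1 changed from 400 to 401
y = np.array([200., -200.])

for A in (A1, A2):
    w = np.linalg.solve(A, y)                  # coefficients (w1, w2)
    print("w =", w, "  cond(A) =", np.linalg.cond(A))
```

A large condition number means that a tiny change in the inputs can produce a very different solution, which is exactly what this exercise illustrates.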

 

2. Cost functions

 

A cost function defines how you evaluate a solution, and you might have different requirements depending on the problem. Using the MSE, if your model makes an error of 5 on a sample, you add 25/2 to the cost of your model, regardless of the target. You might want to penalize this differently if you care about the relative error: an output of 1005 when 1000 was expected might be OK, but mistaking a 6 for a 1 might not. In this case, you can use a cost function that takes the relative error with respect to the target $y_n$ into account, like this one:

$$L_n(f(x_n; w), y_n) := \frac{(f(x_n; w) - y_n)^2}{y_n^2 + \epsilon},$$

where $f$ is the model and $\epsilon$ is a small constant to avoid divisions by zero. Note that we have defined the cost function per example here. You can imagine the total cost function being defined as $L := \frac{1}{N} \sum_{n=1}^{N} L_n$.

 

• Try the function on some [prediction, target] pairs, or plot it, to see how it behaves (by hand, using Python, or the Wolfram Alpha website; no need to code).

 

• Compute its gradient, assuming a standard linear regression $f(x_n; w) := x_n^\top w$.

 

• How would you implement the gradient? Again, no need to code: try to find a formula using standard matrix operations, along with element-wise multiplication and summation/product over columns/rows (a sketch appears after this list).

 

• How sensitive is this function to outliers? Compare two cases: the target is 1, but in one case our model assigns it 10, and in the other 100 (e.g. with $\epsilon = 1$). How does the error change? Compare with the following cost function, for the $n$-th data example:

 

$$L'_n(f(x_n; w), y_n) := \big(\log(f(x_n; w) + 1) - \log(y_n + 1)\big)^2.$$

 

Note: The higher the error on a sample is, relative to the other samples, the more your model will try to fit this sample.
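A small sketch pulling these pieces together: the relative-error cost, one possible vectorized gradient for $f(x_n; w) = x_n^\top w$, and the outlier comparison with $\epsilon = 1$. Function names are ours and this is not the official solution.

```python
import numpy as np

def relative_cost(y_pred, y, eps=1.0):
    """Per-example relative-error cost (f(x_n; w) - y_n)^2 / (y_n^2 + eps)."""
    return (y_pred - y) ** 2 / (y ** 2 + eps)

def relative_cost_gradient(y, tx, w, eps=1.0):
    """Gradient of L = 1/N sum_n (x_n^T w - y_n)^2 / (y_n^2 + eps) w.r.t. w."""
    e = tx.dot(w) - y
    return 2.0 / len(y) * tx.T.dot(e / (y ** 2 + eps))

def log_cost(y_pred, y):
    """The alternative cost (log(f + 1) - log(y + 1))^2 from the last bullet."""
    return (np.log(y_pred + 1) - np.log(y + 1)) ** 2

# Outlier sensitivity with eps = 1: target 1, predictions 10 vs. 100.
for pred in (10.0, 100.0):
    print(relative_cost(pred, 1.0), log_cost(pred, 1.0))
```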

 

 

 

 
