Project #4: Regression Analysis Solution




Regression analysis is a statistical procedure for estimating the relationship between a target variable and a set of potentially relevant variables. In this project, we explore basic regression models on a given dataset, along with basic techniques for handling overfitting, namely cross-validation and regularization. With cross-validation we test for overfitting, while with regularization we penalize overly complex models.


We use a Network Backup Dataset, which consists of simulated traffic data on a backup system over a network. The system monitors the files residing on a destination machine and copies their changes in four-hour cycles. At the end of each backup process, the size of the data moved to the destination and the duration of the backup are logged, to be used for developing prediction models. We define a workflow as a task that backs up data from a group of files that have similar patterns of change in size over time. The dataset has around 18000 data points with the following columns/variables:

Week index

Day of the week on which the file backup started

Backup start time: hour of the day

Workflow ID

File name

Backup size: the size of the file that is backed up in that cycle, in GB

Backup time: the duration of the backup procedure, in hours

Problem Statement

  1. Load the dataset. You can download the dataset from this link. To get an idea of the types of relationships in your dataset:


  (a) For a twenty-day period (X-axis unit is day number), plot the backup sizes for all workflows (color coded) on the Y-axis.

  (b) Do the same plot for the first 105-day period.

Can you identify any repeating patterns?
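The aggregation behind these plots can be sketched as follows. This is a minimal sketch under stated assumptions: the column names (`week`, `day_of_week`, `workflow`, `size_gb`) are hypothetical stand-ins for the real CSV header, the week index is assumed to start at 1, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical mapping; assumes the day column stores English day names.
DAY_INDEX = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
             "Friday": 4, "Saturday": 5, "Sunday": 6}

def daily_backup_size(df, n_days):
    """Total backup size per (day number, workflow) for the first n_days."""
    out = df.copy()
    # Day number counts from 1: week 1 Monday -> 1, week 2 Monday -> 8.
    out["day_number"] = (out["week"] - 1) * 7 + out["day_of_week"].map(DAY_INDEX) + 1
    out = out[out["day_number"] <= n_days]
    # Rows: day number, columns: workflow, values: summed size in GB.
    return out.pivot_table(index="day_number", columns="workflow",
                           values="size_gb", aggfunc="sum", fill_value=0)

# Made-up rows using the hypothetical column names.
toy = pd.DataFrame({
    "week": [1, 1, 1, 2],
    "day_of_week": ["Monday", "Monday", "Tuesday", "Monday"],
    "workflow": ["work_flow_0", "work_flow_1", "work_flow_0", "work_flow_0"],
    "size_gb": [0.5, 0.25, 0.125, 1.0],
})
table = daily_backup_size(toy, n_days=20)
```

With matplotlib installed, `table.plot()` would draw one colored line per workflow against the day number; calling the function with `n_days=105` gives the longer window.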

  2. Predict the backup size of a file given the other attributes. We use all attributes, except Backup time, as candidate features for the prediction of backup size.

We will try different feature sets, as well as different encoding schemes for each feature and in combination. For example, each of the five features is a categorical variable: day of the week, hour of the day, workflow number, file type, and week number.

For each categorical variable, we could convert it into a one-dimensional numerical value. For example, the Day of the Week variable could take on values 1, ..., 7 corresponding to Monday through Sunday. Similarly, the Hour of the Day could be encoded as 1, ..., 24. We will refer to this as scalar encoding.

For each categorical variable that takes one of M values, we can also encode it as an M-dimensional vector in which only one entry is 1 and the rest are 0. Thus for the Day of the Week, Monday could be encoded as [1, 0, 0, 0, 0, 0, 0] and Sunday as [0, 0, 0, 0, 0, 0, 1]. We will refer to this encoding as one-hot encoding.

Now for the five variables, when looked at as a set, we have 32 (= 2^5) possible combinations, where in each combination only a subset of the features is encoded using one-hot encoding and the rest of the features are encoded using scalar encoding.
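One possible way to enumerate the 32 encoding combinations is sketched below. The helper names `one_hot` and `encode` are hypothetical, the toy rows are made up, and the features are assumed to be mapped to integers already. Note that this simple one-hot helper only creates columns for values that actually occur in the data.

```python
from itertools import product
import numpy as np

def one_hot(column):
    """Encode an integer column with M distinct values as an (n, M) 0/1 matrix."""
    values, inverse = np.unique(column, return_inverse=True)
    return np.eye(len(values))[inverse]

def encode(X, mask):
    """Build a design matrix: one-hot where mask[j] is 1, scalar otherwise."""
    blocks = []
    for j, use_one_hot in enumerate(mask):
        col = X[:, j]
        blocks.append(one_hot(col) if use_one_hot
                      else col.reshape(-1, 1).astype(float))
    return np.hstack(blocks)

# All 2^5 = 32 ways to choose which of the five features are one-hot encoded.
masks = list(product([0, 1], repeat=5))

# Toy rows: (week, day of week, hour, workflow, file), already mapped to integers.
X = np.array([[1, 1, 1, 0, 0],
              [1, 7, 5, 1, 3],
              [2, 3, 9, 0, 1]])
scalar_only = encode(X, masks[0])    # mask (0, 0, 0, 0, 0)
one_hot_all = encode(X, masks[-1])   # mask (1, 1, 1, 1, 1)
```

Looping over `masks` and fitting one model per design matrix gives the 32 candidate models.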

For parts (a)-(e), for each model you need to report the training and test RMSE from 10-fold cross-validation as a basic evaluation of performance. That is, for each fold you get two numbers: training RMSE and test RMSE. In addition, using the whole dataset and each model with the best parameters you have found, you need to: (i) plot fitted values against true values, scattered over the data points, and (ii) plot residuals versus fitted values, scattered over the data points. These plots visualize how well your model fits the data.
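The per-fold evaluation described above could be computed along these lines. This is a sketch on synthetic data: the helper name `cv_rmse` is hypothetical, and `LinearRegression` stands in for whichever model is being evaluated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_rmse(model, X, y, n_splits=10, seed=0):
    """Return per-fold training and test RMSE from k-fold cross-validation."""
    train_rmse, test_rmse = [], []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model.fit(X[train_idx], y[train_idx])
        for idx, scores in ((train_idx, train_rmse), (test_idx, test_rmse)):
            err = y[idx] - model.predict(X[idx])   # residuals on this split
            scores.append(float(np.sqrt(np.mean(err ** 2))))
    return train_rmse, test_rmse

# Synthetic data standing in for the encoded backup features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
train_rmse, test_rmse = cv_rmse(LinearRegression(), X, y)
```

For the two diagnostic plots, the fitted values are `model.predict(X)` on the whole dataset and the residuals are `y - model.predict(X)`.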

  (a) Fit a linear regression model. We use ordinary least squares as the penalty function:

$$\min_{\beta} \|Y - X\beta\|^2$$

where the minimization is over the coefficient vector $\beta$.

    i. First convert each categorical feature into a one-dimensional numerical value using scalar encoding (e.g. Monday to Sunday can be mapped to 1-7), and then use these values directly to fit a basic linear regression model.

    ii. Data Preprocessing: Standardize (see the Useful Functions section) all these numerical features, then fit and test the model. How does the fitting result change, as shown in the plots?
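A small sketch of the standardization step on synthetic scalar-encoded features (the feature ranges and target are made up). For plain least squares with an intercept, standardizing rescales the coefficients but should leave the fitted values unchanged up to rounding, which the comparison below illustrates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy scalar-encoded features on different scales (week 1-15, day 1-7, hour 1-24).
X = np.column_stack([rng.integers(1, 16, 300), rng.integers(1, 8, 300),
                     rng.integers(1, 25, 300)]).astype(float)
y = 0.01 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.02, size=300)

raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

pred_raw = raw.predict(X)
pred_scaled = scaled.predict(X)
# Coefficients now refer to features with zero mean and unit variance.
scaled_coefs = scaled.named_steps["linearregression"].coef_
```

Putting the scaler inside the pipeline means it is refit on each training fold during cross-validation, so no test-fold statistics leak into training.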

    iii. Feature Selection: Use the f_regression and mutual information regression measures to select the three most important variables under each measure. Report the three most important variables you find. Use those three variables to train a new linear model. Does the performance improve?
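The two scoring functions could be applied as in the sketch below, which uses scikit-learn's `f_regression` and `mutual_info_regression` on synthetic data where only the first three columns matter by construction.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Only the first three columns drive the target in this toy example.
y = 3 * X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] + rng.normal(scale=0.5, size=1000)

f_scores, _ = f_regression(X, y)                       # F statistic per feature
mi_scores = mutual_info_regression(X, y, random_state=0)

# Indices of the three highest-scoring features under each measure.
top3_f = set(np.argsort(f_scores)[-3:].tolist())
top3_mi = set(np.argsort(mi_scores)[-3:].tolist())
```

The F statistic only captures linear dependence, while mutual information can also pick up nonlinear dependence, so the two rankings need not agree on real data.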

    iv. Feature Encoding: As explained in the preceding discussion, there are 32 possible combinations of encodings for the five categorical variables. Plot the average training RMSE and test RMSE for each combination (in the range 1 to 32). Which combinations achieve the best performance? Can you provide an intuitive explanation?

    v. Controlling ill-conditioning and overfitting: You should have found obvious increases in test RMSE compared to training RMSE for some combinations. Can you explain why this happens? Examine the fitted coefficients. To address this problem, you can try the following regularizers with suitable parameters.

1. Ridge Regularizer: $\min_{\beta} \|Y - X\beta\|^2 + \lambda \|\beta\|_2^2$

2. Lasso Regularizer: $\min_{\beta} \|Y - X\beta\|^2 + \lambda \|\beta\|_1$

3. Elastic Net Regularizer: $\min_{\beta} \|Y - X\beta\|^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$ (optional)

For any choice of the hyper-parameter(s) (i.e., $\lambda$, $\lambda_1$, $\lambda_2$) you will find one of the 32 models with the lowest test RMSE. Optimize over the choices of $\lambda$, $\lambda_1$, $\lambda_2$ to pick one good model. Compare the values of the estimated coefficients for these regularized good models with those of the unregularized best model.
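The effect of the three regularizers on the coefficients can be illustrated with scikit-learn's `Ridge`, `Lasso`, and `ElasticNet`. This is a sketch on synthetic, deliberately ill-conditioned data, and the penalty values are arbitrary. Note that scikit-learn scales its objectives differently from the formulas above: `Lasso` and `ElasticNet` divide the squared error by twice the number of samples, and `ElasticNet` is parameterized by `alpha` and `l1_ratio`, so its settings do not map one-to-one onto the hyper-parameters in the formulas.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Make column 9 a near copy of column 0 so the design is ill-conditioned.
X[:, 9] = X[:, 0] + rng.normal(scale=1e-3, size=200)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=200)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.5),
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),
}
coef_norm, n_zero = {}, {}
for name, model in models.items():
    model.fit(X, y)
    coef_norm[name] = float(np.linalg.norm(model.coef_))  # size of coefficients
    n_zero[name] = int(np.sum(model.coef_ == 0))          # exactly-zero entries
```

Comparing `coef_norm` and `n_zero` across the models shows the ridge penalty shrinking the coefficients and the lasso penalty setting some of them exactly to zero.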

  (b) Use a random forest regression model for this same task.

Feature importance in the random forest algorithm: During the training process, for each node, a branching decision is made based on only one feature that minimizes a chosen measure of impurity. For classification, it is typically Gini impurity or information gain (entropy), and for regression it is variance (see lecture notes). The importance of each feature is the decrease in variance averaged over all node splits on this feature in the forest, weighted by the number of samples each node splits.

Out of bag error: In random forest regression, since we use bootstrapping, it is easier and faster to evaluate the generalization ability. For each tree, only a subset of the data set is used to build it (because of sampling), so the data points that are left out can be used as the test set. One can then define a prediction RMSE for each tree and average over all trees. In sklearn random forest regression, oob_score_ returns the out-of-bag R^2 score, so you can calculate 1 - oob_score_ as the out-of-bag error.

Set the parameters of your model with the following initial values.

Number of trees: 20

Depth of each tree: 4

Bootstrap: True

Maximum number of features: 5

Recall that a Random Forest model can handle categorical variables without having to use one-hot or scalar encodings.
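The initial model and its out-of-bag error could be set up as in this sketch. The features and target are synthetic, and integer codes stand in for the categorical columns, since scikit-learn's `RandomForestRegressor` expects numeric input.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 600
# Toy integer-coded features: week, day of week, hour, workflow ID, file ID.
X = np.column_stack([rng.integers(1, 16, n), rng.integers(1, 8, n),
                     rng.integers(1, 25, n), rng.integers(0, 5, n),
                     rng.integers(0, 30, n)])
# Synthetic target: size depends on the workflow and on weekday vs weekend.
y = 0.1 * X[:, 3] + 0.2 * (X[:, 1] <= 5) + rng.normal(scale=0.01, size=n)

forest = RandomForestRegressor(n_estimators=20, max_depth=4, bootstrap=True,
                               max_features=5, oob_score=True, random_state=0)
forest.fit(X, y)

oob_error = 1.0 - forest.oob_score_        # 1 - out-of-bag R^2
importances = forest.feature_importances_  # one value per feature, sums to 1
```

With very few trees, some samples may never be left out of a bootstrap sample, in which case scikit-learn warns that the out-of-bag estimate is unreliable.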

    i. Report the training and average test RMSE from 10-fold cross-validation (sum up each fold's squared error, divide by the total number of data points, then take the square root) and the out-of-bag error you get from this initial model.


    ii. Sweep over the number of trees from 1 to 200 and the maximum number of features from 1 to 5. Plot figure 1 for out-of-bag error (Y-axis) against number of trees (X-axis), and figure 2 for average test RMSE (Y-axis) against number of trees (X-axis).

    iii. Pick another parameter you want to experiment with. Plot figures similar to figure 1 and figure 2 above. What parameters would you pick to achieve the best performance?

    iv. Report the feature importances you get from the best random forest regression model you find.

    v. Visualize your decision trees. Pick any tree (estimator) in the best random forest (with max_depth=4) and plot its structure. Which feature is the root node in this decision tree? Is it the most important feature according to the feature importances reported by the regressor?
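One lightweight way to inspect a single tree is scikit-learn's `export_text`, sketched below on synthetic data. The feature names are hypothetical, and the target depends only on the workflow column by construction, so the root split is predictable here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

feature_names = ["week", "day_of_week", "hour", "workflow", "file"]  # hypothetical
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 16, 400), rng.integers(1, 8, 400),
                     rng.integers(1, 25, 400), rng.integers(0, 5, 400),
                     rng.integers(0, 30, 400)])
y = 0.1 * X[:, 3] + rng.normal(scale=0.01, size=400)  # synthetic target

forest = RandomForestRegressor(n_estimators=20, max_depth=4, max_features=5,
                               random_state=0).fit(X, y)
tree = forest.estimators_[0]

# Text rendering of the tree; the first line is the split at the root node.
structure = export_text(tree, feature_names=feature_names)
root_feature = feature_names[tree.tree_.feature[0]]
top_feature = feature_names[int(np.argmax(forest.feature_importances_))]
```

Printing `structure` shows the nested splits. On real data, `root_feature` and `top_feature` can differ, because importances are averaged over all trees and all splits rather than read from one root.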

  (c) Now use a neural network regression model (one hidden layer) with all features one-hot encoded. Parameters:

Number of hidden units

Activation function (relu, logistic, tanh)

Plot test RMSE as a function of the number of hidden units for different activation functions. Report the best combination.
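The sweep over hidden units and activation functions might look like the following sketch with scikit-learn's `MLPRegressor`. The data is a synthetic one-hot design, and the grid of hidden-unit counts is deliberately tiny to keep the run short.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Toy one-hot design: 300 rows, one categorical feature with 5 levels.
levels = rng.integers(0, 5, 300)
X = np.eye(5)[levels]
y = 0.1 * levels + rng.normal(scale=0.01, size=300)

test_rmse = {}
for activation in ("relu", "logistic", "tanh"):
    for n_hidden in (2, 10, 50):
        model = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation=activation,
                             max_iter=300, random_state=0)
        # scikit-learn reports negated MSE; flip the sign before the square root.
        neg_mse = cross_val_score(model, X, y, cv=10,
                                  scoring="neg_mean_squared_error")
        test_rmse[(activation, n_hidden)] = float(np.sqrt(-neg_mse.mean()))

best = min(test_rmse, key=test_rmse.get)   # (activation, n_hidden) with lowest RMSE
```

The optimizer may emit convergence warnings at this iteration budget; raising `max_iter` is one way to address them.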

  (d) Predict the backup size for each of the workflows separately.

    i. Use a linear regression model. Is the fit improved? Explain.

    ii. Try fitting a more complex regression function to your data. You can try a polynomial function of your variables. Try increasing the degree of the polynomial to improve your fit. Again, use 10-fold cross-validation to evaluate your results. Plot the average training and test RMSE of the trained model against the degree of the polynomial you use. Can you find a threshold on the degree of the fitted polynomial beyond which the generalization error of your model gets worse? Can you explain how cross-validation helps control the complexity of your model?
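A sketch of the degree sweep, using `PolynomialFeatures` in a pipeline with `LinearRegression`. The data is synthetic with a quadratic ground truth, so the degree at which test error stops improving is known in advance; on the real per-workflow data the threshold has to be read off the plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
# Quadratic ground truth plus noise.
y = 1.0 + X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=200)

train_rmse, test_rmse = {}, {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_validate(model, X, y, cv=10, return_train_score=True,
                            scoring="neg_mean_squared_error")
    # Scores are negated MSE; convert the fold averages to RMSE.
    train_rmse[degree] = float(np.sqrt(-scores["train_score"].mean()))
    test_rmse[degree] = float(np.sqrt(-scores["test_score"].mean()))

best_degree = min(test_rmse, key=test_rmse.get)
```

Training RMSE cannot increase as the degree grows because the models are nested, so only the test RMSE curve can reveal where extra complexity stops helping.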

  (e) Use k-nearest neighbor regression and find the best parameter.
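The search for the number of neighbors could be done with `GridSearchCV`, as in this sketch on synthetic data; the candidate values of `n_neighbors` are arbitrary.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=300)

# 10-fold cross-validated search over the number of neighbors.
search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": [1, 2, 5, 10, 20, 50]},
                      scoring="neg_mean_squared_error", cv=10)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
best_rmse = float(np.sqrt(-search.best_score_))   # score is negated MSE
```

Because k-nearest neighbors relies on distances, the choice of feature encoding and scaling affects which points count as neighbors, so it is worth repeating the search for each encoding considered.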

  3. Compare the regression models you have used and comment on them: which model is best at handling categorical features, which models handle sparse features well and which do not, and which model overall generates the best results?

  4. Useful functions

Linear Regression Model: sklearn.linear_model.LinearRegression.html

OneHotEncoder: preprocessing.OneHotEncoder.html

Random Forest Model: sklearn.ensemble.RandomForestRegressor.html

Neural Network Models: networks_supervised.html


Polynomial Transformation: sklearn.preprocessing.PolynomialFeatures.html

KNN Regressor: neighbors.KNeighborsRegressor.html

Standardization: preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

Submission: Please submit a zip file containing your report and your code, with a readme file on how to run your code, to CCLE. The zip file should be named “Project4 UID1 UID2 …” where the UIDx are the student ID numbers of the team members. One submission per team is required. If you have any questions, you can contact the TAs or post on Piazza.