Introduction
Regression analysis is a statistical procedure for estimating the relationship between a target variable and a set of potentially relevant variables. In this project, we explore basic regression models on a given dataset, along with basic techniques to handle overfitting, namely cross-validation and regularization. With cross-validation we test for overfitting, while with regularization we penalize overly complex models.
Dataset
We use a Network Backup Dataset, which consists of simulated traffic data on a backup system over a network. The system monitors the files residing on a destination machine and copies their changes in four-hour cycles. At the end of each backup process, the size of the data moved to the destination and the duration of the backup are logged, to be used for developing prediction models. We define a workflow as a task that backs up data from a group of files that have similar patterns of change in size over time. The dataset has around 18000 data points with the following columns/variables:
Week index
Day of the week on which the file backup started
Backup start time: hour of the day
Workflow ID
File name
Backup size: the size of the file that is backed up in that cycle, in GB
Backup time: the duration of the backup procedure, in hours
Problem Statement

Load the dataset. You can download the dataset from this link. To get an idea of the types of relationships in your dataset:

For a twenty-day period (X-axis unit is day number), plot the backup sizes for all workflows (color coded on the Y-axis).

Do the same plot for the first 105-day period.
Can you identify any repeating patterns?
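As a starting point for these plots, the following is a minimal sketch, assuming the week index and day of the week are both integers starting at 1 and that backup sizes are summed per workflow per day (the handout does not say how to aggregate the four-hour cycles). The function names and the toy records are hypothetical.

```python
import numpy as np

def day_number(week, day_of_week):
    """Map (week index, day-of-week index), both starting at 1, to a running day number."""
    return (np.asarray(week) - 1) * 7 + np.asarray(day_of_week)

def daily_totals(days, workflow_ids, sizes, n_days):
    """Sum backup sizes per (workflow, day) for days 1..n_days."""
    days = np.asarray(days)
    workflow_ids = np.asarray(workflow_ids)
    sizes = np.asarray(sizes, dtype=float)
    workflows = np.unique(workflow_ids)            # sorted distinct workflow labels
    totals = np.zeros((len(workflows), n_days))
    keep = days <= n_days                          # drop records outside the window
    rows = np.searchsorted(workflows, workflow_ids[keep])
    np.add.at(totals, (rows, days[keep] - 1), sizes[keep])
    return workflows, totals

def plot_daily_totals(workflows, totals):
    """Draw one colored line per workflow; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    x = np.arange(1, totals.shape[1] + 1)
    for name, row in zip(workflows, totals):
        plt.plot(x, row, label=str(name))
    plt.xlabel("Day number")
    plt.ylabel("Backup size (GB)")
    plt.legend()
    plt.show()

# Toy records (hypothetical values, not taken from the dataset).
week = [1, 1, 1, 2]
dow = [1, 1, 2, 1]
wf = ["wf_0", "wf_1", "wf_0", "wf_0"]
size = [0.5, 0.25, 0.25, 1.0]
workflows, totals = daily_totals(day_number(week, dow), wf, size, n_days=20)
```

Calling `plot_daily_totals(workflows, totals)` draws the figure; changing `n_days` to 105 gives the longer window.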

Predict the backup size of a file given the other attributes. We use all attributes, except Backup time, as candidate features for the prediction of backup size.
We will try different feature sets, as well as different encoding schemes for each feature and in combination. For example, each of the five features is a categorical variable: day of the week, hour of the day, workflow number, file type, and week number.
For each categorical variable, we could convert it into a one-dimensional numerical value. For example, the Day of the Week variable could take on values 1, ..., 7 corresponding to Monday through Sunday. Similarly, the Hour of the Day could be encoded as 1, ..., 24. We will refer to this as a scalar encoding.
For each categorical variable that takes one of M values, we can also encode it as an M-dimensional vector, where only one entry is 1 and the rest are 0's. Thus for the Day of the Week, Monday could be encoded as [1, 0, 0, 0, 0, 0, 0] and Sunday as [0, 0, 0, 0, 0, 0, 1]. We will refer to this encoding as One-Hot Encoding.
Now for the five variables, when looked at as a set, we have 32 (= 2^{5}) possible combinations, where in each combination only a subset of the features is encoded using One-Hot Encoding and the rest of the features are encoded using a scalar encoding.
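The two encodings and the 32 combinations can be sketched as follows. This is an illustrative sketch, not a prescribed implementation: the helper names `one_hot` and `encode`, the toy data, and the per-feature cardinalities are all hypothetical, and the features are assumed to be integer coded already (file names and day names would first need to be mapped to integers).

```python
from itertools import product
import numpy as np

def one_hot(column):
    """Return an (n, M) 0/1 matrix for a column with M distinct values."""
    values, codes = np.unique(column, return_inverse=True)
    return np.eye(len(values))[codes]

def encode(columns, mask):
    """One-hot encode each column whose mask entry is True, scalar encode the rest."""
    parts = []
    for col, use_one_hot in zip(columns, mask):
        col = np.asarray(col)
        parts.append(one_hot(col) if use_one_hot else col.reshape(-1, 1).astype(float))
    return np.hstack(parts)

# All 2^5 = 32 choices of which of the five features are one-hot encoded.
masks = list(product([False, True], repeat=5))

# Toy data: five integer-coded features (hypothetical cardinalities and values).
rng = np.random.default_rng(0)
cardinalities = [7, 6, 5, 30, 15]
columns = [rng.integers(1, m + 1, size=200) for m in cardinalities]

X_scalar = encode(columns, masks[0])    # every feature scalar encoded
X_onehot = encode(columns, masks[-1])   # every feature one-hot encoded
```

Looping over `masks` and fitting one model per design matrix gives the 32 results to compare.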
For parts (a)-(e), for each model you need to report training and test RMSE from 10-fold cross-validation as a basic evaluation of the performance. That is, for each fold you get two numbers: training RMSE and test RMSE. In addition, you need to: (i) plot fitted values against true values, scattered over the number of data points, and (ii) plot residuals versus fitted values, scattered over the number of data points, using the whole dataset for each model with the best parameters you have found. These plots visualize how well your model fits the data.
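One way to collect the per-fold numbers is sketched below, assuming scikit-learn's `KFold` with shuffling (the handout does not say whether folds should be shuffled). The helper name `cv_rmse` and the synthetic data are hypothetical. The helper also returns a pooled test RMSE, computed by summing the squared errors of all folds, dividing by the number of data points, and taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_rmse(model, X, y, n_splits=10, seed=0):
    """Return per-fold train RMSEs, per-fold test RMSEs, and the pooled test RMSE."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_rmse, test_rmse, sq_err_sum = [], [], 0.0
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        train_res = y[train_idx] - model.predict(X[train_idx])
        test_res = y[test_idx] - model.predict(X[test_idx])
        train_rmse.append(np.sqrt(np.mean(train_res ** 2)))
        test_rmse.append(np.sqrt(np.mean(test_res ** 2)))
        sq_err_sum += np.sum(test_res ** 2)
    pooled = np.sqrt(sq_err_sum / len(y))
    return train_rmse, test_rmse, pooled

# Synthetic stand-in for the encoded features and backup sizes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)
train_rmse, test_rmse, pooled = cv_rmse(LinearRegression(), X, y)
```

Any estimator with `fit` and `predict` can be passed in place of `LinearRegression()`.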

Fit a linear regression model. We use ordinary least squares as the penalty function.
\min_{\beta} \|Y - X\beta\|^{2}

where the minimization is over the coefficient vector \beta.


First convert each categorical feature into one-dimensional numerical values using scalar encoding (e.g. Monday to Sunday can be mapped to 1-7), and then directly use them to fit a basic linear regression model.



Data Preprocessing: Standardize (see the Useful Functions section) all these numerical features, then fit and test the model. How does the fitting result change, as shown in the plots?



Feature Selection: Use the f_regression and mutual information regression measures to select the three most important variables, respectively. Report the three most important variables you find. Use those three variables to train a new linear model. Does the performance improve?
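Both measures are available in `sklearn.feature_selection`. The sketch below runs them on synthetic data whose target depends only on features 0, 2 and 4, which is an assumption made purely for illustration. For scalar-encoded categorical features, `mutual_info_regression` also accepts a `discrete_features` argument that may be worth setting.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Hypothetical target: only features 0, 2 and 4 carry signal.
y = 2.0 * X[:, 0] - 2.0 * X[:, 2] + 2.0 * X[:, 4] + rng.normal(scale=0.1, size=1000)

# F-test based ranking: keep the three highest-scoring features.
f_selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
top_f = sorted(int(i) for i in np.flatnonzero(f_selector.get_support()))

# Mutual-information based ranking: take the three largest scores.
mi_scores = mutual_info_regression(X, y, random_state=0)
top_mi = sorted(int(i) for i in np.argsort(mi_scores)[-3:])
```

The selected column indices can then be used to slice the feature matrix before refitting the linear model.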

Feature Encoding: As explained in the preceding discussion, there are 32 possible combinations of encoding the five categorical variables. Plot the average training RMSE and test RMSE for each combination (in the range 1 to 32). Which combinations achieve the best performance? Can you provide an intuitive explanation?

Controlling ill-conditioning and overfitting: You should have found obvious increases in test RMSE compared to training RMSE in some combinations. Can you explain why this happens? Observe the fitted coefficients. To solve this problem, you can try the following regularizations with suitable parameters.
1. Ridge Regularizer: \min_{\beta} \|Y - X\beta\|^{2} + \alpha \|\beta\|^{2}_{2}
2. Lasso Regularizer: \min_{\beta} \|Y - X\beta\|^{2} + \alpha \|\beta\|_{1}
3. Elastic Net Regularizer: \min_{\beta} \|Y - X\beta\|^{2} + \lambda_{1} \|\beta\|_{1} + \lambda_{2} \|\beta\|^{2}_{2} (optional)
For any choice of the hyperparameter(s) (i.e., \alpha, \lambda_{1}, \lambda_{2}) you will find one of the 32 models with the lowest test RMSE. Optimize over choices of \alpha, \lambda_{1}, \lambda_{2} to pick one good model. Compare the values of the estimated coefficients for these regularized good models with those of the unregularized best model.
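The three penalties have scikit-learn counterparts in `Ridge`, `Lasso` and `ElasticNet`. Note that scikit-learn scales the objectives differently from the formulas above: `Lasso` and `ElasticNet` divide the squared error by twice the number of samples, and `ElasticNet` is parameterized by `alpha` and `l1_ratio` rather than by two separate weights, so hyperparameter values do not transfer one to one. The sketch below uses synthetic data with a nearly duplicated column to mimic ill-conditioning; the data and the chosen penalty strengths are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Column 9 is almost a copy of column 8, which makes X'X nearly singular.
X[:, 9] = X[:, 8] + 1e-6 * rng.normal(size=200)
y = X[:, 0] + 2.0 * X[:, 8] + rng.normal(scale=0.1, size=200)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.05),
    "elastic_net": ElasticNet(alpha=0.05, l1_ratio=0.5),
}

# Fit each model and record the Euclidean norm of its coefficient vector.
coef_norm = {}
for name, model in models.items():
    model.fit(X, y)
    coef_norm[name] = float(np.linalg.norm(model.coef_))
```

Comparing `coef_norm` (or the full `coef_` vectors) across models shows how each penalty shrinks the coefficients relative to the unregularized fit.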

Use a random forest regression model for this same task.
Feature importance in the random forest algorithm: During the training process, for each node, a branching decision is made based on only one feature that minimizes a chosen measure of impurity. For classification, it is typically Gini impurity or information gain (entropy), and for regression tasks, it is variance (see lecture notes). The importance of each feature is the decrease in variance, averaged over every node in the forest that splits on this feature and weighted by the number of samples the node splits.
Out of bag error: In random forest regression, since we use bootstrapping, it is easier and faster to evaluate the generalization ability. For each tree, only a subset of the dataset is used to build it (because of sampling), so the data points that are left out can be used as the test set. One can then define the prediction RMSE for each tree and then average over all trees. In sklearn random forest regression, oob_score_ will return the out-of-bag R^{2} score, so you can calculate 1 - oob_score_ as the out-of-bag error.
Set the parameters of your model with the following initial values.
Number of trees: 20
Depth of each tree: 4
Bootstrap: True
Maximum number of features: 5
Recall that a Random Forest model can handle categorical variables without having to use one-hot or scalar encodings.
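With the initial values listed above, the out-of-bag error and the feature importances can be read off a fitted `RandomForestRegressor` as sketched here. The synthetic data is a hypothetical stand-in in which only the first two columns influence the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(500, 5)).astype(float)
# Hypothetical target driven by the first two columns only.
y = 0.1 * X[:, 0] + 0.5 * (X[:, 1] > 4) + rng.normal(scale=0.02, size=500)

forest = RandomForestRegressor(
    n_estimators=20,    # number of trees
    max_depth=4,        # depth of each tree
    max_features=5,     # maximum number of features considered per split
    bootstrap=True,
    oob_score=True,     # required for oob_score_ to be computed
    random_state=0,
)
forest.fit(X, y)

oob_error = 1.0 - forest.oob_score_          # oob_score_ is the out-of-bag R^2
importances = forest.feature_importances_    # normalized to sum to 1
```

The attribute `oob_score_` only exists when the model is constructed with `oob_score=True` and `bootstrap=True`.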

Report training and average test RMSE from 10-fold cross-validation (sum up each fold's squared error, divide by the total number of data points, then take the square root) and the out-of-bag error you get from this initial model.


Sweep over the number of trees from 1 to 200 and the maximum number of features from 1 to 5. Plot Figure 1 for out-of-bag error (y-axis) against number of trees (x-axis), and Figure 2 for average test RMSE (y-axis) against number of trees (x-axis).



Pick another parameter you want to experiment on. Plot figures similar to Figure 1 and Figure 2 above. What parameters would you pick to achieve the best performance?



Report the feature importances you got from the best random forest regression you find.



Visualize your decision trees. Pick any tree (estimator) in the best random forest (with max depth = 4) and plot its structure. Which feature is the root node in this decision tree? Is it the most important feature according to the feature importance reported by the regressor?
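A text rendering is one lightweight way to inspect a tree. The sketch below uses `sklearn.tree.export_text`; a graphical alternative is `sklearn.tree.plot_tree`. The feature names and the synthetic data are hypothetical, with the target constructed so that the "workflow" column dominates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Hypothetical feature names, in column order.
names = ["week", "day_of_week", "start_hour", "workflow", "file"]

rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(500, 5)).astype(float)
y = 0.5 * (X[:, 3] > 4) + 0.05 * X[:, 1] + rng.normal(scale=0.02, size=500)

forest = RandomForestRegressor(
    n_estimators=20, max_depth=4, max_features=5, random_state=0
).fit(X, y)

tree = forest.estimators_[0]                       # any single tree in the forest
text = export_text(tree, feature_names=names)      # indented if/else description
root_feature = names[int(tree.tree_.feature[0])]   # node 0 is the root
top_feature = names[int(np.argmax(forest.feature_importances_))]
print(text)
```

Comparing `root_feature` with `top_feature` answers the question for the chosen tree; other trees in the forest may have a different root because each is grown on a different bootstrap sample.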


Now use a neural network regression model (one hidden layer) with all features one-hot encoded. Parameters:
Number of hidden units
Activation function (relu, logistic, tanh)
Plot test RMSE as a function of the number of hidden units for different activation functions. Report the best combination.
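The sweep can be organized as a double loop over activation functions and hidden-layer sizes using `MLPRegressor`. For brevity, this sketch scores each network on a single held-out split rather than 10-fold cross-validation, and it uses a single one-hot encoded toy feature; the grid of sizes, `max_iter`, and the data are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
codes = rng.integers(0, 7, size=400)
X = np.eye(7)[codes]                                  # one-hot encoded toy feature
y = np.sin(codes) + rng.normal(scale=0.05, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for activation in ("relu", "logistic", "tanh"):
    for n_hidden in (2, 8, 32):
        net = MLPRegressor(
            hidden_layer_sizes=(n_hidden,),           # one hidden layer
            activation=activation,
            max_iter=2000,
            random_state=0,
        )
        net.fit(X_tr, y_tr)
        rmse = float(np.sqrt(np.mean((y_te - net.predict(X_te)) ** 2)))
        results[(activation, n_hidden)] = rmse

best = min(results, key=results.get)                  # (activation, n_hidden) pair
```

Plotting `results` with the number of hidden units on the x-axis and one curve per activation function gives the requested figure.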


Predict the backup size for each of the workflows separately.




Use a linear regression model. Is the fit improved? Explain.





Try fitting a more complex regression function to your data. You can try a polynomial function of your variables. Try increasing the degree of the polynomial to improve your fit. Again, use 10-fold cross-validation to evaluate your results. Plot the average train and test RMSE of the trained model against the degree of the polynomial you use. Can you find a threshold on the degree of the fitted polynomial beyond which the generalization error of your model gets worse? Can you explain how cross-validation helps control the complexity of your model?
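A degree sweep can be written as a pipeline of `PolynomialFeatures` followed by `LinearRegression`, scored with `cross_validate`. This sketch averages the per-fold RMSEs, which is one possible reading of "average" here, and it runs on a synthetic one-dimensional cubic signal chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
# Hypothetical cubic signal plus noise.
y = 2.0 * (X[:, 0] ** 3 - X[:, 0]) + rng.normal(scale=0.1, size=60)

train_rmse, test_rmse = {}, {}
for degree in range(1, 13):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_validate(
        model, X, y, cv=10,
        scoring="neg_root_mean_squared_error",   # scores are negated RMSEs
        return_train_score=True,
    )
    train_rmse[degree] = float(-scores["train_score"].mean())
    test_rmse[degree] = float(-scores["test_score"].mean())

best_degree = min(test_rmse, key=test_rmse.get)
```

Training RMSE can only decrease as the degree grows, while test RMSE typically turns upward past some degree; the degree at which that happens is the threshold the question asks about.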




Use k-nearest neighbor regression and find the best parameter.
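If the parameter of interest is the number of neighbors, a grid search with cross-validation is one straightforward way to pick it. The search range of 1 to 30 and the synthetic data below are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=300)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 31))},
    scoring="neg_root_mean_squared_error",   # higher (less negative) is better
    cv=10,
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
best_rmse = float(-search.best_score_)       # mean test RMSE at the best k
```

Other `KNeighborsRegressor` options, such as `weights` or the distance metric, can be added to `param_grid` in the same way.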


Compare the regression models you have used and write some comments, such as: Which model is best at handling categorical features? Which model is good at handling sparse features, and which is not? Which model overall generates the best results?

Useful functions
Linear Regression Model: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
OneHotEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
Random Forest Model: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
Neural Network Models: http://scikit-learn.org/stable/modules/neural_networks_supervised.html
Polynomial Transformation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
KNN Regressor: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
Standardization: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
Submission: Please submit a zip file containing your report and your code, with a readme file on how to run your code, to CCLE. The zip file should be named as “Project4 UID1 UID2 … UIDn.zip” where UIDx's are student ID numbers of the team members. One submission per team is required. If you have any questions you can contact the TAs or post on Piazza.