SVM, and Model Selection Solution



1. Support Vector Machines with Scikit-Learn

(1) Install the latest stable version of scikit-learn, following the official installation directions. Also make sure to download “salary.labeled.csv” from Collab.

(2) For this assignment, you will create a program using scikit-learn’s C-Support Vector Classifier.

Given a proper set of attributes, the program will be able to determine whether an individual makes more than 50,000 USD/year. You may use code from HW2 to help you import the data. Bear in mind you will also need to do some preprocessing of the data before applying the SVM.
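One common way to do this preprocessing is to standardize the continuous attributes and one-hot encode the categorical ones. The sketch below illustrates the idea on a tiny hand-made table; the column subset, the toy values, and the choice of StandardScaler and OneHotEncoder are assumptions for illustration, not requirements of the assignment.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made frame standing in for a few columns of the salary data.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 40],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
    "sex": ["Male", "Male", "Male", "Female"],
})

continuous = ["age", "hours-per-week"]
categorical = ["workclass", "sex"]

# Scale continuous columns to zero mean and unit variance, and expand each
# categorical column into one indicator column per observed value.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), continuous),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(toy)
print(X.shape)  # 2 scaled columns + 3 workclass indicators + 2 sex indicators
```

Fitting the transformer on the labeled data only and then reusing it on the unlabeled file keeps the two feature matrices aligned; handle_unknown="ignore" avoids an error if the unlabeled file contains a category that never appeared during fitting.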

Two sample files are provided. The unlabeled sample set “salary.2Predict.csv” is a text file in the same format as the labeled dataset “salary.labeled.csv”, except that its last column includes a fake field for class labels.

2.1 You are required to provide the predicted labels for samples in “salary.2Predict.csv”.

2.2 We will evaluate your output “predictions”, an array of strings (“>50K” or “<=50K”), against the true labels of these test samples (note: you do not have these labels!). This simulates a Kaggle competition, in which test labels are always held out and only the team ranking is released after all teams have submitted their predictions. When grading this assignment, we will rank all students’ predictions, so please try to submit the best-performing model that you can!

2.3 You need to report the classification accuracy from 3-fold cross-validation (CV) on the labeled set using at least three different SVM kernels of your choice. In your write-up, please describe the kernels you tried and their performance (e.g., classification accuracy) on the train and test folds. For instance, you can summarize the results in a table in which each row contains the kernel choice, kernel parameters, CV train accuracy, and CV test accuracy.

(Hint: you can choose SVM kernels such as the basic linear kernel, the polynomial kernel with varying parameters, or the RBF kernel with varying parameters.)
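The kernel comparison described in 2.3 could be organized roughly as follows. This is a minimal sketch under stated assumptions: the synthetic data from make_classification merely stands in for the preprocessed salary features, and the candidate settings are illustrative values, not tuned or recommended ones.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed salary features and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidate settings, one per kernel family.
candidates = [
    {"kernel": "linear", "C": 1.0},
    {"kernel": "poly", "degree": 3, "C": 1.0},
    {"kernel": "rbf", "gamma": "scale", "C": 1.0},
]

rows = []
for params in candidates:
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), SVC(**params))
    scores = cross_validate(
        model, X, y, cv=3, scoring="accuracy", return_train_score=True
    )
    rows.append(
        (params, scores["train_score"].mean(), scores["test_score"].mean())
    )

# One line per kernel: settings, mean train accuracy, mean test accuracy.
for params, train_acc, test_acc in rows:
    print(f"{params}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Each printed line corresponds to one row of the suggested results table. A large gap between the train and test accuracy for a kernel is a sign that its setting overfits.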

Submission Instructions: You are required to submit the following:

(The starting code, ‘income classi’, has been provided in Collab.)

1. A Python program that includes the statements:

clf = SvmIncomeClassifier()

trained_model, cv_score = clf.train_and_select_model("salary.labeled.csv")

It should be able to train and select a model using a set of hyperparameters on the training data. These hyperparameters can be hard-coded or supplied by the user.

Next, we should be able to use a trained model to classify an unlabeled test set using the following function:

predictions = clf.predict("salary.2Predict.csv", trained_model)

2. A file “predictions.txt” generated by:

clf.output_results(predictions)

Please do not archive the file or change its name, since grading is automated.

3. A table in your PDF submission reporting classification accuracy (score) averaged over the test folds, along with details of the kernels, the best-performing hyperparameter C for each case, etc.
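For orientation, the interface named in the statements above could be laid out roughly as in the sketch below. It is not the provided starter code. The CSV layout (no header row, label in the last column), the default hyperparameter grid, the use of GridSearchCV, and the one-label-per-line output format are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC


class SvmIncomeClassifier:
    """Illustrative sketch only; not the provided starter code."""

    def __init__(self, param_grid=None):
        # Hard-coded default grid (assumed values); callers may pass their own.
        self.param_grid = param_grid or [
            {"svc__kernel": ["linear"], "svc__C": [0.1, 1.0]},
            {"svc__kernel": ["rbf"], "svc__C": [1.0, 10.0],
             "svc__gamma": ["scale"]},
        ]

    def _load(self, csv_path):
        # Assumption: no header row, comma-separated, label in the last column.
        data = pd.read_csv(csv_path, header=None, skipinitialspace=True)
        return data.iloc[:, :-1], data.iloc[:, -1]

    def train_and_select_model(self, training_csv):
        X, y = self._load(training_csv)
        numeric = X.select_dtypes(include="number").columns.tolist()
        categorical = [c for c in X.columns if c not in numeric]
        pipeline = Pipeline([
            ("prep", ColumnTransformer([
                ("num", StandardScaler(), numeric),
                ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
            ])),
            ("svc", SVC()),
        ])
        # 3-fold CV over the grid; the best setting is refit on all the data.
        search = GridSearchCV(pipeline, self.param_grid, cv=3,
                              scoring="accuracy")
        search.fit(X, y)
        return search.best_estimator_, search.best_score_

    def predict(self, test_csv, trained_model):
        # The last column of the unlabeled file holds fake labels; ignore it.
        X, _ = self._load(test_csv)
        return trained_model.predict(X)

    def output_results(self, predictions, path="predictions.txt"):
        # Assumed format: one predicted label per line.
        with open(path, "w") as handle:
            for label in predictions:
                handle.write(f"{label}\n")
```

Keeping the preprocessing inside the pipeline means the returned model can be applied directly to the raw rows of the unlabeled file, and the scaler and encoder are refit on each training fold during cross-validation rather than seeing the held-out fold.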

Classes: >50K, <=50K.


age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th,


education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.