$30.00
Description
Submission
Submit your solutions electronically on the course Gradescope site as PDF les.
If you plan to typeset your solutions, please use the LaTeX solution template. If you must submit scanned handwritten solutions, please use a black pen on blank white paper and a highquality scanner app.
We de ne a set of concepts
H = fsgn(ax^{2} + bx + c); a; b; c; 2 Rg;
where sgn( ) is 1 when the argument is positive, and 0 otherwise. What is the VC dimension of H? Prove your claim.

Kernels [15 pts]
Given vectors x and z in R^{2}, de ne the kernel K (x; z) = (1 + x z)^{3} for any value > 0.
Find the corresponding feature map ( )^{1}. What are the similarities/di erences from the kernel
K(x; z) = (1 + x z)^{3}, and what role does the parameter play?

SVM [20 pts]
Suppose we are looking for a maximummargin linear classi er through the origin, i.e. b = 0 (also hard margin, i.e., no slack variables). In other words, we minimize ^{1}_{2} jjwjj^{2} subject to y_{n}w^{T} x_{n} 1; n = 1; : : : ; N.

Suppose we have two training examples, x_{1} = (1; 1)^{T} and x_{2} = (1; 0)^{T} with labels y_{1} = 1 and y_{2} = 1. What is w in this case?

Suppose we now allow the o set parameter b to be nonzero. How would the classi er and the margin change in the previous question? What are (w ; b )? Compare your solutions with and without o set.

Twitter analysis using SVMs [50 pts]
In this project, you will be working with Twitter data. Speci cally, we have supplied you with a number of tweets that are reviews/reactions to movies^{2},
e.g., \@nickjfrost just saw The Boat That Rocked/Pirate Radio and I thought it was brilliant! You and the rest of the cast were fantastic! < 3″.
You will learn to automatically classify such tweets as either positive or negative reviews. To do this, you will employ Support Vector Machines (SVMs), a popular choice for a large number of classi cation problems.
Download the code and data sets from the course website. It contains the following data les:
tweets.txt contains 630 tweets about movies. Each line in the le contains exactly one tweet, so there are 630 lines in total.
labels.txt contains the corresponding labels. If a tweet praises or recommends a movie, it is classi ed as a positive review and labeled +1; otherwise it is classi ed as a negative review and labeled 1. These labels are ordered, i.e. the label for the i^{th} tweet in tweets.txt corresponds to the i^{th} number in labels.txt.
Skim through the tweets to get a sense of the data.
The python le twitter.py contains skeleton code for the project. Skim through the code to understand its structure.
4.1 Feature Extraction [10 pts]
We will use a bagofwords model to convert each tweet into a feature vector. A bagofwords model treats a text le as a collection of words, disregarding word order. The rst step in building a bagofwords model involves building a \dictionary”. A dictionary contains all of the unique words in the text le. For this project, we will be including punctuations in the dictionary too. For example, a text le containing \John likes movies. Mary likes movies2!!” will have a dictionary { John :0, Mary :1, likes :2, movies :3, movies2 :4, . :5, ! :6}. Note that the (key,value) pairs are (word, index), where the index keeps track of the number of unique words (size of the dictionary).
Given a dictionary containing d unique words, we can transform the n variablelength tweets into n feature vectors of length d by setting the i^{th} element of the j^{th} feature vector to 1 if the i^{th} dictionary word is in the j^{th} tweet, and 0 otherwise.

We have implemented extract_words(…) that processes an input string to return a list of unique words. This method takes a simplistic approach to the problem, treating any string of characters (that does not include a space) as a \word” and also extracting and including all unique punctuations.
Implement extract_dictionary(…) that uses extract_words(…) to read all unique words contained in a le into a dictionary (as in the example above). Process the tweets in the order they appear in the le to create this dictionary of d unique words/punctuations.

Next, implement extract_feature_vectors(…) that produces the bagofwords represen
tation of a le based on the extracted dictionary. That is, for each tweet i, construct a feature vector of length d, where the j^{th} entry in the feature vector is 1 if the j^{th} word in the dictionary is present in tweet i, or 0 otherwise. For n tweets, save the feature vectors in a feature matrix, where the rows correspond to tweets (examples) and the columns correspond to words (features). Maintain the order of the tweets as they appear in the le.

In main(…), we have provided code to read the tweets and labels into a (630; d) feature matrix and (630; 1) label array. Split the feature matrix and corresponding labels into your training and test sets. The rst 560 tweets will be used for training and the last 70 tweets will be used for testing. **All subsequent operations will be performed on these data.**

Report the dimensionality of the feature matrix in problem 4.1.(c) in your writeup.
4.2 Hyperparameter Selection for a LinearKernel SVM [30 pts]
Next, we will learn a classi er to separate the training data into positive and negative tweets. For the classi er, we will use SVMs with the linear kernel. We will use the sklearn.svm.SVC class^{3} and explicitly set only three of the initialization parameters: kernel, and C. As usual, we will use SVC.fit(X,y) to train our SVM, but in lieu of using SVC.predict(X) to make predictions, we will use SVC.decision_function(X), which returns the (signed) distance of the samples to the separating hyperplane.
SVMs have hyperparameters that must be set by the user. For both linear kernel SVMs, we will select the hyperparameters using 5fold crossvalidation (CV). Using 5fold CV, we will select the hyperparameters that lead to the `best’ mean performance across all 5 folds.

The result of a hyperparameter selection often depends upon the choice of performance mea
sure. Here, we will consider the following performance measures: accuracy, F1Score, and AUROC^{4}.
Implement performance(…). All measures are implemented in sklearn.metrics library.

Next, implement cv_performance(…) to return the mean kfold CV performance for the performance metric passed into the function. Here, you will make use of SVC.fit(X,y) and SVC.decision_function(X), as well as your performance(…) function.
You may have noticed that the proportion of the two classes (positive and negative) are not equal in the training data. When dividing the data into folds for CV, you should try to keep the class proportions roughly the same across folds. In your writeup, brie y describe why it might be bene cial to maintain class proportions across folds. Then, in main(…), use sklearn.cross_validation.StratifiedKFold(…) to split the data for 5fold CV, making sure to stratify using only the training labels.

Now, implement select_param_linear(…) to choose a setting for C for a linear SVM based on the training data and the speci ed metric. Your function should call cv_performance(…),
passing in instances of SVC(kernel= linear , C=c) with di erent values for C, e.g., C = 10 ^{3}; 10 ^{2}; 10 ^{1}; 1; 10; 10^{2}.

Finally, using the training data from Section 4.1 and the functions implemented here, nd the best setting for C for each performance measure mentioned above. Report your ndings in tabular format (up to the fourth decimal place):
10 ^{3}
10 ^{2}
10 ^{1}
10^{0}
10^{1}
10^{2}
best C
Your select_param_linear(…) function returns the `best’ C given a range of values. How does the 5fold CV performance vary with C and the performance metric?
4.3 Test Set Performance [10 pts]
In this section, you will apply the classi er learned in the previous sections to the test data from Section 4.1. Once you have predicted labels for the test data, you will measure performance.

Based on the results you obtained in Section 4.2, choose a hyperparameter setting for the linearkernel SVM. Then, in main(…), using the training data extracted in Section 4.1 and SVC.fit(…), train a linearkernel SVM with your chosen settings.

Implement performance_test(…) which returns the value of a performance measure, given the test data and a trained classi er.

For each performance metric, use performance_test(…) and the trained linearkernel SVM classi er to measure performance on the test data. Report the results. Be sure to include the name of the performance metric employed, and the performance on the test data.