Assignment #2 Solution


The assignment should be submitted via Blackboard. Submit a file called assignment2.zip, containing the following two files:

  • answers.pdf, for your answers to the written tasks and for the output that the programming task asks you to include. Only PDF files will be accepted. All text should be typed, and any figures should be computer-generated. Scans of handwritten answers will NOT be accepted.
  • logistic_regression.m, containing your Matlab code for the programming part. In addition, you must include in your zip file any other Matlab files (with auxiliary code) that are needed to run your solution.

These naming conventions are mandatory; non-adherence to them can incur a penalty of up to 20 points.

Your name and UTA ID number should appear on the top line of both documents.


Task 1 (70 points, programming)

In this task you will implement logistic regression using Iterative Reweighted Least Squares (IRLS), as described in Section 4.3.3 of the PRML textbook.
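For reference, a single IRLS update from PRML Section 4.3.3 can be sketched in Matlab as follows. The variable names (Phi, t, w) are illustrative, not prescribed by the assignment:

```
% One IRLS (Newton-Raphson) update for logistic regression, PRML Sec. 4.3.3.
% Phi: N-by-M design matrix, t: N-by-1 vector of 0/1 labels, w: current weights.
y = 1 ./ (1 + exp(-Phi * w));      % classifier outputs, y_n = sigma(w' * phi_n)
R = diag(y .* (1 - y));            % diagonal weighting matrix R
w = w - (Phi' * R * Phi) \ (Phi' * (y - t));   % Newton step: w - (Phi'*R*Phi)^-1 * Phi'*(y - t)
```

Using the backslash operator to solve the linear system, rather than explicitly inverting Phi'*R*Phi, is numerically preferable.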

Arguments

You must implement a Matlab function called logistic_regression that uses logistic regression to train a linear classifier. Your function should be invoked as follows:

logistic_regression(<training_file>, <degree>, <test_file>)

The arguments provide the function with the following information:

  • The first argument, <training_file>, is the path name of the training file, where the training data is stored. The path name can specify any file stored on the local computer.
  • The second argument, <degree>, is a number equal to either 1 or 2. We will not test your code with any other values. The degree specifies what function φ you should use. Suppose that you have an input vector x = (x1, x2, …, xD)^T.
    • If the degree is 1, then φ(x) = (1, x1, x2, …, xD)^T.
    • If the degree is 2, then φ(x) = (1, x1, (x1)^2, x2, (x2)^2, …, xD, (xD)^2)^T.
  • The third argument, <test_file>, is the path name of the test file, where the test data is stored. The path name can specify any file stored on the local computer.
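As a sketch, the mapping φ could be implemented like this; the helper name build_phi and the layout of X as an N-by-D matrix are assumptions, not requirements:

```
% Hypothetical helper: applies phi to every row of an N-by-D data matrix X.
function Phi = build_phi(X, degree)
    [N, D] = size(X);
    Phi = ones(N, 1);                          % leading constant 1
    for d = 1:D
        if degree == 1
            Phi = [Phi, X(:, d)];              % append x_d
        else
            Phi = [Phi, X(:, d), X(:, d).^2];  % append x_d and (x_d)^2
        end
    end
end
```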

The training and test files will follow the same format as the text files in the UCI datasets directory, where a description of the datasets and the file format can also be found. For each dataset, a training file and a test file are provided. The name of each file indicates which dataset the file belongs to, and whether the file contains training or test data. Your code should also work with ANY OTHER training and test files that use the same format as the files in the UCI datasets directory.

As the description states, do NOT use data from the last column (i.e., the class labels) as features. In these files, all columns except for the last one contain example inputs. The last column contains the class label.

Converting to Binary Classification Problem

We have only covered logistic regression for binary classification problems. In this assignment, you should convert the class labels found in the files as follows:

  • If the class label is equal to 1, it stays equal to 1.
  • If the class label is not equal to 1, you must set it equal to 0.

This way, your code will only see class labels that are 1 or 0.
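Assuming the file has been loaded into a matrix called data (for example, with Matlab's load function), the split into features and binary labels can be sketched as:

```
data = load(training_file);     % whitespace-separated numeric file
X = data(:, 1:end-1);           % all columns except the last are features
t = double(data(:, end) == 1);  % label 1 stays 1; any other label becomes 0
```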

Weight Initialization

All weights must be initialized to 0.

Stopping Criteria

For logistic regression, the training goes through iterations. At each iteration, you should decide as follows if you should stop the training:

  • Compare the new weight values, computed at this iteration, with the previous weight values. If the sum of absolute values of differences of individual weights is less than 0.001, then you should stop the training.
  • Compute the cross-entropy error, using the new weights computed at this iteration. Compare it with the cross-entropy error computed using the previous values of the weights. If the change in the error is less than 0.001, then you should stop the training.

If at least one of the two criteria listed above is satisfied, you should stop the training.
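The two checks can be sketched as below, where cross_entropy is a hypothetical helper computing E(w) = -Σ_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)] (PRML Eq. 4.90), and w_old/w_new are the weights before and after the current IRLS update:

```
% Stop when either criterion is met.
weight_change = sum(abs(w_new - w_old));
error_change  = abs(cross_entropy(Phi, t, w_new) - cross_entropy(Phi, t, w_old));
if weight_change < 0.001 || error_change < 0.001
    converged = true;           % stop training
end
```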

Numerical Issues for Yeast Dataset

Your code may not work on the yeast dataset for degree=2. Don’t worry about that, we will not test for that case.

Output of Training Stage

At the end of the training stage, your program should print out the values of the weights that you have estimated. Use:

fprintf('w0=%.4f\n', w0);
fprintf('w1=%.4f\n', w1);
fprintf('w2=%.4f\n', w2);
...

Output of Test Stage

After the training stage, you should apply the classifier that you have learned on the test data. For each test object (following the order in which each test object appears in the test file), you should print a line containing the following info:

  • object ID. This is the line number where that object occurs in the test file. Start with 1 in numbering the objects, not with 0.
  • predicted class (the result of the classification). If your classification result is a tie, choose one of them randomly.
  • probability of the predicted class given the data. This probability is the output of the classifier if the predicted class is 1. If the predicted class is 0, then the probability is 1 minus the output of the classifier.
  • true class (should be binary, 0 or 1).
  • accuracy. This is defined as follows:
    • If there were no ties in your classification result, and the predicted class is correct, the accuracy is 1.
    • If there were no ties in your classification result, and the predicted class is incorrect, the accuracy is 0.
    • If there were ties in your classification result, and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
    • Since we only have two classes in this assignment, a tie always involves both classes, so the correct class is among the tied classes and the accuracy is 0.5.
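The decision rule for one test object can be sketched as follows, where y is the classifier output p(class = 1 | x) for that object (variable names illustrative):

```
if y > 0.5
    predicted_class = 1; probability = y;
elseif y < 0.5
    predicted_class = 0; probability = 1 - y;
else                                % exact tie: pick a class at random
    predicted_class = randi([0 1]); probability = 0.5;
end
if y == 0.5
    accuracy = 0.5;                 % tie between the two classes
else
    accuracy = double(predicted_class == true_class);
end
```

The overall classification accuracy is then simply the mean of these per-object accuracy values.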

To produce this output in a uniform manner, use these printing statements:

fprintf('ID=%5d, predicted=%3d, probability = %.4f, true=%3d, accuracy=%4.2f\n', ...
        object_id, predicted_class, probability, true_class, accuracy);

After you have printed the results for all test objects, you should print the overall classification accuracy, which is defined as the average of the classification accuracies you printed out for each test object. To print the classification accuracy in a uniform manner, use this printing statement:

fprintf('classification accuracy=%6.4f\n', classification_accuracy);

In your answers.pdf document, provide the full output of the training stage, and ONLY THE LAST LINE (the line printing the classification accuracy) of the output by the test stage of your program, when given pendigits_training.txt as the training file, and pendigits_test.txt as the test file. Provide this output for both degree=1 and degree=2 (where degree is the second argument).


Task 2 (10 points, written)

Consider the function f(a, b) = a^2 + 2b^2 - 5a - 2ab.

  • Does f(a,b) have a global maximum that is not equal to positive infinity? If so, what is the maximum, and what values of a and b provide that maximum?
  • Does f(a,b) have a global minimum that is not equal to negative infinity? If so, what is the minimum, and what values of a and b provide that minimum?

For both questions, you need to justify your answers. Correct answers with insufficient justification will not receive credit.


Task 3 (10 points, written)

We are given these training examples for a linear regression problem:

x1 = 5.3,  t1 = 9.6
x2 = 7.1,  t2 = 4.2
x3 = 6.4,  t3 = 2.2

We just want to fit a line to this data, and we want to find the 2-dimensional vector w that minimizes D(w) as defined in slide 56 of the linear regression slides. What is the value of w in the limit where λ approaches positive infinity? Justify your answer. Correct answers with insufficient justification will not receive credit.


Task 4 (10 points, written)

We are given these training examples for a linear regression problem:

x1 = 5.3,  t1 = 9.6
x2 = 7.1,  t2 = 4.2
x3 = 6.4,  t3 = 2.2

We are also given these two lines as possible solutions:

  • f(x) = 3.1x + 4.2
  • f(x) = 2.4x - 1.5

Which of these lines is a better solution according to the sum-of-squares criterion? This criterion is defined as the function E_D(w) in slide 25 of the linear regression slides. Justify your answer. Correct answers with insufficient justification will not receive credit.