Assignment 4: Classification


Introduction

In this assignment you will implement a Naive Bayes classifier for the purpose of binary classification.

You may not use any functions from an ML library in your code. And as always, your code should work on any dataset that has the same general form as the provided one.

Grading

Although all assignments will be weighed equally in computing your homework grade, below is the grading rubric we will use for this assignment:

Part 1 (Theory): 55 pts
Part 2 (Naive Bayes): 35 pts
Report: 10 pts
Extra Credit: 10 pts
TOTAL: 110 (of 100) pts


Datasets

Spambase Dataset (spambase.data) This dataset consists of 4601 instances of data, each with 57 features and a class label designating whether the sample is spam. The features are real-valued and are described in detail here:

https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names

Data obtained from: https://archive.ics.uci.edu/ml/datasets/Spambase


Part 1: Theory

1. Consider the following set of training examples for an unknown target function (x1, x2) → y:

Y    x1   x2   Count
+    T    T    3
+    T    F    4
+    F    T    4
+    F    F    1
-    T    T    0
-    T    F    1
-    F    T    3
-    F    F    5

(a) What is the sample entropy H(Y) of this training data, using log base 2? (5pts)

(b) What are the information gains for branching on variables x1 and x2? (5pts)

(c) Draw the decision tree that would be learned by the ID3 algorithm, without pruning, from this training data. All leaf nodes should have a single class choice at them. If necessary, use the majority class or, in the case of a tie, choose one at random. (10pts)
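As a numerical cross-check for the entropy and information-gain sub-questions, the quantities can be computed mechanically from the count table above. The Python sketch below is only an illustrative aid, not part of the required hand calculation; the row expansion and helper names are mine.

```python
import math
from collections import Counter

# Expand the (Y, x1, x2, Count) table above into individual examples
rows = ([("+", "T", "T")] * 3 + [("+", "T", "F")] * 4 +
        [("+", "F", "T")] * 4 + [("+", "F", "F")] * 1 +
        [("-", "T", "F")] * 1 + [("-", "F", "T")] * 3 +
        [("-", "F", "F")] * 5)          # the (-, T, T) row has count 0

def entropy(labels):
    """H(Y) = -sum_c p_c log2 p_c over the class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, j):
    """IG(Y; x_j) = H(Y) - sum_v (|S_v|/|S|) * H(Y | x_j = v)."""
    gain = entropy([r[0] for r in rows])
    for v in {r[j] for r in rows}:
        subset = [r[0] for r in rows if r[j] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

print(entropy([r[0] for r in rows]))          # sample entropy H(Y)
print(info_gain(rows, 1), info_gain(rows, 2)) # gains for x1 and x2
```

Branching on whichever variable shows the larger gain is exactly the split ID3 would make at the root.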

2. We decided that maybe we can use the number of characters and the average word length of an essay to determine if the student should get an A in a class or not. Below are five samples of this data:

# of Chars   Average Word Length   Give an A
216          5.68                  Yes
69           4.78                  Yes
302          2.31                  No
60           3.16                  Yes
393          4.2                   No

(a) What are the class priors, P(A = Yes) and P(A = No)? (5pts)

(b) Find the parameters of the Gaussians necessary to do Gaussian Naive Bayes classification on this decision to give an A or not. Standardize the features first over all the data together so that there is no unfair bias towards features of different scales. (5pts)

(c) Using your response from the prior question, determine whether an essay with 242 characters and an average word length of 4.56 should get an A. Show the math to support your decision. (10pts)
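The sub-questions above can be cross-checked numerically. The sketch below is illustrative Python, not a substitute for the required hand-worked math; the variable names and the use of the sample standard deviation are my assumptions. It standardizes both features over all five samples together, fits per-class Gaussians, and scores the 242-character query.

```python
import math

# Data from the table above: (# chars, avg word length) and the A decision
X = [(216, 5.68), (69, 4.78), (302, 2.31), (60, 3.16), (393, 4.2)]
y = ["Yes", "Yes", "No", "Yes", "No"]

def mean(v): return sum(v) / len(v)
def std(v, m):  # sample standard deviation
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

# Standardize each feature over ALL the data together, as the question asks
cols = list(zip(*X))
stats = [(mean(c), std(c, mean(c))) for c in cols]
Z = [tuple((x - m) / s for x, (m, s) in zip(row, stats)) for row in X]

def gaussian_pdf(x, m, s):
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def posterior(z, label):
    """Unnormalized P(label) * prod_j N(z_j | mu_label_j, sigma_label_j)."""
    idx = [i for i, lab in enumerate(y) if lab == label]
    p = len(idx) / len(y)                      # class prior
    for j in range(len(z)):
        vals = [Z[i][j] for i in idx]
        m = mean(vals)
        p *= gaussian_pdf(z[j], m, std(vals, m))
    return p

# Standardize and score the 242-character, 4.56-average-word-length essay
query = tuple((x - m) / s for x, (m, s) in zip((242, 4.56), stats))
print({lab: posterior(query, lab) for lab in ("Yes", "No")})
```

Whichever label yields the larger (unnormalized) posterior is the Naive Bayes decision.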

3. Another common activation function for use in logistic regression or artificial neural networks is the hyperbolic tangent function, tanh, which is defined as:

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))    (1)


(a) Since the hyperbolic tangent function outputs values in the range −1 <= tanh(z) <= 1, we will have to augment our log likelihood objective function to deal with this range. If we opt to use this function for logistic regression (as opposed to 1/(1 + e^(−z))), what will this objective function be? Show your work. (5pts)

(b) In order to compute the gradient of your previous answer with respect to θ_j, we’ll need to compute the gradient of the hyperbolic tangent function itself. Use the exponential definition of the hyperbolic tangent function provided at the top of this problem to show that ∂/∂θ_j (tanh(xθ)) = x_j (1 − tanh(xθ)^2). (5pts)

(c) Using the fact that ∂/∂θ_j (tanh(xθ)) = x_j (1 − tanh(xθ)^2), what is the gradient of your log likelihood function in part (a) with respect to θ_j? Show your work. (5pts)
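The derivative identity in part (b) can be sanity-checked numerically against a finite-difference approximation. The sample vector and parameters below are arbitrary illustrative values, not part of the assignment.

```python
import math

def tanh(z):
    """Exponential definition from equation (1): (e^z - e^-z) / (e^z + e^-z)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

# Check d/d(theta_j) tanh(x . theta) = x_j * (1 - tanh(x . theta)^2)
x = [0.5, -1.2, 2.0]         # arbitrary sample, chosen for illustration
theta = [0.3, 0.7, -0.1]
dot = sum(a * b for a, b in zip(x, theta))

eps = 1e-6
for j in range(len(theta)):
    bumped = list(theta)
    bumped[j] += eps         # perturb one parameter at a time
    numeric = (tanh(sum(a * b for a, b in zip(x, bumped))) - tanh(dot)) / eps
    analytic = x[j] * (1 - tanh(dot) ** 2)
    print(j, numeric, analytic)
```

The two columns agree to several decimal places for every component, which is what the identity predicts.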


Part 2: Naive Bayes Classifier

For your first programming task, you’ll train and test a Naive Bayes classifier.

Download the dataset spambase.data from Blackboard. As mentioned in the Datasets area, this dataset contains 4601 rows of data, each with 57 continuous-valued features followed by a binary class label (0 = not-spam, 1 = spam). There is no header information in this file and the data is comma separated. As always, your code should work on any dataset that lacks header information and has several comma-separated continuous-valued features followed by a class id ∈ {0, 1}.

Write a script that:

1. Reads in the data.

2. Randomizes the data.

3. Selects the first 2/3 (round up) of the data for training and the remainder for testing.

4. Standardizes the data (except for the last column, of course) using the training data.

5. Divides the training data into two groups: Spam samples and Non-Spam samples.

6. Creates Normal models for each feature for each class.

7. Classifies each testing sample using these models, choosing the class label based on which class probability is higher.

8. Computes the following statistics using the testing data results:

   (a) Precision

   (b) Recall

   (c) F-measure

   (d) Accuracy
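The steps above can be sketched end to end. Since spambase.data is distributed via Blackboard, the sketch below substitutes synthetic stand-in data of the same general form (continuous features followed by a 0/1 label); the generated data, its dimensions, and all helper names are assumptions for illustration only, not the required implementation.

```python
import math, random

random.seed(0)  # seed the generator with zero before randomizing

# Synthetic stand-in for spambase.data: 3 continuous features + 0/1 label
data = [[random.gauss(lbl, 1.0) for _ in range(3)] + [lbl]
        for lbl in [0, 1] * 100]
random.shuffle(data)

split = math.ceil(len(data) * 2 / 3)            # first 2/3 (round up) trains
train, test = data[:split], data[split:]

# Standardize every column except the label, using TRAINING statistics only
n_feat = len(data[0]) - 1
for j in range(n_feat):
    col = [row[j] for row in train]
    m = sum(col) / len(col)
    s = math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1)) or 1.0
    for row in train + test:
        row[j] = (row[j] - m) / s

def fit(rows):
    """One Normal model (mean, std) per feature for one class."""
    models = []
    for j in range(n_feat):
        col = [r[j] for r in rows]
        m = sum(col) / len(col)
        s = math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1)) or 1e-9
        models.append((m, s))
    return models

groups = {c: [r for r in train if r[-1] == c] for c in (0, 1)}
models = {c: fit(rows) for c, rows in groups.items()}
priors = {c: len(rows) / len(train) for c, rows in groups.items()}

def log_gauss(x, m, s):
    return -((x - m) ** 2) / (2 * s * s) - math.log(s * math.sqrt(2 * math.pi))

def classify(row):
    """Pick the class with the higher (log) Naive Bayes score."""
    scores = {c: math.log(priors[c]) +
                 sum(log_gauss(row[j], *models[c][j]) for j in range(n_feat))
              for c in (0, 1)}
    return max(scores, key=scores.get)

tp = fp = tn = fn = 0
for row in test:
    pred, actual = classify(row), row[-1]
    if pred == 1 and actual == 1: tp += 1
    elif pred == 1: fp += 1
    elif actual == 1: fn += 1
    else: tn += 1

precision = tp / (tp + fp) if tp + fp else 1.0
recall = tp / (tp + fn) if tp + fn else 1.0
f_measure = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
accuracy = (tp + tn) / len(test)
print(precision, recall, f_measure, accuracy)
```

Working in log space avoids the underflow you would otherwise hit when multiplying 57 small densities together.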

Implementation Details

1. Seed the random number generator with zero prior to randomizing the data.

2. Matlab interprets 0 log 0 as NaN (not a number). You should identify this situation and consider it to be a value of zero.
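One way to make the 0 log 0 = 0 convention explicit is to guard the multiplication before it happens. The helper below is a Python illustration of the idea (in Matlab you would instead detect the result with isnan and replace it with zero):

```python
import math

def xlogx(p):
    """Return p * log2(p), treating 0 * log(0) as 0 (the entropy-limit
    convention). Matlab would silently produce NaN for this case."""
    return 0.0 if p == 0 else p * math.log2(p)

print(xlogx(0.0), xlogx(0.5))  # -> 0.0 -0.5
```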

In your report you will need:

1. The statistics requested for your Naive Bayes classifier run.


Precision: Around 68%

Recall: Around 95%

F-Measure: Around 79%

Accuracy: Around 81%

Table 1: Evaluation for Naive Bayes classifier

Extra Credit: The Precision-Recall Tradeoff

For 10 extra credit points, find a dataset of your choosing on which you can perform binary classification. Now apply your Naive Bayes code to this dataset and vary the threshold required for an observation to be considered your positive class.

Write a script that:

1. Reads in the data.

2. Randomizes the data.

3. Selects the first 2/3 (round up) of the data for training and the remainder for testing.

4. Standardizes the data using the training data.

5. Divides the training data into two groups: Positive samples and Negative samples.

6. Creates Normal models for each feature and each class.

7. Computes P(Positive|data) and P(Negative|data) for each testing sample using Naive Bayes, normalizing them such that P(Positive|data) + P(Negative|data) = 1.

8. Varies the threshold from 0.0 to 1.0 in increments of 0.05, each time:

   (a) Using the current threshold, labeling each testing sample as True Positive, True Negative, False Positive, or False Negative.

   (b) Computing the Precision and Recall for this threshold level.

9. Plots Precision vs Recall.
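The threshold sweep in the steps above can be sketched as follows. Since the choice of dataset is yours, the labels and normalized posteriors here are synthetic stand-ins for what your classifier would produce, and a posterior exactly equal to the threshold is counted as a positive call (a convention you should pick and state).

```python
import random

random.seed(0)

# Stand-in true labels and normalized P(Positive|data) values for a test set
labels = [random.randint(0, 1) for _ in range(200)]
posteriors = [min(1.0, max(0.0, random.gauss(0.7 if y else 0.3, 0.2)))
              for y in labels]

curve = []
thresholds = [i * 0.05 for i in range(21)]   # 0.0 to 1.0 in steps of 0.05
for t in thresholds:
    tp = sum(1 for p, y in zip(posteriors, labels) if p >= t and y == 1)
    fp = sum(1 for p, y in zip(posteriors, labels) if p >= t and y == 0)
    fn = sum(1 for p, y in zip(posteriors, labels) if p < t and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0   # zero-denominator rule
    recall = tp / (tp + fn) if tp + fn else 1.0
    curve.append((recall, precision))

for r, p in curve:
    print(f"recall={r:.2f} precision={p:.2f}")
# For the plot itself: plt.plot([r for r, _ in curve], [p for _, p in curve])
```

At threshold 0.0 everything is called positive, so recall is 1 and precision equals the positive base rate; raising the threshold trades recall away for precision.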

Implementation Details

  1. In computing Precision, Recall, F-Measure and Accuracy make sure the denominators don’t become zero. If they do, check the numerator. If that’s also zero then set the value to one.
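That rule fits in one small helper. For these four statistics a zero denominator implies a zero numerator (e.g. tp = 0 whenever tp + fp = 0), so a sketch of the guard is:

```python
def safe_ratio(num, den):
    """Guarded ratio for Precision, Recall, F-Measure, and Accuracy: when
    the denominator is zero the numerator is too, and the rule above says
    that case should evaluate to one."""
    return num / den if den else 1.0

print(safe_ratio(0, 0), safe_ratio(3, 4))  # -> 1.0 0.75
```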


Submission

For your submission, upload to Blackboard a single zip file (again, no spaces or non-underscore special characters in file or directory names) containing:

1. PDF Writeup

2. Source Code

3. If you did the extra credit, the dataset used.

4. readme.txt file

The readme.txt file should contain information on how to run your code to reproduce results for each part of the assignment.

The PDF document should contain the following:

1. Part 1:

   (a) Answers to theory questions

2. Part 2:

   (a) Requested Classification Statistics

3. Extra Credit:

   (a) Citation and link to dataset.

   (b) Plot of Precision vs Recall
