Instructions: This assignment should be completed individually. Please make sure your answers are legible (and preferably formatted using MS Word or LaTeX/LyX). If a question requires you to follow an algorithm, show a clear trace of the algorithm. If the algorithm is iterative, show the details of the first two iterations. For each of the remaining iterations, show the status of the algorithm at the end of the iteration. Please also submit to Blackboard a single your_name_hwk03.zip file that contains a PDF or Word version of your solutions (no scanned images, please), together with the source code, the input data, and the output of your program.
The following questions are based on this dataset (table) from an employee database.

department  status  age     salary    count
sales       senior  31..35  46k..50k  30
sales       junior  26..30  26k..30k  40
sales       junior  31..35  31k..35k  40
systems     junior  21..25  46k..50k  20
systems     senior  31..35  66k..70k  5
systems     junior  26..30  46k..50k  3
systems     senior  41..45  66k..70k  3
marketing   senior  36..40  46k..50k  10
marketing   junior  31..35  41k..45k  4
secretary   senior  46..50  36k..40k  4
secretary   junior  26..30  26k..30k  6
The data is a summary of the original data table. For example, the first row indicates that 30 employees in the sales department have an age between 31 and 35 inclusive and a salary between 46K and 50K inclusive. The attribute status is the class label.

[30] Explain how the basic decision tree algorithm can be extended to incorporate the ranges (for age and salary) and the counts (in the count column) into the calculation of the impurity measures. Then, apply the extended algorithm by hand to find the best split of the given data using each of the impurity measures listed below. You only need to show how to find the best split at the root node of the decision tree. Show the details of the calculations for at least one attribute, and show the results for the remaining attributes. You may want to write a program to perform the calculations. If you do so, also hand in your program source code.


(a) information gain
(b) gain ratio
(c) gini index

Computer Science 4373 Assignment 3 February 4, 2018
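
One way to read the question on impurity measures is to let each row contribute its count as a weight and to treat each age or salary range as a single discrete value. The sketch below follows that reading for a multiway split at the root. The names ROWS, ATTRS, and split_scores are illustrative assumptions, and binary splits at range boundaries would be an equally defensible extension.

```python
import math
from collections import defaultdict

# Summary table from the assignment: (department, status, age, salary, count).
ROWS = [
    ("sales", "senior", "31..35", "46k..50k", 30),
    ("sales", "junior", "26..30", "26k..30k", 40),
    ("sales", "junior", "31..35", "31k..35k", 40),
    ("systems", "junior", "21..25", "46k..50k", 20),
    ("systems", "senior", "31..35", "66k..70k", 5),
    ("systems", "junior", "26..30", "46k..50k", 3),
    ("systems", "senior", "41..45", "66k..70k", 3),
    ("marketing", "senior", "36..40", "46k..50k", 10),
    ("marketing", "junior", "31..35", "41k..45k", 4),
    ("secretary", "senior", "46..50", "36k..40k", 4),
    ("secretary", "junior", "26..30", "26k..30k", 6),
]
LABEL, COUNT = 1, 4                                # status is the class label
ATTRS = {"department": 0, "age": 2, "salary": 3}   # candidate split columns


def weight(rows):
    """Total number of employees represented by the summary rows."""
    return sum(r[COUNT] for r in rows)


def class_fractions(rows):
    """Class proportions, with every row weighted by its count."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[LABEL]] += r[COUNT]
    n = weight(rows)
    return [c / n for c in totals.values()]


def entropy(rows):
    return -sum(p * math.log2(p) for p in class_fractions(rows) if p > 0)


def gini(rows):
    return 1.0 - sum(p * p for p in class_fractions(rows))


def split_scores(rows, col):
    """Return (information gain, gain ratio, weighted gini) for a multiway split."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[col]].append(r)        # assumption: one branch per distinct value
    groups = list(parts.values())
    n = weight(rows)
    fracs = [weight(g) / n for g in groups]
    gain = entropy(rows) - sum(f * entropy(g) for f, g in zip(fracs, groups))
    split_info = -sum(f * math.log2(f) for f in fracs)
    gini_split = sum(f * gini(g) for f, g in zip(fracs, groups))
    return gain, gain / split_info, gini_split


scores = {name: split_scores(ROWS, col) for name, col in ATTRS.items()}
```

Under these assumptions, the attribute with the largest gain (or gain ratio) or the smallest weighted gini would be the candidate for the root split; the hand trace asked for in the question still has to be shown separately.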


[30] Extend the Naive Bayes classifier algorithm so that it can also incorporate the ranges and counts in the calculation of the probabilities.

(a) Show how the extended algorithm would calculate the prior probabilities and the conditional probabilities P(A_k | C) using the data table as the training data.

(b) Show how the extended algorithm would determine the status of the following data tuple:

t = <department: systems; status: ?; age: 28; salary: 50K>
Again, you need to show the details of the calculation for some of the probabilities, and for tuple t.
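
As a rough illustration of the Naive Bayes question, the sketch below estimates the priors and conditionals as count-weighted relative frequencies, treating each range as a categorical value. The tuple t is mapped by hand to the ranges containing age 28 and salary 50K. The function names train and posterior_scores are assumptions made for this example, and no smoothing is applied.

```python
from collections import defaultdict

# Summary table from the assignment: (department, status, age, salary, count).
ROWS = [
    ("sales", "senior", "31..35", "46k..50k", 30),
    ("sales", "junior", "26..30", "26k..30k", 40),
    ("sales", "junior", "31..35", "31k..35k", 40),
    ("systems", "junior", "21..25", "46k..50k", 20),
    ("systems", "senior", "31..35", "66k..70k", 5),
    ("systems", "junior", "26..30", "46k..50k", 3),
    ("systems", "senior", "41..45", "66k..70k", 3),
    ("marketing", "senior", "36..40", "46k..50k", 10),
    ("marketing", "junior", "31..35", "41k..45k", 4),
    ("secretary", "senior", "46..50", "36k..40k", 4),
    ("secretary", "junior", "26..30", "26k..30k", 6),
]
LABEL, COUNT = 1, 4
ATTR_COLS = {"department": 0, "age": 2, "salary": 3}


def train(rows):
    """Accumulate count-weighted class totals and (attribute, value, class) totals."""
    class_n = defaultdict(int)
    cond_n = defaultdict(int)
    for r in rows:
        class_n[r[LABEL]] += r[COUNT]
        for attr, col in ATTR_COLS.items():
            cond_n[(attr, r[col], r[LABEL])] += r[COUNT]
    return class_n, cond_n


def posterior_scores(class_n, cond_n, t):
    """Unnormalized P(C) * product of P(A_k | C) for each class (no smoothing)."""
    total = sum(class_n.values())
    scores = {}
    for c, n_c in class_n.items():
        p = n_c / total                          # prior P(C)
        for attr, value in t.items():
            p *= cond_n[(attr, value, c)] / n_c  # conditional P(A_k | C)
        scores[c] = p
    return scores


# Age 28 lies in the range 26..30 and salary 50K lies in 46k..50k.
t = {"department": "systems", "age": "26..30", "salary": "46k..50k"}
class_n, cond_n = train(ROWS)
scores = posterior_scores(class_n, cond_n, t)
prediction = max(scores, key=scores.get)
```

Because no senior row has age 26..30, the unsmoothed conditional for that class is zero; a Laplace correction is one common way to avoid such zero products, if the extended algorithm is meant to include it.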

[20] Use this dataset to create a suitable new data file, either hwk03.arff or hwk03.csv, by replicating each row as many times as indicated in the count column. For example, the first row in the given table should appear 30 times in the new table. Then, remove the count column.
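
The replication step could be sketched as follows for the CSV variant, using only the standard csv module. The names SUMMARY, expand, and write_csv are illustrative, and the header row is an assumption about what the downstream program expects.

```python
import csv

# Summary table from the assignment: (department, status, age, salary, count).
SUMMARY = [
    ("sales", "senior", "31..35", "46k..50k", 30),
    ("sales", "junior", "26..30", "26k..30k", 40),
    ("sales", "junior", "31..35", "31k..35k", 40),
    ("systems", "junior", "21..25", "46k..50k", 20),
    ("systems", "senior", "31..35", "66k..70k", 5),
    ("systems", "junior", "26..30", "46k..50k", 3),
    ("systems", "senior", "41..45", "66k..70k", 3),
    ("marketing", "senior", "36..40", "46k..50k", 10),
    ("marketing", "junior", "31..35", "41k..45k", 4),
    ("secretary", "senior", "46..50", "36k..40k", 4),
    ("secretary", "junior", "26..30", "26k..30k", 6),
]
HEADER = ["department", "status", "age", "salary"]


def expand(summary):
    """Repeat each row `count` times and drop the count column."""
    return [list(row[:4]) for row in summary for _ in range(row[4])]


def write_csv(path, rows):
    """Write the header followed by the expanded rows."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)
```

Calling write_csv("hwk03.csv", expand(SUMMARY)) would then produce a file with one header line and 165 data rows.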
Write a program that trains a decision tree using the new data file as the training data and uses the decision tree to predict the status of a user-provided unseen tuple, for example,
t = <department: systems; status: ?; age: 28; salary: 50K>
Specifically, write either a Java program that uses Weka's J48 or a Python Jupyter notebook that uses scikit-learn's DecisionTreeClassifier to learn the decision tree. Note that scikit-learn requires categorical attributes to be encoded as integers.
You may have to convert the actual age and salary into the corresponding ranges for the decision tree to work on the unseen data.
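
The two preprocessing steps mentioned here, mapping an actual value to its range and encoding categorical values as integers, might look like the sketch below. The helper names and the choice of sorted-order integer codes are assumptions for illustration; the resulting integer vectors are in a form that a decision tree learner expecting numeric input could accept.

```python
AGE_RANGES = ["21..25", "26..30", "31..35", "36..40", "41..45", "46..50"]
SALARY_RANGES = ["26k..30k", "31k..35k", "36k..40k", "41k..45k", "46k..50k", "66k..70k"]
DEPARTMENTS = ["sales", "systems", "marketing", "secretary"]


def parse_range(label):
    """Turn '26..30' or '46k..50k' into an inclusive (low, high) pair of integers."""
    low, high = label.split("..")
    return int(low.rstrip("k")), int(high.rstrip("k"))


def to_range(value, labels):
    """Return the range label that contains value (salary given in thousands)."""
    for label in labels:
        low, high = parse_range(label)
        if low <= value <= high:
            return label
    raise ValueError(f"{value} is outside every known range")


def make_encoder(values):
    """Assign each distinct value an integer code in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


dept_code = make_encoder(DEPARTMENTS)
age_code = make_encoder(AGE_RANGES)
salary_code = make_encoder(SALARY_RANGES)


def encode_tuple(department, age, salary_k):
    """Encode an unseen tuple the same way the training rows would be encoded."""
    return [
        dept_code[department],
        age_code[to_range(age, AGE_RANGES)],
        salary_code[to_range(salary_k, SALARY_RANGES)],
    ]


encoded_t = encode_tuple("systems", 28, 50)
```

Note that the table has gaps (for example, no salary range covers 51K to 65K), so this sketch raises an error for such values rather than guessing a range.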

[20] Make another new dataset (named hwk0302.arff or hwk0302.csv) from the data file obtained in the previous exercise by converting the values in the age and salary columns to random values drawn from the range given in each row. For example, if the age of a row is "31..35", replace it with a random integer between 31 and 35 inclusive.
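
A minimal sketch of the randomization step is shown below, using the standard random module. It assumes salaries are drawn as whole thousands (so "46k..50k" becomes an integer from 46 to 50), which the assignment does not specify; the function names and the seed parameter are illustrative.

```python
import random


def sample_from_range(label, rng):
    """Draw a random integer from a label such as '31..35' or '46k..50k', inclusive."""
    low, high = (int(part.rstrip("k")) for part in label.split(".."))
    return rng.randint(low, high)  # randint includes both endpoints


def randomize(rows, seed=None):
    """Replace the age and salary ranges in (department, status, age, salary) rows."""
    rng = random.Random(seed)      # a fixed seed makes the output reproducible
    return [
        [dept, status, sample_from_range(age, rng), sample_from_range(salary, rng)]
        for dept, status, age, salary in rows
    ]
```

For example, randomize(rows, seed=0) applied to the replicated rows from the previous exercise would give each of the 30 copies of the first row its own age in 31..35 and its own salary in 46..50.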
Write a program that uses either Weka or scikit-learn to learn a Naive Bayes classifier and uses it to find the status of a user-provided unseen tuple, for example,
t = <department: systems; status: ?; age: 28; salary: 50K>
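
To make concrete what a Naive Bayes model over this mixed dataset might compute, the hand-rolled sketch below uses a Laplace-smoothed frequency for department and a Gaussian density for the numeric age and salary. This is an assumption about one reasonable treatment, not a description of what Weka or scikit-learn does internally, and the tiny training list is made up purely for the demonstration.

```python
import math
from collections import Counter, defaultdict


def fit(rows, var_floor=1e-6):
    """rows: (department, status, age, salary) tuples with numeric age and salary."""
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[1]].append(r)
    n_departments = len({r[0] for r in rows})
    model = {}
    for c, members in by_class.items():
        n = len(members)
        gauss = []
        for col in (2, 3):  # age, salary
            mean = sum(m[col] for m in members) / n
            var = sum((m[col] - mean) ** 2 for m in members) / n
            gauss.append((mean, max(var, var_floor)))  # floor avoids zero variance
        model[c] = {
            "log_prior": math.log(n / len(rows)),
            "dept": Counter(m[0] for m in members),
            "n": n,
            "gauss": gauss,
        }
    return model, n_departments


def log_gauss(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, n_departments, department, age, salary):
    scores = {}
    for c, m in model.items():
        # Laplace-smoothed categorical likelihood for department.
        s = m["log_prior"] + math.log((m["dept"][department] + 1) / (m["n"] + n_departments))
        s += log_gauss(age, *m["gauss"][0]) + log_gauss(salary, *m["gauss"][1])
        scores[c] = s
    return max(scores, key=scores.get), scores


# Made-up rows for demonstration only; salary is in thousands.
demo = [
    ("systems", "junior", 22, 47), ("systems", "junior", 24, 50),
    ("systems", "junior", 27, 48), ("sales", "junior", 29, 28),
    ("systems", "senior", 33, 68), ("systems", "senior", 43, 70),
    ("sales", "senior", 34, 47),
]
model, n_departments = fit(demo)
label, scores = predict(model, n_departments, "systems", 28, 50)
```

Working in log space keeps the product of small probabilities numerically stable; with the real hwk0302 data, the rows read from the file would replace the demo list.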