Unsupervised Clustering Solution



  • The assignment should be submitted in PDF format through Collab. If you prefer to hand-write the QA parts of your answers, please convert them into PDF form (e.g., by scanning or using phone apps like Office Lens).

  • For questions and clarifications, please post on Piazza.

  • Policy on collaboration:

Homework should be done individually: each student must hand in their own answers. It is acceptable, however, for students to collaborate in figuring out answers and helping each other solve the problems. We will be assuming that, with the honor code, you will be taking the responsibility to make sure you personally understand the solution to any work arising from such collaboration.

  • Policy on late homework: Homework is worth full credit if submitted by midnight on the due date. Each student has three extension days to be used at his or her own discretion throughout the entire course. Your grade will be discounted by 15% per day when you use these 3 late days. You may use the 3 days in whatever combination you like: for example, all 3 days on 1 assignment (for a maximum grade of 55%) or 1 day on each of 3 assignments (for a maximum grade of 85% on each). After you have used all 3 days, you cannot get credit for anything turned in late.

  • Unsupervised Learning with Clustering

In this programming assignment, you are required to implement a clustering algorithm: K-means clustering. A ZIP file has been provided (“data sets clustering.zip”) that includes two different datasets. Please follow all instructions for submitting source code.

You are required to submit a source-code file “clustering.py” containing the necessary functions for training and evaluation. The maximum number of iterations to be performed for both algorithms is 1000.

DO NOT use the scikit-learn package in this problem; please implement the algorithm from scratch.

1.1 Data description

We have provided two different datasets for clustering tasks.

Dataset 1: The first dataset consists of height and weight data for average people and baseball players. The first column contains human height (inches) and the second column contains human weight (lbs), while the third column holds the true labels of the samples, which are used only for evaluation.

Dataset 2: The second dataset is for a speech versus music classification task. This dataset has been preprocessed: the first 13 columns contain 13 features extracted from audio files, and the last column holds the true labels of the samples, which are used only for evaluation.

1.2 Load data

(Q1) You are required to code the following function for loading datasets: X = loadData(fileDj)
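As a rough illustration of (Q1), a minimal sketch is given below. It assumes each dataset file is plain text with whitespace-separated numeric columns; the actual file format is not stated here, so the delimiter may need adjusting (for example, delimiter="," for comma-separated files).

```python
import numpy as np


def loadData(fileDj):
    """Load one dataset file into a 2-D float array (one row per sample)."""
    # Assumption: numeric plain-text file, columns separated by whitespace.
    # ndmin=2 keeps the result two-dimensional even for a single-row file.
    return np.loadtxt(fileDj, ndmin=2)
```

The returned matrix still contains the true-label column, so the caller would separate features from labels, e.g. X[:, :-1] and X[:, -1].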


1.3 K-means Clustering

(Q2) Next, code the following function to implement k-means clustering: labels = kmeans(X, k, maxIter)

Here X is the input data matrix, k is the number of clusters, and maxIter is the maximum number of iterations, selected by you (maximum value = 1000).
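One possible shape for kmeans is sketched below using only NumPy. This is a sketch under assumptions, not a reference solution: centroids are initialised from k randomly chosen samples, distances are squared Euclidean, and the extra seed parameter is a hypothetical addition for reproducibility that is not part of the required signature.

```python
import numpy as np


def kmeans(X, k, maxIter=1000, seed=0):
    """Cluster the rows of X into k groups; returns one integer label per row."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)  # seed is a hypothetical extra parameter
    # Initialise centroids with k distinct samples chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(maxIter):
        # Assignment step: squared Euclidean distance to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        newLabels = dists.argmin(axis=1)
        # Stop early once the assignment no longer changes
        if np.array_equal(newLabels, labels):
            break
        labels = newLabels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels
```

Because the result depends on the random initialisation, running with several seeds and keeping the labelling with the lowest objective value is a common safeguard.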

(Q3) Implement k-means clustering for Dataset 1 (use the first two columns in the file as input) and use the scatter() function in the matplotlib package to visualize the result. The two clusters must be in different colors.
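The plotting step in (Q3) could look roughly like the sketch below. The helper name plotClusters is hypothetical, and the axis labels assume Dataset 1 (height in the first column, weight in the second). Calling scatter() once per cluster lets matplotlib pick a different color from its default cycle for each cluster.

```python
import numpy as np
import matplotlib.pyplot as plt


def plotClusters(X, labels):
    """Scatter-plot the first two columns of X, one color per cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    fig, ax = plt.subplots()
    for c in np.unique(labels):
        pts = X[labels == c]
        # One scatter() call per cluster: each call takes the next default color
        ax.scatter(pts[:, 0], pts[:, 1], label=f"cluster {c}")
    ax.set_xlabel("height (inches)")  # assumes Dataset 1 column order
    ax.set_ylabel("weight (lbs)")
    ax.legend()
    return fig, ax
```

After calling plotClusters(X[:, :2], labels), plt.show() displays the figure and fig.savefig(...) writes it to a file for the report.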

(Q4) Implement the k knee-finding method for Dataset 1 with k = {1, 2, …, 6} to select the value of k (number of clusters), and plot the graph of k versus objective function value (e.g., Slide 99, Lecture 20).
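For (Q4), the sketch below computes the curve that would be plotted. It assumes the objective is the usual k-means objective, i.e. the within-cluster sum of squared distances to each cluster mean; the lecture slide may define it slightly differently. The names kmeansObjective and kneeCurve are hypothetical, and clusterFn stands for any function with the kmeans(X, k, maxIter) signature.

```python
import numpy as np


def kmeansObjective(X, labels):
    """Sum of squared distances from each sample to its cluster mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


def kneeCurve(X, clusterFn, kValues=range(1, 7), maxIter=1000):
    """Objective value for each k; clusterFn(X, k, maxIter) must return labels."""
    return [kmeansObjective(X, clusterFn(X, k, maxIter)) for k in kValues]
```

Plotting the returned values against k = 1, …, 6 (for instance with matplotlib's plot()) gives the knee graph; the knee is the value of k after which the objective stops dropping sharply.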

(Q5) Now, code the following function to calculate the purity metric for the evaluation of results: purityMetric = purity(labels, trueLabels)

Use this function to evaluate the results of (Q3).
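A minimal sketch of the purity metric follows, assuming the standard definition: each cluster is credited with the count of its most frequent true label, and the credits are summed and divided by the total number of samples.

```python
import numpy as np


def purity(labels, trueLabels):
    """Fraction of samples that carry the majority true label of their cluster."""
    labels = np.asarray(labels)
    trueLabels = np.asarray(trueLabels)
    correct = 0
    for c in np.unique(labels):
        # Count how often each true label occurs inside cluster c
        _, counts = np.unique(trueLabels[labels == c], return_counts=True)
        correct += counts.max()
    return correct / len(labels)
```

For example, with labels [0, 0, 0, 1, 1, 1] and true labels [1, 1, 2, 2, 2, 2], cluster 0 contributes 2 and cluster 1 contributes 3, so the purity is 5/6. Purity does not depend on which integer names the clusters receive.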

1.4 How will your code be checked?

We will run the following command: “python clustering.py DatasetDirectoryFullPath” and your code should print the following results:

the scatter plots from (Q3)

the k knee-finding plot in (Q4)

ALL purityMetric values for results obtained in (Q3)

1.5 Submission

Please submit your source code as “clustering.py” and a PDF report containing your written answers via Collab. In the report, you should include the following contents:

ALL scatter plots generated in (Q3)

the k knee-finding plot in (Q4)

ALL purityMetric values for results obtained in (Q3)