The assignment should be submitted via Blackboard. Submit a file called assignment5.zip, containing the following two files:
- answers.pdf, for your answers to the written tasks, and for the output that the programming task asks you to include. Only PDF files will be accepted. All text should be typed, and any figures should be computer-generated. Scans of handwritten answers will NOT be accepted.
- k_means_cluster.m, containing your Matlab code for the programming part. In addition, you must include in your zip file any other source files (with auxiliary code) that are needed to run your solution.
These naming conventions are mandatory; non-adherence to these specifications can incur a penalty of up to 20 points.
Your name and UTA ID number should appear on the top line of both documents.
Task 1 (70 points, programming)
In this task you will implement k-means clustering.
Your function will be invoked as follows:
k_means_cluster(<data_file>, <k>, <iterations>)
The arguments provide the following information:
- The first argument, <data_file>, is the path name of a file where the data is stored. The path name can specify any file stored on the local computer.
- The second argument, <k>, specifies the number of clusters.
- The third argument, <iterations>, specifies the number of iterations of the main loop. The initialization stage (giving a random assignment of objects to clusters, and computing the means of those random assignments) does not count as an iteration.
The data file will follow the same format as the training and test files in the UCI datasets directory. A description of the datasets and the file format can be found at this link. Your code should also work with ANY OTHER training and test files that use the same format as the files in the UCI datasets directory.
As the description states, do NOT use data from the last column (i.e., the class labels) as features. In these files, all columns except for the last one contain example inputs. The last column contains the class label.
- Use the L2 distance (the Euclidean distance) for computing the distance between any two objects in the dataset.
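To make the expected flow concrete, here is a minimal sketch in Python (the submission itself must be Matlab code in k_means_cluster.m, so this is only an illustration). The names `kmeans`, `cluster_means`, and `euclidean` are placeholders, and re-seeding an empty cluster from a random data point is one possible policy, since the specification does not prescribe one:

```python
import math
import random

def euclidean(a, b):
    # L2 (Euclidean) distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_means(points, labels, k):
    # Mean vector of each cluster; an empty cluster is re-seeded
    # from a random data point (one possible policy, not mandated here).
    means = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            means.append([sum(col) / len(members) for col in zip(*members)])
        else:
            means.append(list(random.choice(points)))
    return means

def kmeans(points, k, iterations):
    # Initialization stage: random assignment of objects to clusters
    # (this stage does not count as an iteration).
    labels = [random.randrange(k) for _ in points]
    for _ in range(iterations):
        means = cluster_means(points, labels, k)
        # Reassign each point to the cluster with the nearest mean.
        labels = [min(range(k), key=lambda c: euclidean(p, means[c]))
                  for p in points]
    return labels
```

Remember that, when loading the data file, the last column (the class label) must be dropped before clustering.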
Output on the Screen
After the initialization stage, and after each iteration, you should print the value E(S1,S2,...,SK), as defined on page 27 of the clustering slides.
The output should follow this format:
After initialization: error = %.4f
After iteration 1: error = %.4f
After iteration 2: error = %.4f
...
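Assuming E(S1,S2,...,SK) is the within-cluster sum of squared L2 distances to the cluster means (verify this against the definition on page 27 of the clustering slides), the error computation and the required print format could be sketched as follows; `clustering_error` is an illustrative name, not part of the spec:

```python
import math

def euclidean(a, b):
    # L2 (Euclidean) distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def clustering_error(points, labels, means):
    # Assumed definition of E(S1,...,SK): sum, over all points, of the
    # squared distance from the point to its cluster's mean -- check this
    # against page 27 of the clustering slides.
    return sum(euclidean(p, means[l]) ** 2 for p, l in zip(points, labels))

# Printing, following the required format:
#   print('After initialization: error = %.4f' % err)
#   print('After iteration %d: error = %.4f' % (i, err))
```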
Output for answers.pdf
In your answers.pdf document, you need to provide the complete output for the following invocations of your program:
k_means_cluster('yeast_test.txt', 2, 5)
k_means_cluster('yeast_test.txt', 3, 5)
Grading
- 55 points: Correct implementation of k-means clustering.
- 15 points: Following the specifications in producing the required output.
Task 2 (10 points)
Consider the set of points shown in the figure above, where each dot corresponds to a point. Consider a clustering consisting of two clusters, where the first cluster is the set of all the blue dots, and the second cluster has the red dot as its only element. Can this clustering be the final result of the k-means algorithm? Justify your answer.
Your answer should be based on your visual estimations of Euclidean distances between dots. For this particular question, no greater precision is needed.
Task 3 (10 points)
For both parts of this task, assume that the algorithm in question never needs to break a tie (i.e., it never needs to choose between two distances that are equal).
Part a: Will the k-means algorithm always give the same results when applied to the same dataset with the same k? To phrase the question in an alternative way, is there any dataset where the algorithm can produce different results if run multiple times, with the value of k kept the same? If your answer is that k-means will always give the same results, justify why. If your answer is that k-means can produce different results if run multiple times, provide an example.
Part b: Same question as in part a, but for agglomerative clustering with the dmin distance. Here, as “result” we consider all intermediate clusterings, between the first step (with each object being its own cluster) and the last step (where all objects belong to a single cluster). Will this agglomerative algorithm always give the same results when applied to the same dataset? If your answer is “yes”, justify why. If your answer is “no”, provide an example.
Task 4 (10 points)
Consider a dataset consisting of these eight points: 2, 4, 7, 11, 16, 22, 29, 37.
Part a: Show the results (all intermediate clusterings) obtained by applying agglomerative clustering to this dataset, using the dmin distance.
Part b: Show the results (all intermediate clusterings) obtained by applying agglomerative clustering to this dataset, using the dmax distance.
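As a way to sanity-check the intermediate clusterings in both parts, here is a hedged Python sketch of agglomerative clustering on 1-D points. The names `dmin_dist`, `dmax_dist`, and `agglomerative` are illustrative only, and the sketch assumes that ties never occur (as Task 3 also assumes):

```python
def dmin_dist(A, B):
    # dmin (single-link): smallest pairwise distance between the clusters
    return min(abs(a - b) for a in A for b in B)

def dmax_dist(A, B):
    # dmax (complete-link): largest pairwise distance between the clusters
    return max(abs(a - b) for a in A for b in B)

def agglomerative(points, dist):
    # Returns every intermediate clustering, from singletons (first step)
    # to a single cluster containing all objects (last step).
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Merge the closest pair of clusters under the given distance.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return history
```

For the eight points in this task, `agglomerative([2, 4, 7, 11, 16, 22, 29, 37], dmin_dist)` (or with `dmax_dist`) returns exactly eight clusterings, since each merge step reduces the number of clusters by one.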
Optional Tasks, No Credit
Here are some tasks that are optional, and for which you will receive NO credit. These tasks will give you some additional hands-on experience with clustering.
- Optional Task 1, 0 points: Implement a way to identify when the k-means algorithm has converged, so as to stop the iterations of the main loop.
- Optional Task 2, 0 points: Implement EM clustering.
- Optional Task 3, 0 points: Implement k-medoid clustering, and apply it on the time series dataset of the optional DTW assignment, with DTW as the underlying distance measure.
- Optional Task 4, 0 points: Implement agglomerative clustering, with dmean as the distance measure between clusters. Apply your implementation on the time series dataset of the optional DTW assignment, with DTW as the underlying distance measure.
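For Optional Task 1, a common convergence test is to stop as soon as no object changes cluster between iterations (the means then stay fixed as well). A self-contained Python sketch, with illustrative names (`kmeans_converged`, `nearest`) and the same caveats as any sketch here (empty-cluster re-seeding is one possible policy, not mandated):

```python
import math
import random

def euclidean(a, b):
    # L2 (Euclidean) distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(p, means):
    # Index of the closest mean under the L2 distance
    return min(range(len(means)), key=lambda c: euclidean(p, means[c]))

def kmeans_converged(points, k, max_iterations=100):
    # Random initial assignment, then iterate until no label changes.
    labels = [random.randrange(k) for _ in points]
    for it in range(1, max_iterations + 1):
        means = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Empty cluster: re-seed from a random point (one possible policy)
            means.append([sum(col) / len(members) for col in zip(*members)]
                         if members else list(random.choice(points)))
        new_labels = [nearest(p, means) for p in points]
        if new_labels == labels:
            return labels, it   # converged: assignments stopped changing
        labels = new_labels
    return labels, max_iterations
```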