Assignment 1: Basic classifiers Solution



  • kD Tree

Design and code your own kD tree data structure as a Python object. Your tree can be less general than a standard implementation; you need only account for two data dimensions, and you need only support 1-nearest-neighbor search. Your class should be structured like the following:

class KDTree:

def __init__(self, matrix):

  • matrix should be a numpy array where each row is a data element

  • and each column is a feature. This operation should recursively

  • split the data along the median of each feature vector in turn.

def find_nearest(self, vector):

  • vector is a numpy array with a single row representing a single

  • data element. This method should traverse the tree to a leaf,

  • then search enough of the tree to guarantee finding the nearest

  • neighbor. This neighbor vector should be returned.

This data structure must work correctly, and you may use it in this assignment. In the future (and on the rest of this assignment, if you wish), you may want to use scipy.spatial.cKDTree instead; it is likely more exible and e cient.

  • Constructing a simple data set

Draw 5000 data points from each of two di erent 2D multivariate Gaussian distributions, with means and covariances of your own choosing. The two distributions should be distinct but over-lapping, as illustrated in Figure 1. You may use np.random.randn to generate your x1 and x2 coordinates independently, but this will limit your distributions to those with diagonal covariance matrices. Alternately, you may use np.random.multivariate normal to sample from an arbitrarily-covarying 2D distribution. You should end up with a matrix X, which will be 10000 2 in size, and a column vector y of 10000 elements. The ith element of y should be 0 if the ith row of X came from your rst distribution, and 1 if it came from the second.

Randomly partition X and y into training and test sets. The easiest way to do this is to construct a 10000-element column vector of booleans and use it as a mask.


Figure 1: 10000 data points from two Gaussian distributions

  • Linear classi er

Using your training set (which the equation below calls X), construct the maximum-likelihood linear least squares function to t your data.

= (XT X) 1XT y


Recall that Numpy multiplication is elementwise by default; use the dot() method for matrix multiplication. Inverses can be computed using Numpy’s linalg.inv() method.

You can now use to classify the elements of your test set. is itself a regression, of course, but you can use a threshold to turn it into a classi er: y^ = 0 if xT < 0:5, 1 otherwise.

Calculate and print the classi cation accuracy of your algorithm by comparing the true class y and the estimated class y^ for each element of your test set.

Produce a plot that displays all of the following in distinct colors or shapes:

Training set elements from class 0 Training set elements from class 1

Correctly classi ed test set elements from class 0 Correctly classi ed test set elements from class 1 Incorrectly classi ed test set elements from class 0 Incorrectly classi ed test set elements from class 1


Figure 2: 10000 data points in two classes drawn from ten di erent Gaussians

  • Nearest neighbors classi cation

Using the same training and test sets as before, construct a kD tree from your training set. Classify each element of your test set by looking up its nearest neighbor from the training set and assigning y^ to be whatever value y belongs to the nearest training example.

As before, calculate and show the classi cation accuracy and produce a similar plot.

  • Increasing complexity

Create a new set of training and test data. This time, each classi cation will be produced by multiple distributions, rather than just one. Draw 1000 samples each from ten di erent overlapping Gaussian distributions. Five of them should be labeled as class 0 and the others class 1. An example is shown in Figure 2. Perform the same linear and nearest neighbor classi cation processes, calculate the classi cation accuracy and plot the results.