Image Classification using Fully Connected Neural Network Solution



Problem 1. Write a function that evaluates the trained network (5 points), as well as computes all the subgradients of W1 and W2 using backpropagation (5 points)

Evaluation (5 points)

Algorithm 1 : Evaluation

import numpy as np

class SigmoidCrossEntropy(object):

    def crossEntropy(self, x, y, w1, w2, l2_penalty=0.0):
        # cross entropy loss, averaged over the examples (x holds P(y=1))
        E = -np.sum(y * np.log(x) + (1.0 - y) * np.log(1.0 - x)) / y.shape[0]
        # L2 regularization; the squared norm matches the l2_penalty * W term used in the update
        E += 0.5 * l2_penalty * (np.linalg.norm(w1) ** 2 + np.linalg.norm(w2) ** 2)
        return E

    def evaluate(self, x, y, w1, w2, l2_penalty=0.0):
        prob = self.sigmoid(x)  # P(y=1)
        E = self.crossEntropy(prob, y, w1, w2, l2_penalty)  # objective loss
        y_hat = 1 * (prob >= 0.5)  # class prediction
        accuracy = 1 - (np.sum(y_hat != y) / float(y.shape[0]))  # 1 - error rate
        performance = (E, accuracy)  # assumed contents: loss and accuracy
        return performance

For evaluation, the cross-entropy loss measures the prediction error, and the accuracy (one minus the misclassification error rate) gives the fraction of examples that are classified correctly. With $z_2$ the predicted probability $P(y = 1)$ and $N$ the number of examples:

$$E = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log z_{2,i} + (1 - y_i)\log(1 - z_{2,i})\right)$$

$$\text{Accuracy} = \frac{\text{number of correct classifications}}{\text{number of examples}}$$
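The two metrics above can be sketched as a small standalone function. This is a minimal illustration, not the submitted implementation: the function names and the `eps` clipping (added so that `log(0)` cannot occur) are assumptions, and the regularization term is left out.

```python
import numpy as np

def sigmoid(f):
    """Logistic function, mapping logits to P(y = 1)."""
    return 1.0 / (1.0 + np.exp(-f))

def cross_entropy_and_accuracy(logits, y, eps=1e-12):
    """Return (mean cross-entropy, accuracy) for binary labels y in {0, 1}."""
    z = sigmoid(logits)
    # Clip probabilities away from 0 and 1 (eps is an assumed safeguard).
    z = np.clip(z, eps, 1.0 - eps)
    loss = -np.mean(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))
    y_hat = (z >= 0.5).astype(y.dtype)  # class prediction
    accuracy = np.mean(y_hat == y)      # fraction classified correctly
    return loss, accuracy

logits = np.array([[2.0], [-1.0], [0.5], [-3.0]])
labels = np.array([[1.0], [0.0], [0.0], [0.0]])
loss, acc = cross_entropy_and_accuracy(logits, labels)
print(loss, acc)
```

In this toy batch only the third example is misclassified (its logit is positive but its label is 0), so the accuracy is 3 out of 4.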

Backpropagation (5 points)

Algorithm 2 : Backpropagation

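class LinearTransform(object):

    def backward(self, grad_output):
        # gradient w.r.t. the layer input; the call and its left operand were
        # lost in the listing and are reconstructed from the chain rule
        return np.dot(grad_output, self.W.T)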

class ReLU(object):

    def backward(self, grad_output):
        gradient = (self.x > 0) * 1.0
        gradient[np.where(self.x == 0)] = 0.5  # subgradient picked from [0, 1] at the kink
        return gradient * grad_output

class SigmoidCrossEntropy(object):

    def backward(self, grad_output):
        # dE/df2 of the combined sigmoid + cross entropy
        return (self.prob - self.y)

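class MLP(object):

    def train(self, x_batch, y_batch, learning_rate, momentum, l2_penalty):
        # backpropagation
        gradient3 = self.SCE.backward(0)
        gradient2 = self.LT2.backward(gradient3)
        gradient1 = self.ReLUf.backward(gradient2)
        # weight update; the left operands were lost in the listing and are
        # reconstructed from dE/dW2 = g (z2 - y) and dE/dW1 = x (dE/df1).
        # `hidden` is an assumed name for the ReLU output g of the forward pass.
        delta_w2 = np.dot(hidden.T, gradient3)
        delta_w1 = np.dot(x_batch.T, gradient1)
        self.LT2.update(delta_w2, learning_rate, momentum, l2_penalty)
        self.LT1.update(delta_w1, learning_rate, momentum, l2_penalty)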

The network composes three functions, each with its own gradient: the linear transform (f), the ReLU (g), and the sigmoid cross-entropy (E). For a single example, the loss can be written in terms of the feed-forward functions as follows.

$$E = -\left(y \log \sigma(f_2) + (1 - y)\log(1 - \sigma(f_2))\right)$$

$$f_2 = W_2^{T} g + c$$

$$g = \max(0, f_1)$$

$$f_1 = W_1^{T} x + b$$

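The derivative of each function is implemented from its own differentiation formula. Writing $z_2 = \sigma(f_2)$, the derivative of the combined sigmoid cross-entropy function, gradient3, is

$$\frac{\partial E}{\partial f_2} = z_2 - y$$

The derivative of the linear transform function with respect to g, gradient2, is

$$\frac{\partial f_2}{\partial g} = W_2$$

The (sub)derivative of the ReLU function, gradient1, is

$$\frac{\partial g}{\partial f_1} = \begin{cases} 1, & f_1 > 0 \\ [0, 1], & f_1 = 0 \\ 0, & f_1 < 0 \end{cases}$$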

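To calculate the delta of each weight matrix, we compute $\partial E / \partial W_2$ and $\partial E / \partial W_1$ and update the weights. Here $g' = \partial g / \partial f_1$.

$$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial f_2}\frac{\partial f_2}{\partial W_2} = (z_2 - y)\, g$$

$$\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial f_2}\frac{\partial f_2}{\partial g}\frac{\partial g}{\partial f_1}\frac{\partial f_1}{\partial W_1} = (z_2 - y)\, W_2^{T}\, g'\, x$$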
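One way to sanity-check these formulas is to compare them against finite differences. The sketch below is an illustration under stated assumptions rather than the submitted code: all function names are hypothetical, the batch is stored row-wise (so `x.dot(W1)` plays the role of $W_1^{T} x$), the loss is summed over the batch, and the regularization term is omitted.

```python
import numpy as np

def forward(x, y, W1, b, W2, c):
    """Forward pass; returns the summed cross-entropy and cached values."""
    f1 = x.dot(W1) + b                 # (N, H) pre-activations
    g = np.maximum(0.0, f1)            # ReLU
    f2 = g.dot(W2) + c                 # (N, 1) logits
    z2 = 1.0 / (1.0 + np.exp(-f2))     # sigmoid
    loss = -np.sum(y * np.log(z2) + (1.0 - y) * np.log(1.0 - z2))
    return loss, (f1, g, z2)

def backward(x, y, W2, cache):
    """Analytic gradients of the loss w.r.t. W1 and W2."""
    f1, g, z2 = cache
    grad3 = z2 - y                     # dE/df2
    dW2 = g.T.dot(grad3)               # dE/dW2 = g (z2 - y)
    grad2 = grad3.dot(W2.T)            # dE/dg
    grad1 = grad2 * (f1 > 0)           # dE/df1, ReLU derivative away from 0
    dW1 = x.T.dot(grad1)               # dE/dW1
    return dW1, dW2

def numerical_grad(loss_fn, W, h=1e-6):
    """Central-difference gradient of loss_fn() w.r.t. W (perturbed in place)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        plus = loss_fn()
        W[idx] = old - h
        minus = loss_fn()
        W[idx] = old                   # restore the original weight
        grad[idx] = (plus - minus) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
W1 = rng.normal(size=(4, 3))
b = rng.normal(size=3)
W2 = rng.normal(size=(3, 1))
c = rng.normal(size=1)

loss, cache = forward(x, y, W1, b, W2, c)
dW1, dW2 = backward(x, y, W2, cache)
loss_fn = lambda: forward(x, y, W1, b, W2, c)[0]
err1 = np.max(np.abs(dW1 - numerical_grad(loss_fn, W1)))
err2 = np.max(np.abs(dW2 - numerical_grad(loss_fn, W2)))
print(err1, err2)
```

If the analytic gradients are correct, both errors should be tiny compared with the gradient magnitudes. The check is only meaningful away from $f_1 = 0$, where the ReLU has a whole interval of subgradients.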

Problem 2. Write a function that performs stochastic mini-batch gradient descent training (5 points). You may use the deterministic approach of permuting the sequence of the data. Use the momentum approach described in the course slides.

Stochastic mini-batch gradient descent training (5 points)

Algorithm 3 : Stochastic mini-batch gradient descent

if __name__ == '__main__':

    for epoch in range(num_epochs):
        # shuffle the example indices instead of the examples themselves
        randList = np.arange(num_examples)
        np.random.shuffle(randList)
        # requires num_examples to be divisible by num_batches
        batches = randList.reshape((num_batches, num_examples // num_batches))

        for b in range(num_batches):
            x_batch = train_x[batches[b], :]
            y_batch = train_y[batches[b], :]
            total_loss = mlp.train(x_batch, y_batch, lr, momentum, l2_penalty)

For stochastic mini-batch gradient descent training, the whole training set must be divided into mini-batches. In my implementation, I first generate a randomly ordered list of example indices, randList (instead of shuffling the examples themselves), and then split that list into the chosen number of batches. Each batch is then trained on in turn, with its examples selected by the shuffled indices. The list is reshuffled at the start of every epoch.
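The index-shuffling idea can be sketched as a small generator. This is a hypothetical variant, not the code used for the experiments: it swaps the `reshape` call for `np.array_split`, which also works when the number of examples is not divisible by the number of batches, and the function name is an assumption.

```python
import numpy as np

def iterate_minibatches(num_examples, num_batches, rng):
    """Yield index arrays that together cover one shuffled epoch."""
    order = rng.permutation(num_examples)
    # array_split tolerates num_examples not divisible by num_batches;
    # the first few batches simply receive one extra index.
    for batch_idx in np.array_split(order, num_batches):
        yield batch_idx

rng = np.random.default_rng(0)
batches = list(iterate_minibatches(10, 3, rng))
for batch_idx in batches:
    print(batch_idx)
```

Each yielded array can be used directly as a row index, in the same way as `train_x[batches[b], :]` in the listing above.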

Momentum (5 points)

Algorithm 4 : Momentum

class LinearTransform(object):

    def update(self, delta, learning_rate=1.0, momentum=0.0, l2_penalty=0.0):
        # gradient of the L2 penalty term
        regularization = l2_penalty * self.W
        delta = delta + regularization
        # momentum: the velocity is a decaying sum of past steps
        self.velocity = momentum * self.velocity - learning_rate * delta
        self.W += self.velocity

On every mini-batch weight update, I apply the momentum factor together with the learning rate: the velocity accumulates a decaying sum of past gradient steps, and the weights move by the velocity.
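A small worked example may make the effect of the velocity term concrete. The sketch below restates the update rule as a pure function (the function name and the fixed, made-up gradient are assumptions for illustration only) and applies it twice.

```python
import numpy as np

def momentum_step(W, velocity, grad, learning_rate, momentum, l2_penalty):
    """One update: v <- momentum * v - lr * (grad + l2 * W); W <- W + v."""
    velocity = momentum * velocity - learning_rate * (grad + l2_penalty * W)
    return W + velocity, velocity

W = np.array([1.0, -2.0])
v = np.zeros_like(W)
grad = np.array([0.5, 0.5])  # made-up gradient, held fixed for illustration

for step in range(2):
    W, v = momentum_step(W, v, grad, learning_rate=0.1, momentum=0.8, l2_penalty=0.0)
    print(step, W, v)
```

With the gradient held constant, the first step moves each weight by 0.05 and the second by 0.09, because the second velocity adds 0.8 times the previous step to the new one. This accumulation is what speeds up progress along a consistent descent direction.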

Problem 3-6. 3) Train the network on all the training examples, tune your parameters (number of hidden units, learning rate, mini-batch size, momentum) until you reach a good performance on the testing set. What accuracy can you achieve? (20 points based on the report). 4) Training Monitoring: For each epoch in training, your function should evaluate the training objective, testing objective, training misclassification error rate, testing misclassification error rate (5 points). 5) Tuning Parameters: please create three figures with following requirements. Save them into jpg format:

i) test accuracy with different number of batch size: batch-test accuracy.png

ii) test accuracy with different learning rate: lr-test accuracy.png

iii) test accuracy with different number of hidden units: hidden units-test accuracy.png

6) Discussion about the performance of your neural network.

I first tuned the learning rate, which matters most for reaching a local minimum, and then tuned the mini-batch size, number of hidden units, momentum, and L2 penalty, in that order. In each section below, the range of the parameter under test is given in [...], followed by the fixed values of the other parameters. I used 100 epochs for all experiments.
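The one-parameter-at-a-time procedure can be sketched as follows. This is a simplified illustration: `tune_sequentially` and `toy_accuracy` are hypothetical names, the toy scoring function merely stands in for "train the network and return test accuracy", and the sketch always picks the highest score, whereas the choices reported below also weigh stability and running time.

```python
def tune_sequentially(evaluate, defaults, search_space):
    """Tune one hyperparameter at a time, keeping the best value found so far."""
    best = dict(defaults)
    for name, candidates in search_space:
        scores = {}
        for value in candidates:
            trial = dict(best, **{name: value})  # vary only `name`
            scores[value] = evaluate(**trial)
        best[name] = max(scores, key=scores.get)
    return best

def toy_accuracy(learning_rate, hidden_units):
    """Hypothetical stand-in for a full training run; peaks at (1e-5, 50)."""
    return -abs(learning_rate - 1e-5) * 1e4 - abs(hidden_units - 50) / 100.0

defaults = {"learning_rate": 1e-6, "hidden_units": 10}
search_space = [
    ("learning_rate", [1e-6, 5e-6, 1e-5, 5e-5, 1e-4]),
    ("hidden_units", [5, 10, 50, 100, 1000]),
]
best = tune_sequentially(toy_accuracy, defaults, search_space)
print(best)
```

Tuning sequentially keeps the number of training runs linear in the number of candidate values, at the cost of ignoring interactions between parameters that a full grid search would capture.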

Tuning learning rate

learning rate = [1e-06, 5e-06, 1e-05, 5e-05, 1e-04]

num batches = 1000

hidden units = 10

momentum = 0.8

l2 penalty = 0.001

Figure 1: Train Loss Figure 2: Train Accuracy

Analysis: Learning rates higher than 0.0001 were excluded from this experiment after I observed that they fluctuate without converging. Learning rates of 0.0001 or below let the model reach a local minimum, so I tested which of these values gives the highest accuracy. In the train loss graph (Fig. 1), convergence becomes slower as the learning rate gets smaller. The learning rates 0.0001 and 5e-05 do converge on the training data (Fig. 1), but both produce unstable test loss and accuracy (Fig. 3-4). Therefore, I chose 1e-05 as the learning rate for my model because it lets the model converge stably and yields high test accuracy.

Tuning mini-batch size

num batches = [5, 10, 50, 100, 500]

Figure 5: Train Loss Figure 6: Train Accuracy

Figure 7: Test Loss Figure 8: Test Accuracy

learning rate = 1e-05

hidden units = 10

momentum = 0.8

l2 penalty = 0.001

Analysis: The mini-batch size, surprisingly, does not significantly affect the loss or accuracy for either training or testing (Fig. 5-8). Rather, it influences running time, because it determines the size of the matrix computations. As shown in Table 1, extreme choices for the number of mini-batches, such as 10 or 1000, require more computation time. With 10 mini-batches, each update has to handle a matrix of 1000 examples; with 1000 mini-batches, each batch holds only 10 examples but many more update iterations are needed. I think this experiment shows the strength of the stochastic mini-batch approach: instead of learning from all the examples at once, we can learn from a subset at a time, save time, and still obtain reasonable results. Therefore, I chose 50 mini-batches, since this setting is time-efficient without significantly deteriorating the test accuracy.

Mini Batch Size | Test Accuracy (%) | Time Cost (s)

Table 1: Test accuracy and time cost with different mini batch size

Tuning the number of hidden units

hidden units = [5, 10, 50, 100, 1000]

learning rate = 1e-05

num batches = 50

momentum = 0.8

l2 penalty = 0.001

Figure 9: Train Loss Figure 10: Train Accuracy

Figure 11: Test Loss Figure 12: Test Accuracy

Analysis: The number of hidden units is the most influential parameter for obtaining higher test accuracy. In the experiments with different numbers of hidden units, test accuracy keeps increasing as the number of hidden units increases (Fig. 12). This suggests that this image classification task benefits from a more sophisticated neural network model. However, a large number of hidden units significantly increases running time, and beyond a certain point more hidden units no longer improve test accuracy. Therefore, the number of hidden units must be chosen carefully, weighing computing power against the amount of improvement. In my experiment, 500 would be a good choice for test accuracy if computing resources allow; otherwise, 50 units still give test accuracy comparable to 500, so I chose 50 hidden units for the rest of the training.

Number of Hidden Units | Test Accuracy (%) | Time Cost (s)

Table 2: Test accuracy and time cost with different number of hidden units

Tuning momentum

momentum = [0.0, 0.6, 0.7, 0.8, 0.9]

Figure 13: Train Loss Figure 14: Train Accuracy

Figure 15: Test Loss Figure 16: Test Accuracy

learning rate = 1e-05

num batches = 50

hidden units = 50

l2 penalty = 0.001

Analysis: Momentum is an important factor that, together with the learning rate, controls the weight changes and makes the model converge faster while damping the oscillation of gradient descent. In my experiments, I tested momentum values from 0.6 to 0.9, as well as 0.0. The result graphs (Fig. 13-16) show that increasing the momentum up to 0.9 speeds up convergence for both training and testing. However, in Fig. 15 the momentum values 0.9 and 0.8 show slight overfitting, with the test loss rising after a certain point. Therefore, I chose a momentum of 0.7, since it gives robust test accuracy and fast convergence.

Tuning l2 penalty

l2 penalty = [0.0, 0.001, 0.01, 1, 10]

learning rate = 1e-05

num batches = 50

hidden units = 50

momentum = 0.7

Analysis: The L2 penalty helps prevent overfitting and increase test accuracy. In my experiment, a very high penalty such as 10 is not a good choice because it deteriorates both train and test accuracy (Fig. 17-20). The other tested values of the L2 penalty show only a very small improvement in test accuracy and loss (Table 3). Although the improvement is marginal, I chose an L2 penalty of 1, which gives the highest accuracy among them.

L2 penalty | Test Accuracy (%)

Table 3: Test accuracy with different L2 penalty

In conclusion, tuning the parameters to appropriate values is very important both for training the model quickly and efficiently and for obtaining higher test accuracy. Tuning the learning rate and momentum is important for ensuring convergence to a local minimum. Proper choices of the mini-batch size and the number of hidden units are also important for running time and test accuracy. Both the number of hidden units and the L2 penalty should be chosen well to increase test accuracy while preventing overfitting.

Finally, I chose the parameter values below to train and evaluate my model.

learning rate = 1e-05

num batches = 50

hidden units = 50

momentum = 0.7

l2 penalty = 1

The performance of my neural network

* What accuracy can you achieve? 83.15%

I finally trained my model with the tuned parameters and obtained 83.15% test accuracy (although I could increase the accuracy up to 84.65% with 500 hidden units). Fig. 21 shows the accuracy on the train and test data; up to 100 epochs, overfitting did not occur. Fig. 22 shows the objective error (loss) on train and test; the train error is higher than the test error due to the regularization term. In sum, using an L2 penalty of 1, I could prevent overfitting while improving test accuracy and loss.

Figure 21: Train and Test Accuracy Figure 22: Train and Test Loss