In their famous article (see Learning Representations by Back-propagating Errors), Rumelhart, Hinton and Williams popularized the backpropagation mechanism for modifying the weights of the links between neurons of different layers. They also added a momentum term (parameter α, alpha) to the classical learning rate (η, eta) as a way to improve the Gradient Descent learning algorithm. In this post, we will study the learning curves of an MLP for the XOR function as the hyperparameters η (eta) and α (alpha) are varied.
Gradient Descent Algorithm
For an introduction to the MLP, XOR and learning curves, see my previous post Capability of the MLP to learn XOR. For an overview of Deep Learning, see Deep Learning in Neural Networks: An Overview.
For an introduction to Backpropagation and the Gradient Descent or Stochastic Gradient Descent (SGD) algorithms, see:
- Breaking down Neural Networks: An intuitive approach to Backpropagation
- Neural Networks: Feedforward and Backpropagation Explained & Optimization
- Wikipedia Backpropagation
- Intuitive Introduction to Gradient Descent
- Introduction to Stochastic Gradient Descent
- Wikipedia gradient descent
- Wikipedia Stochastic gradient descent
The MLP
For my tests, I chose an MLP with tanh as the activation function of the neurons. At the beginning of each training session, the weight of every link from one neuron to the neurons of the next layer was set to a random value.
The cost function used is the mean squared error (MSE).
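To make this setup concrete, here is a minimal sketch of the XOR training set and of these two building blocks in Python/NumPy. Coding the targets as −1/+1 is my assumption so that they stay within the range of tanh; the post only states that tanh and the MSE are used.

```python
import numpy as np

# XOR training set: 2 inputs, 1 output.
# The -1/+1 target coding is an assumption; it keeps targets in the range of tanh.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[-1.], [1.], [1.], [-1.]])

def activation(z):
    """tanh activation applied to every neuron."""
    return np.tanh(z)

def mse(outputs, targets):
    """Mean squared error between network outputs and targets."""
    return np.mean((outputs - targets) ** 2)
```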
Following Rumelhart, Hinton and Williams' article, the weight of each link between neurons is updated with the following formula, where the weight change moves against the gradient of the cost:
Δ Weight = −η * ∇MSE + α * Preceding Δ Weight
New Weight = Old Weight + Δ Weight
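Assuming the gradient ∇MSE with respect to a layer's weight matrix has already been computed by backpropagation, the update rule can be sketched as:

```python
import numpy as np

def momentum_update(weights, grad_mse, prev_delta, eta, alpha):
    """One gradient-descent step with momentum for one weight matrix.

    delta      = -eta * grad_mse + alpha * prev_delta
    new weight = old weight + delta
    """
    delta = -eta * grad_mse + alpha * prev_delta
    return weights + delta, delta

# Hypothetical usage for a 2-to-4 layer with eta = 0.1 and alpha = 0.9:
# W, dW = momentum_update(W, grad_W, dW, eta=0.1, alpha=0.9)
```

When α = 0, this reduces to plain Gradient Descent; the momentum term reuses a fraction α of the previous weight change to smooth and accelerate the descent.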
I ran my tests using two topologies (a minimal initialization sketch follows the list):
- 2 4 1 (2 neurons on the input layer, 4 neurons on the hidden layer, 1 neuron on the output layer)
- 2 4 4 1 (2 hidden layers with 4 neurons each)
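The uniform(−1, 1) range below is an assumption; the post only says the weights start at random values.

```python
import numpy as np

def init_weights(topology, rng=None):
    """One random weight matrix per pair of consecutive layers.

    topology = [2, 4, 1] gives matrices of shape (2, 4) and (4, 1).
    The uniform(-1, 1) range is an assumption of this sketch.
    """
    if rng is None:
        rng = np.random.default_rng()
    return [rng.uniform(-1.0, 1.0, size=(n_in, n_out))
            for n_in, n_out in zip(topology[:-1], topology[1:])]

weights_241 = init_weights([2, 4, 1])      # 2 4 1 topology
weights_2441 = init_weights([2, 4, 4, 1])  # 2 4 4 1 topology
```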
Results of the tests
A picture being worth a thousand words, let's see the results for a 2 4 1 topology and η = 0.
As we can see, with a learning rate of 0 the MLP seems unable to learn the XOR function.
Let's now see the learning curves for a 2 4 1 topology and η = 0.1.
When α = 0.9, the learning curve starts to become irregular. When α = 1, the MLP seems unable to learn the XOR function.
Let's see the learning curves for a 2 4 1 topology and η = 0.2.
As we can see, for α > 0.9 the MLP seems unable to learn the XOR function, and for α = 0.8 the learning curve is not smooth.
Let's see the learning curves for a 2 4 1 topology and η = 0.6.
As we can see, for all values of α the learning curve is not smooth, and for α ≥ 0.6 the MLP seems unable to learn the XOR function.
Finally, let's see the learning curves for a 2 4 1 topology and α = 0.
As we can see, the MLP is able to learn the XOR function, but it needs more entries in the training set as η increases. These graphs show that adding the α parameter improves the MLP's ability to learn the XOR function.
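For readers who want to reproduce this kind of sweep, below is a self-contained sketch of a 2 4 1 tanh MLP trained on XOR with the update rule above, recording the MSE once per epoch. Bias weights, the −1/+1 target coding and batch updates are assumptions of this sketch; the post plots the error against the number of training entries presented, so the exact curves will differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set with -1/+1 targets (an assumption of this sketch).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[-1.], [1.], [1.], [-1.]])

def train_xor(eta, alpha, epochs=2000):
    """Train a 2 4 1 tanh MLP on XOR and return its MSE learning curve."""
    # Random initial weights and biases (biases are an assumption of this sketch).
    W1 = rng.uniform(-1, 1, (2, 4)); b1 = rng.uniform(-1, 1, (1, 4))
    W2 = rng.uniform(-1, 1, (4, 1)); b2 = rng.uniform(-1, 1, (1, 1))
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
    curve = []
    for _ in range(epochs):
        # Forward pass through the 2 4 1 network.
        h = np.tanh(X @ W1 + b1)
        y = np.tanh(h @ W2 + b2)
        curve.append(np.mean((y - T) ** 2))
        # Backward pass: gradients of the MSE with respect to each weight matrix.
        e2 = 2.0 * (y - T) / len(X) * (1 - y ** 2)
        e1 = (e2 @ W2.T) * (1 - h ** 2)
        gW2, gb2 = h.T @ e2, e2.sum(axis=0, keepdims=True)
        gW1, gb1 = X.T @ e1, e1.sum(axis=0, keepdims=True)
        # Momentum update: delta = -eta * gradient + alpha * previous delta.
        dW2 = -eta * gW2 + alpha * dW2; W2 += dW2
        db2 = -eta * gb2 + alpha * db2; b2 += db2
        dW1 = -eta * gW1 + alpha * dW1; W1 += dW1
        db1 = -eta * gb1 + alpha * db1; b1 += db1
    return curve

# Sweep a few (eta, alpha) pairs and keep the resulting learning curves.
curves = {(eta, alpha): train_xor(eta, alpha)
          for eta in (0.1, 0.2, 0.6) for alpha in (0.0, 0.5, 0.9)}
```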
In my next article, I will compare these results for the 2 4 1 topology with those obtained for a 2 4 4 1 topology.