In this article I compare the learning curves of 2 different MLP topologies when modifying the hyperparameters η and α of the Gradient Descent algorithm used to learn the XOR function. It follows an earlier article (see Gradient Descent Algorithm: Impact of η and α on the learning curve of MLP for XOR) where I study the impact of η and α on the learning curve of an MLP with a 2-4-1 topology.
Gradient Descent Algorithm
For an introduction to MLPs, XOR and learning curves, see my previous post Capability of the MLP to learn XOR; for an overview of Deep Learning, see Deep Learning in Neural Networks: An Overview.
For an introduction to Backpropagation and the Gradient Descent or Stochastic Gradient Descent (SGD) algorithms, see:
- Breaking down Neural Networks: An intuitive approach to Backpropagation
- Neural Networks: Feedforward and Backpropagation Explained & Optimization
- Wikipedia Backpropagation
- Intuitive Introduction to Gradient Descent
- Introduction to Stochastic Gradient Descent
- Wikipedia gradient descent
- Wikipedia Stochastic gradient descent
The MLP
For my tests, I chose an MLP with tanh as the activation function of every neuron. At the beginning of each training session, the weight of every link from one neuron to the neurons of the next layer was set to a random value.
The cost function used is the mean squared error (MSE).
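As a concrete illustration, here is a minimal numpy sketch of that set-up (tanh activations, uniformly random initial weights, MSE cost) for the 2-4-1 case. The names, the initialisation range and the omission of biases are my own illustrative choices, not the code used for the tests.

```python
import numpy as np

rng = np.random.default_rng()

# Random initial weights for a 2-4-1 network (biases omitted for brevity).
w_hidden = rng.uniform(-1.0, 1.0, (4, 2))   # hidden layer: 4 neurons, 2 inputs
w_output = rng.uniform(-1.0, 1.0, (1, 4))   # output layer: 1 neuron, 4 hidden inputs

def forward(x):
    """Forward pass with tanh as the activation of every neuron."""
    hidden = np.tanh(w_hidden @ x)
    return np.tanh(w_output @ hidden)

def mse(prediction, target):
    """Mean squared error cost."""
    return np.mean((prediction - target) ** 2)
```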
Following the article by Rumelhart, Hinton and Williams, the weight of each link between neurons is updated using the formula:
ΔWeight = −η * ∇MSE + α * preceding ΔWeight
New Weight = Old Weight + ΔWeight
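In code, a single update step could look like the sketch below, where `grad_mse` stands for ∇MSE with respect to one weight matrix and `delta_w_prev` for the preceding ΔWeight of that matrix; the function name and the default values of η and α are illustrative only.

```python
import numpy as np

def momentum_step(w, grad_mse, delta_w_prev, eta=0.2, alpha=0.9):
    """Gradient descent with momentum: step against the gradient, plus inertia."""
    delta_w = -eta * grad_mse + alpha * delta_w_prev
    return w + delta_w, delta_w   # updated weights and the new "preceding ΔWeight"
```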
I ran my tests using 2 topologies (a training sketch covering both follows the list):
- 2-4-1 (2 neurons on the input layer, 4 neurons on the hidden layer, 1 neuron on the output layer).
- 2-4-4-1 (the same, but with 2 hidden layers of 4 neurons each).
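Here is a hedged, self-contained sketch of how a training run on XOR could be wired together for either topology, passed as a list of layer sizes. It follows the ingredients described above (tanh, random initial weights, MSE, momentum update), but all names, the full-batch training scheme and the chosen constants are my own assumptions, not the original test code.

```python
import numpy as np

def init_layers(topology, rng):
    """Random initial weights and biases for each pair of adjacent layers."""
    return [(rng.uniform(-1, 1, (n_out, n_in)),        # weights
             rng.uniform(-1, 1, (n_out, 1)))           # biases
            for n_in, n_out in zip(topology[:-1], topology[1:])]

def forward(layers, x):
    """Forward pass with tanh activations; keeps every activation for backprop."""
    activations = [x]
    for w, b in layers:
        x = np.tanh(w @ x + b)
        activations.append(x)
    return activations

def backward(layers, activations, target):
    """Gradients of the MSE cost w.r.t. every weight and bias (backpropagation)."""
    grads = []
    # Error at the output, times tanh'(z) = 1 - y^2
    delta = (activations[-1] - target) * (1 - activations[-1] ** 2)
    for (w, b), a_prev in zip(reversed(layers), reversed(activations[:-1])):
        grads.append((delta @ a_prev.T, delta))
        delta = (w.T @ delta) * (1 - a_prev ** 2)   # propagate to the previous layer
    return list(reversed(grads))

def train_xor(topology, eta=0.2, alpha=0.9, epochs=2000, seed=0):
    """Full-batch gradient descent with momentum on the 4 XOR patterns."""
    rng = np.random.default_rng(seed)
    layers = init_layers(topology, rng)
    velocity = [(np.zeros_like(w), np.zeros_like(b)) for w, b in layers]
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T
    Y = np.array([[0, 1, 1, 0]], dtype=float)
    n = X.shape[1]
    for _ in range(epochs):
        acts = forward(layers, X)
        grads = backward(layers, acts, Y)
        # Momentum update: ΔWeight = -η ∇MSE + α preceding ΔWeight
        new_layers, new_velocity = [], []
        for (w, b), (gw, gb), (vw, vb) in zip(layers, grads, velocity):
            vw = -eta * gw / n + alpha * vw
            vb = -eta * gb.sum(axis=1, keepdims=True) / n + alpha * vb
            new_layers.append((w + vw, b + vb))
            new_velocity.append((vw, vb))
        layers, velocity = new_layers, new_velocity
    return layers

# Usage: train the 2-4-1 topology and print its outputs for the 4 XOR patterns.
layers = train_xor([2, 4, 1])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T
print(forward(layers, X)[-1])
```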
Results of the tests
A picture being worth a thousand words, let's look at the 2 learning curves for different values of η and α.
Let’s see the learning curves for η = 0.1
Let’s see the learning curves for η = 0.2
Let’s see the learning curves for η = 0.6
The main lessons we can draw from these comparisons are:
- Adding a layer smooths the learning curve.
- Adding a layer usually increases the number of entries in the training set needed to learn XOR to within 10% accuracy.
- There are some values of η (around 0.2 for the 2-4-4-1 topology) for which adding a layer does not increase the number of entries in the training set needed to learn XOR to within 10% accuracy.