Gradient Descent
Just as the perceptron learning rule is guaranteed to succeed, so too is gradient descent. Of course a similar set of bounding constraints exist for gradient descent...
- The training set is linearly separable
- The learning rate is sufficiently small
This stands even with noisy data, and in the case of gradient descent.
- x - Input activity vector
- w - Weight vector
- t - Target perceptron output
- o - Perceptron output (pre-_activation_function, i.e. output of the combination function)
- z - Perceptron quantized output (post-_activation_function)
- b - Bias node
- D - Training set
- η - Learning rate
The Gradient Descent Algorithm
To adjust weights properly, one applies a general method for non-linear optimization that is called gradient descent. For this, the derivative of the error function with respect to the network weights is calculated, and the weights are then changed such that the error decreases (thus going downhill on the surface of the error function). For this reason, back-propagation can only be applied on networks with differentiable activation functions.
The stochastic gradient descent method selects training pairs at random rather than cycling through them as is the case with batch gradient descent.
The sum of squared error function is defined as follows...

...where E defines the error landscape of the weight space. Given a particular training set, the error space can look something like this for a perceptron with 2 inputs (and hence 2 weights)...

The aim is to minimize E by adjusting the weight vector, and this is done by taking the steepest downhill direction; the steepest downhill descent is obviously a function of the first derivative of the sum of squared error function, which can be expressed as follows...
![$\nabla\mbox{E}\left( w \right)\; \equiv \; \left[ \frac{\partial \mbox{E}}{\partial w_{0}},\; \frac{\partial \mbox{E}}{\partial w_{1}},\; ...,\; \frac{\partial \mbox{E}}{\partial w_{n}} \right]$](http://ai.autonomy.net.au/tracmath/45aaa537bcb0861de15c5752a64e3b57c8697152.png)
This is used in the training process, readjusting the weight vector as follows...

So to be able to make use of this, the derivative needs to be expressed in terms of the input vector x.
For the remainder of this paper, we will assume that the combination function is the adder function, which is simply the dot product between x the input vector and w the weight vector. The activation function is assumed to be the sigmoid (logistic) function.

For proof on the derivative of the sigmoid function, see the sigmoid page.

Threshold Activation Function
Recall that in the case of the threshold function, there is no gradient as such, and hence the weight vector update is not based on a gradient, rather calculated as follows...

For gradient descent method to be possible however, both the activation function and the error function must be differentiable. Two such examples have been illustrated in the following sections.
Identity Activation Function
Where the activation function is the identity function, we can derive the error function to be as follows...

From this, we can calculate the new weight vector as follows...

Sigmoid Activation Function
On the other hand, if the activation function was the sigmoid, we would instead get the following...

From this, we can calculate the new weight vector as follows...

Batch-Mode Gradient Descent (BGD)
In BGD, the entire error for one sweep is calculated, and then all weight vectors adjusted accordingly.

In BGD, the error is computed based on the entire sweep of the training set, and a small downwards step on the error function is made, based on this error.
Incremental (Stochastic) Gradient Descent (IGD)
In contrast to BGD, in IGD for each epoch in each sweep, the weight vectors are updated.

In IGD, the slope of the error function is calculated with respect only to one training example, following which a small step is taken to minimize the error; this is then repeated for the remainder of the training examples.
IGD can approximate BGD arbitrarily closely if the learning rate (η) is small enough.
References
Russell, S. and Norvig, P. (2003). Artificial Intelligence, A Modern Approach (2nd ed.). New Jersey, Pearson Education International. Chapter 20
Haykin, S. (1999). Neural Networks, A Comprehensive Foundation (2nd ed.). Upper Saddle River, New Jersey, Prentice Hall
Related
Ancestors ☣ Algorithm ➢ Algorithm/Learning
Other ☣ Learning/Machine/Perceptron ☣ Algorithm/Learning/BackPropagation ☣ Mathematics/Sigmoid

