![]() |
School of ECM University of Surrey Guildford, Surrey GU2 5XH, UK |
Tel: +44 (0)1483 259823 Fax: +44 (0)1483 876051 |
Learning Laws and Learning Equations

where
![]()

|
|

![]()
where E is the expectation operator.
|
Widrow and Hoff showed that the weight update law, called variously Widrow/Hoff learning law, the LMS learning law and the delta rule,
where
ais a positive constant, aka Rate of Learning usually selected by trial and error through the heuristic that if a is too large,
will take a long time to converge.
Typically,
with a=0.1 as a usual starting point. CHOOSING AND ADJUSTING a IS AN ART FORM THAT HAS TO BE LEARNT!
and
, when
largest 'eigenvalue' of
![]()
![]()
Error-correction learning relies on the error signal to compute
applied to the synaptic weight of the neuron k in accordance with the weight-change equation above.
The error signal is computed from
![]()
Recall the unit time delay operator z,
![]()
The internal activation of the neurons is computed by
![]()
and the squashed output by
![]()
where
is a squash function.
![]()
The above signal-flow graph illustrates Widrow Learning law: There is a mixture of signals and weights. The transfer function of and
is represented by
and that of the link between
and that of the link between
to
is represented by
.
Part of the output becomes the input.
ERROR CORRECTION LEARNING BEHAVES LIKE A CLOSED FEEDBACK LOOP. Therefore it is important to select the value of a with great care:
SMALL a |
|
LARGE a |
ACCELERATED LEARNING |
POSSIBILITY OF DIVERGENCE.
|
|
|
Error surface is a quadratic |
|
GLOBAL (MULTIPLE) MINIMA; LOCAL MINIMA. |
|
Hebb proposed that changes in the efficacy of synapses take place via GROWTH OF SYNAPTIC KNOBS.
THESE LAWS CAUSE WEIGHT CHANGES IN RESPONSE TO EVENTS WITHIN A PROCESSING ELEMENT THAT HAPPEN SIMULTANEOUSLY. THE LEARNING LAWS IN THIS CATEGORY ARE CHARACTERIZED BY THEIR COMPLETELY LOCAL - BOTH IN SPACE AND IN TIME-CHARACTER. |
Hebb's laws-proposals more precisely - indicate that associative learning, at the cellular level, which would result in an ENDURING modification in activity pattern of a spatially distributed "assembly of nerve cells".
NEUROCOMPUTING rendering of Hebb's laws:
![]()
A Hebbian synapse is a synapse that uses a time-dependent, highly local, and strongly interactive mechanism to increase synaptic efficiency as a function of the correlation between the pre-synaptic and post-synaptic activities.
LINEAR ASSOCIATOR - A SUBSTRATE FOR HEBBIAN LEARNING-BASED SYSTEMS (Recall, the affine combiner that acted as a substrate for Widrow-Hoff Systems).
|
|
LINEAR ASSOCIATOR
A linear associator network should learn m pairs of input/output vector when one of the input vectors, say
, is entered into the network, the output vector
should be
; when
is entered then the output should be
are small.
Recall that we wish to have a linear associator, such when we input a vector , where e is a small magnitude vector, thus
, then the output vector should be
.
The problem of training the linear associator is to associate with
is essentially the problem of finding the best weight matrix w.
AssumingHEBB'S LAW - biological learning law or conjecture:
new oldwhere
Hebb's law only functions when a vector - correct output - is supplied.
The weights are changed only on learning trials.
Hebbian learning starts with a weight matrix whose elements have an initial value set to zero.
Consider the outer product sum for w :
Recall
![]()
Hebbian learning can be understood in terms of what values the weights take on at the end of training, i.e. after all "m" training examples have been entered.
If there are "m" pairs of vectors, to be stored in a network ,then the training sequence will change the weight-matrix, w, from its initial value of ZERO to its final state by simply adding together all of the incremental weight change caused by the "m" applications of Hebb's law:
|
|

![]()
where F is a function of post-synaptic and pre-synaptic activities.
|
|

![]()
helps us in describing the possible changes in the weights during training
cycle.
| CASE I: If
CASE II: If
then
proportional to the post synaptic activity CASE III: If |

and that the generalized activity rule is

The above equation shows that the first term on the RHS shows CORRELATION
between the pre- and post-synaptic activities and that the repeated application
of
will lead to an exponential
growth.
The second term on the RHS, the non-linear forgetting factor, correlation
between post-synaptic activity and weight change, will help in arresting
the predicted exponential growth due to the first term on the RHS.
|
COMPETITIVE learning networks are trained using the so-called unsupervised regimen and in this training regimen there are no specific examples to be learned by the network.
A competitive learning rule-set:
|
KOHONEN LAYER: The N-processing elements of Kohonen layer each receive n inputs (from fan outs below) . Each input has weight
assigned to it.
When each input vector is entered into the Kohonen layer, the N processing elements compete on the basis of which term has its weight vector
closest to
. This proximity is measured in terms of distance function D.
The winner emits signal , the others emit signal
.
The input to Kohonen processing element i has real weight
assigned to it.
Each Kohonen processing element computes its input intensity according to
where
is a distance measurement function.
can be computed as an Euclidean distance
or as a spherical arc distance
![]()
and
are regarded as unit length vectors.
Usually in KOHONEN nets the Euclidean distance metric is preferred to the spherical arc distance.
Once each Kohonen unit has computed its intensity a competition takes place to see which unit has the SMALLEST INPUT intensity that is which Kohonen unit has its weight vector closest to
.
Implementing the Competition 1. The scheduling unit of the Kohonen layer can simply get
inputs from each of the Kohonen processing elements. These input,
, are then sorted by site. Then output and broadcast the number of the winning unit to all the other Kohonen units.
2. Lateral inhibition ® on-centre/off surround - is used to ensure that only the Kohonen unit with the smallest distance ends upactive. Finally, each processing unit compares its input intensity
value to those received from other processing elements to see which value is smaller.
Synaptic weight : This denotes the synaptic weight of an input node i to an output neuron j.
Þ Each neuron is allotted a fixed amount of synaptic weight which is distributed among its input nodes, such that
Training vector : Kohonen Learning Law is a good example of the so-called unsupervised learning law. An essential condition for the training data, used in training a neural net through Kohonen Learning Law, is that such data should really comprise input vectors
drawn at random in accordance with a fixed probability density function. If a fixed probability density function did exist then the Kohonen Law should produce a set of equiprobable weight vector!
Kohonen learning law: A neuron, in a Kohonen map, learns by shifting synaptic weights from its inactive to active input nodes.
As each vector is entered into the map, the Kohonen processing elements (denoted by
) compete with one another to determine the winner on the basis of minimum distance.
The winning Kohonen processing element is set to 1 and all other elements are set to zero
Once a winner has been identified synaptic weights are modified accordingly to Kohonen learning law, also known as the STANDARD COMPETITIVE LEARNING RULE
Kohonen learning law: Winner zi takes all and losers have nothing
![]()
![]()
Learning rate parameter a:
![]()
The Kohonen learning law moves the weight vector a fraction a of the way along a straight line from the old weight vector, wi(t), to the vector . (Recall that neurons learn by shifting synaptic weights)
The winning weight can be thought of as being drawn toward the vector, as though
were able to exert an attractive force only on its nearest weight vector.
Þ Exposure to the individual input vectors
draws individual Kohonen unit weight vectors into a cloud near where the vector
appeared as determined by the probability density function.
'Values' for learning rate: At the initiation of a training regimen for Kohonen maps, a, the so-called learning rate, is usually set to a value approximately equal to 0.8.
Þ As the training proceeds, that is, vectors wi, more into the area of (input) data, then a (or less) for final EQUILIBRATION.
______ . _______
In the simplest case of competitive learning we may have, the neural network has a SINGLE layer of output nodes, each output node is fully connected to the input nodes.
Þ The connections between the output layer nodes are such that these connections perform (lateral) inhibition. Rest of the connections are excitatory.
The winner-takes-all neurons organise themselves in a SELF ORGANISING FEATURE MAP (SOFM). In an SOFM, the neurons placed at the nodes of a LATTICE which are usually one- or two-dimensional. (Higher dimensions are possible but rarely encountered).Recall that the (winning) neurons selectively TUNE themselves to various input patterns or classes of input patterns during the COMPETITIVE PROCESS. The locations of the TUNED, i.e., THE WINNING NEURONS tend to become TOPOLOGICALLY so ordered such that a meaningful CO-ORDINATE SYSTEM for different input FEATURES is created over the lattice.
A SELF ORGANISING FEATURE MAP is characterised by the formation of a TOPOGRAPHIC MAP OF THE INPUT PATTERNS, IN WHICH THE SPATIAL LOCATIONS OF THE NEURONS IN THE LATTICE CORRESPOND TO THE INTRINSIC FEATURES OF THE INPUT PATTERNS.
BIOLOGICAL EVIDENCE: The brain, some authors claim, is organised in some places, like the cerebral cortex, so that different sensory inputs (textural, visual and acoustic) are represented by topologically ordered maps.
A computational map comprises a basic building block in the information-processing infrastructure. A Computational map is defined as:A MAP WHICH IS DEFINED BY AN ARRAY OF NEURONS REPRESENTING SLIGHTLY DIFFERENT TUNED PROCESSORS OR FILTERS, THAT OPERATE ON SENSORY INPUT.
Advantages of a computational map:
Efficient Information Processing |
RAPID HANDLING OF SUBSTANTIAL AMOUNTS OF INFO. |
Simplicity of Access to Processed Information |
SCHEMES OF CONNECTIVITY ARE SIMPLIFIED BY THE MAPS |
Common Form of Representation |
ALLOWS NERVOUS SYSTEMS TO MAKE SENSE VIA SINGLE STRATEGY |
Facilitation of additional interactions |
SHARPENING THE TUNING OF THE PROCESSING |

therefore
or
Þ THIS METHOD CONTINUALLY SEEKS THE BOTTOM POINT OF THE ERROR
SURFACE OF THE ONE-NEURON OUTPUT NEURAL NETWORK AKA WEINER'S SPACIAL
FILTER.
Recall that
(n) ![]()
(n)
n actually refers to iteration number.
![]()
This is the method of steepest descent in its purest form.
![]()


estimate of the weights
of the spatial filter

|
|
|
|
|
|
|
|
|
|
|
|
Minimize the mean squared (ensemble averaged) error J(n) |
Minimizes an instantaneous estimate of the cost function J(n). |
- The instantaneous value of the squared error for neuron j is
.
Instantaneous sum of squared errors E(n) is computed by summing
over all neurons in the output layers
The average squared error, i.e. the average over all training patterns N, is
.
En and Eav are functions of free parameters like synaptic weights, w, and thresholds q.
Back propagation algorithms is about adjusting the values of the free parameters to minimize e(n), E(n) and Eav
The key to the backpropagation algorithm is the computation of the gradient of the cost function J. We have simplified this computation of the gradient to the computation of the instantaneous sum of squared errors, E(n) and averaged square error. Assuming that J is differentiable, the direction in which the weight vector is to be moved for J to decrease most rapidly is given by
In multi-layered networks the computation of J, rather the computation of E and Eav, is involved because of the input-output relations between the various layers.
Neuron j, learning through back propagation receives input from a set of function signals from its left.
Net internal activity
Output from neuron j at iteration n is
The LMS method was devised to compute synaptic weight change, such as to minimize
.
.
Like LMS method, backpropagation algorithm provides a similar prescription for the computation of synaptic weight change:
The above is called the delta rule and h is called the LEARNING RATE PARAMETER of the backpropagation algorithm.
Now
But E also depends upon, the activation function v, the output function yi and the error function ej. Therefore by CHAIN RULE
Þ Recall that
![]()
Recall that
![]()
Recall that
![]()
![]()
![]()
![]()
![]()
Let
![]()
![]()
is called the local gradient inthat it is dependent on the NET INTERNAL ACTIVITY OF THE NEURON j and its error signal.
KEY FACTOR in the computation of SYNAPTIC WEIGHT CHANGE, Dwij, is the ERROR SIGNAL ej(n).
The above calculation of synaptic weight change applies to a multilayer perception: Dwji is the weight change of connection strengths between a neuron j and all other neurons, i, in the layer to the left of neuron j.
Now if neuron j is in the output node of given neural network, then the weight change calculation is relatively simple inthat neuron j has its own DESIRED RESPONSE SIGNAL j. But what if the neuron j is not in the output layer, that is, if neuron j is one of the hidden layer neurons?
CASE II : NEURON j IS A HIDDEN NODECase I : Neuron j is an output node neuron:
![]()
= h ej(n) j'(vj(n)) yi(n)
dj(n) - yi(n)
The outstanding problem here is to compute the first-order variation inthe output neuron j with respect to the net internal activity
Case II : Neuron j is one of the hidden layer neurons. Here there is no prespecified erorr signal.
THE ERROR SIGNAL FOR THE HIDDEN LAYER NEURONS IS TO BE COMPUTED RECURSIVELY IN TERMS OF THE ERROR SIGNALS OF ALL THE NEURONS TO WHICH IT IS CONNECTED!
ERROR SIGNAL FOR A HIDDEN NEURON HAS TO BE COMPUTED RECURSIVELY IN TERM S OF THE ERROR SIGNALS OF ALL THE NEURONS TO WHCH THE HIDDEN NEURON IS DIRECTLY CONNECTED.
Þ Rewrite, the local gradient, defined as,
CASE II : When j is a hidden neuron
as
but
![]()
Recall that the instantaneous sum of squared error is given as
NOTE THAT THE INDEX 'k' is used instead of 'j' as 'j' is used to label the hidden layer neuron.
![]()
Using the chain rule,
Remember that
Also
q = total no. of inputs
![]()
|
|
COMPUTATION OF SYNAPTIC WEIGHT CHANGE

Þ and, the synaptic weight change for the hidden neuron j is given
as
Þ Computation of the local gradient depends solely on the activation function of neuron j. The weighted sum of the (output) layer's focal gradient (dk) is a CONSTANT NUMBER!
(a) choosing an 'appropriate' learning rate
(b) computing the local gradients dj(n), which means the computation
of the derivative of the activation function j.
THE EXISTENCE of this derivative requires that j(×) is continuous.

The above function is NON-SYMMETRIC
.
It has been argued that in order to make the BP network faster it is important to choose an asymmetric function:

Since
then. How?
COMPUTATION OF SYNAPTIC WEIGHT CHANGE
NON-SYMMETRIC SIGMOIDAL NON LINEAR FUNCTION
|
|
|
|
|
![]() |
BOTH INVOLVE COMPUTATION OF
- Now
Given that
\ maximum weight change for neurons whose function signal is in the midrange for a sigmoidal activation function.
poss. maximum
OUTPUT NEURONS
![]()
IT IS THIS FEATURE, THAT OF MAXIMAL SYNAPTIC WEIGHT CHANGE FOR NEURONS ACTIVE IN THE MIDRANGE (yj(n)@0.5), CONTRIBUTES MUCH TO ITS STABILITY AS A LEARNING ALGORITHM.
Remember
CASE I: WHEN NEURON 'j' IS IN THE OUTPUT LAYER
dj(n) =
![]()
![]()
|
|
CASE II : WHEN NEURON 'j' IS IN AN ARBITRARY HIDDEN LAYER
RATE OF LEARNING
BP algorithm provides an 'approximation' to the trajectory in weight space computed by the method of steepest descent.
The learning rate parameter has considerable influence on how smooth the trajectory is in weigh space and (consequently) how quickly the system completes its training.
FOR SMALL h:
![]()
FOR LARGE h:
![]()
![]()
Rumelhart, Hinton and Williams1 1(1986) a modification to the synaptic weight change computations, by adding a term such that the learning rate is accelerated without compromising the stability of the network. this is the so-called GENERALISED DELTA RULE
Dwji(n) = aDwji(n-1) + hdj(n)yi(n) NEW TERM a =momentum constant, a small positive number.
Now Dwji(n) = a(aDwji(n-2) + hdj(n-1)yi(n-1))
+ hdj(n)yi(n)
= a2Dwji(n-2) + h(adj(n-1)yi(n-1)
+ dj(n)yi(n))
Dwji(n) = a2(aDwji(n-3) + hdj(n-2)yi(n-2))
+ h(adj(n-1)yi(n-1) + dj(n)yi(n)
![]()
|
represents the sum of an exponentially weighted time series and for this series to be convergent
.
For a=0 there is no momentum correction.
If on two consecutive iterations, hs the same sign, then wji(t) is adjusted by a large amount. The inclusion of momentum rate tends to accelerate the descent.
Remember
However, if on two consecutive iterations
has the opposite
shrinks in magnitude and the momentum factor stabilizes the network by small weight changes.