University of Surrey
School of ECM
University of Surrey
Guildford, Surrey
GU2 5XH, UK

Tel: +44 (0)1483 259823
Fax: +44 (0)1483 876051


 
 



Learning Laws and Learning Equations

Widrow-Hoff Learning

Hebbian Learning

Self-organising Kohonen Maps

Backpropagation Learning

























Widrow-Hoff Learning


 
 

  •  The overall aim is to minimize a cost function based on the error signal, .
  •  Using a mean-square criterion, defined as the MEAN-SQUARE VALUE of the SUM OF SQUARED ERRORS, a cost function J is defined as
  • where


               Þ SUCH KNOWLEDGE is usually not available, hence one has to make simplifying assumption!
     
    we have omitted  from this formula as it was inserted for convenience only.
    J(w) is a quadratic function of w

     




    ONE SHOULD JUST USE EACH  VECTOR AS CORRECTION TO w!
     

  • Widrow and Hoff showed that the weight update law, called variously Widrow/Hoff learning law, the LMS learning law and the delta rule,

  • where
    is a positive constant, aka Rate of Learning usually selected by trial and error through the heuristic that if a is too large,  will take a long time to converge.

    Typically,  with a=0.1 as a usual starting point. CHOOSING AND ADJUSTING a IS AN ART FORM THAT HAS TO BE LEARNT!

    and , when  largest 'eigenvalue' of 
     
     


     

  • Error-correction learning relies on the error signal  to compute  applied to the synaptic weight of the neuron k in accordance with the weight-change equation above.

  •  

  • The error signal is computed from

  •  


     

  • Recall the unit time delay operator z,

  •  

    The internal activation of the neurons is computed by

    and the squashed output by

    where  is a squash function.


     
     


  • The above signal-flow graph illustrates Widrow Learning law: There is a mixture of signals and weights. The transfer function of  and  is represented by  and that of the link between  and that of the link between  to  is represented by .

  • Part of the output becomes the input.

  • ERROR CORRECTION LEARNING BEHAVES LIKE A CLOSED FEEDBACK LOOP. Therefore it is important to select the value of a with great care:
  • SMALL a
    SLOW LEARNING
    LARGE a
    ACCELERATED LEARNING
     
    POSSIBILITY OF DIVERGENCE.

    PROCESSING UNIT CHARACTERISTICS
    TYPE(S) OF MINIMA ON THE SURFACE
    Linear Processing Units

    ADALINE is one example. Starting from an arbitrary point always achieves GLOBAL MINIMA)

    Error surface is a quadratic
    function of the weights; Bowl shaped surface; Unique Minima
    Non-linear Processing Units

    GLOBAL MINIMA possible but not always achievable due to the entrapment by local minima.
     
     

    GLOBAL (MULTIPLE) MINIMA; LOCAL MINIMA.


    Hebbian Learning
  • DONALD HEBB, a Canadian psychologist, was interested in investigating PLAUSIBLE MECHANISMS FOR LEARNING AT THE CELLULAR LEVELS IN THE BRAIN. (see for example, Donald Hebb's (1949) The Organisation of Behaviour. New York: Wiley)

  •  
  • Hebb postulated that
  • When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency as one of the cells firing B, is increased.

    THESE LAWS CAUSE WEIGHT CHANGES IN RESPONSE TO EVENTS WITHIN A PROCESSING ELEMENT THAT HAPPEN SIMULTANEOUSLY. THE LEARNING LAWS IN THIS CATEGORY ARE CHARACTERIZED BY THEIR COMPLETELY LOCAL - BOTH IN SPACE AND IN TIME-CHARACTER.
  •  Hebb's laws-proposals more precisely - indicate that associative learning, at the cellular level, which would result in an ENDURING modification in activity pattern of a spatially distributed "assembly of nerve cells".

  •  
  • NEUROCOMPUTING rendering of Hebb's laws:

  •  


     
     

  • A Hebbian synapse is a synapse that uses a time-dependent, highly local, and strongly interactive mechanism to increase synaptic efficiency as a function of the correlation between the pre-synaptic and post-synaptic activities.

  •  

  • LINEAR ASSOCIATOR - A SUBSTRATE FOR HEBBIAN LEARNING-BASED SYSTEMS (Recall, the affine combiner that acted as a substrate for Widrow-Hoff Systems).
  • LINEAR ASSOCIATOR


  • A linear associator network should learn m pairs of input/output vector  when one of the input vectors, say , is entered into the network, the output vector  should be ; when  is entered then the output should be  are small.

  •  
     

  • Recall that we wish to have a linear associator, such when we input a vector , where e is a small magnitude vector, thus , then the output vector should be .

  •  
  • The problem of training the linear associator is to associate  with  is essentially the problem of finding the best weight matrix w.

  •  
     

  • HEBB'S LAW - biological learning law or conjecture:

  •                                                                            new              old

    where


  • Hebb's law only functions when a  vector - correct output - is supplied.

  •  
    The weights  are changed only on learning trials.
  • Hebbian learning starts with a weight matrix whose elements have an initial value set to zero.

  •  
  • Consider the outer product sum for w :

  •  

    Recall


     

    Assuming  is the orthonormal Kronecher's Delta function
     

  • Hebbian learning can be understood in terms of what values the weights take on at the end of training, i.e. after all "m" training examples  have been entered.

  •  
  • If there are "m" pairs of vectors,  to be stored in a network ,then the training sequence will change the weight-matrix, w, from its initial value of ZERO to its final state by simply adding together all of the incremental weight change caused by the "m" applications of Hebb's law:







    • Unlike neural networks trained on, say, Hebb's law, where SEVERAL output neurons may be active simultaneously, networks trained using competitive laws allow only ONE OUTPUT neuron to be active at ANYONE TIME.
  • COMPETITIVE learning networks are trained using the so-called unsupervised regimen and in this training regimen there are no specific examples to be learned by the network.

  •  

  • A competitive learning rule-set:
    1. A set of neurons that are ALL THE SAME except for some randomly distributed synaptic weights and therefore, respond differently to a given set of input patterns.
    2. A limit is imposed (implicitly) on the strength of each neuron.
    3. A mechanism that permits the neurons to compete for the RIGHT TO RESPOND to a given input subset, such that only one output neuron, or only one group of output neurons per group is active or 'on' at a time.
    4. The winning neuron is called WINNER-TAKES-ALL-NEURON.
  • KOHONEN LAYER: The N-processing elements of Kohonen layer each receive n inputs (from fan outs below) . Each input has weight  assigned to it.

  •  
  • When each input vector  is entered into the Kohonen layer, the N processing elements compete on the basis of which term has its weight vector  closest to . This proximity is measured in terms of distance function D.

  •  
  • The winner emits signal , the others emit signal .

  • The  input to Kohonen processing element i has real weight  assigned to it.

  •  

  • Each Kohonen processing element computes its input intensity  according to

  • where  is a distance measurement function.
     

  • can be computed as an Euclidean distance

  • or as a spherical arc distance

    and  are regarded as unit length vectors.
     

  • Usually in KOHONEN nets the Euclidean distance metric is preferred to the spherical arc distance.

  •  
  • Once each Kohonen unit has computed its intensity a competition takes place to see which unit has the SMALLEST INPUT intensity that is which Kohonen unit has its weight vector  closest to .

  •  

  • Implementing the Competition
  •  Synaptic weight : This denotes the synaptic weight of an input node i to an output neuron j.

  •  

    Þ Each neuron is allotted a fixed amount of synaptic weight which is distributed among its input nodes, such that

  •  Training vector : Kohonen Learning Law is a good example of the so-called unsupervised learning law. An essential condition for the training data, used in training a neural net through Kohonen Learning Law, is that such data should really comprise input vectors  drawn at random in accordance with a fixed probability density function. If a fixed probability density function did exist then the Kohonen Law should produce a set of equiprobable weight vector!

  •  
  • Kohonen learning law: A neuron, in a Kohonen map, learns by shifting synaptic weights from its inactive to active input nodes.

  •  

    As each vector is entered into the map, the Kohonen processing elements (denoted by ) compete with one another to determine the winner on the basis of minimum distance.
     
     


     
  • The winning Kohonen processing element is set to 1  and all other elements are set to zero Once a winner has been identified synaptic weights are modified accordingly to Kohonen learning law, also known as the STANDARD COMPETITIVE LEARNING RULE

  •  
  • Kohonen learning law:
  • Winner zi takes all and losers  have nothing

  •  


     
     


    Learning rate parameter a:


     

  • The Kohonen learning law moves the weight vector a fraction a of the way along a straight line from the old weight vector, wi(t), to the vector . (Recall that neurons learn by shifting synaptic weights)

  •  
  • The winning weight can be thought of as being drawn toward the  vector, as though  were able to exert an attractive force only on its nearest weight vector.

  •  

    Þ Exposure to the individual input vectors  draws individual Kohonen unit weight vectors into a cloud near where the vector  appeared as determined by the probability density function.
     

  • 'Values' for learning rate: At the initiation of a training regimen for Kohonen maps, a, the so-called learning rate, is usually set to a value approximately equal to 0.8.

  •  

    Þ As the training proceeds, that is, vectors wi, more into the area of (input) data, then a (or less) for final EQUILIBRATION.

    ______ . _______


  • In the simplest case of competitive learning we may have, the neural network has a SINGLE layer of output nodes, each output node is fully connected to the input nodes.

  •  

    Þ The connections between the output layer nodes are such that these connections perform (lateral) inhibition. Rest of the connections are excitatory.



    The winner-takes-all neurons organise themselves in a SELF ORGANISING FEATURE MAP (SOFM). In an SOFM, the neurons placed at the nodes of a LATTICE which are usually one- or two-dimensional. (Higher dimensions are possible but rarely encountered).

    Recall that the (winning) neurons selectively TUNE themselves to various input patterns or classes of input patterns during the COMPETITIVE PROCESS. The locations of the TUNED, i.e., THE WINNING NEURONS tend to become TOPOLOGICALLY so ordered such that a meaningful CO-ORDINATE SYSTEM for different input FEATURES is created over the lattice.

    A SELF ORGANISING FEATURE MAP is characterised by the formation of a TOPOGRAPHIC MAP OF THE INPUT PATTERNS, IN WHICH THE SPATIAL LOCATIONS OF THE NEURONS IN THE LATTICE CORRESPOND TO THE INTRINSIC FEATURES OF THE INPUT PATTERNS.

    BIOLOGICAL EVIDENCE: The brain, some authors claim, is organised in some places, like the cerebral cortex, so that different sensory inputs (textural, visual and acoustic) are represented by topologically ordered maps.
     
     
     


    A computational map comprises a basic building block in the information-processing infrastructure. A Computational map is defined as:

    A MAP WHICH IS DEFINED BY AN ARRAY OF NEURONS REPRESENTING SLIGHTLY DIFFERENT TUNED PROCESSORS OR FILTERS, THAT OPERATE ON SENSORY INPUT.
     

    Advantages of a computational map:

    Efficient Information Processing
    RAPID HANDLING OF SUBSTANTIAL AMOUNTS OF INFO.
    Simplicity of Access to Processed Information
    SCHEMES OF CONNECTIVITY ARE SIMPLIFIED BY THE MAPS
    Common Form of Representation
    ALLOWS NERVOUS SYSTEMS TO MAKE SENSE VIA SINGLE STRATEGY
    Facilitation of additional interactions
    SHARPENING THE TUNING OF THE PROCESSING
     

    Backpropagation Learning





     

    Steepest Descent
    Least-Mean Square
    1. Weight vectors:
    starts at an initial value w(0) 
    starts at an estimate vector
    2. Trajectory of the vector along the error surface
    follows a precise trajectory,

    provided h is chosen properly the trajectory. The trajectory terminates in the optimum vector wo

    follows a random trajectory (stochastic gradient)
    3. Computation of cost function J(n)
    Minimize the mean squared (ensemble averaged) error J(n)

    Gradient vector at every iteration is 'EXACT'

    Sum of error squares ETotal(n) integrated over and up to n.

    Store estimates of rx and rdx
     

    Minimizes an instantaneous estimate of the cost function J(n).

    Gradient vector is 'RANDOM'

    only instantaneous information is required therefore minimal storage

    .

    Instantaneous sum of squared errors E(n) is computed by summing  over all neurons in the output layers
     
     

  • The average squared error, i.e. the average over all training patterns N, is

  • .

  • En and Eav are functions of free parameters like synaptic weights, w, and thresholds q.

  •  

    Back propagation algorithms is about adjusting the values of the free parameters to minimize e(n), E(n) and Eav
     

  • The key to the backpropagation algorithm is the computation of the gradient of the cost function J. We have simplified this computation of the gradient to the computation of the instantaneous sum of squared errors, E(n) and averaged square error. Assuming that J is differentiable, the direction in which the weight vector is to be moved for J to decrease most rapidly is given by

  • In multi-layered networks the computation of J, rather the computation of E and Eav, is involved because of the input-output relations between the various layers.

  •  
  • Neuron j, learning through back propagation receives input from a set of function signals from its left.

  •  

     






    Net internal activity

    Output from neuron j at iteration n is

  • The LMS method was devised to compute synaptic weight change,  such as to minimize .

  •  
  • Like LMS method, backpropagation algorithm provides a similar prescription for the computation of synaptic weight change:

  • The above is called the delta rule and h is called the LEARNING RATE PARAMETER of the backpropagation algorithm.

  •  
  • Now



  • But E also depends upon, the activation function v, the output function yi and the error function ej. Therefore by CHAIN RULE

  • Þ Recall that 


    Recall that 

    Recall that 


     
     


     



     

    Let 
     


     
     

  •   is called the local gradient inthat it is dependent on the NET INTERNAL ACTIVITY OF THE NEURON j and its error signal.

  •  
  • KEY FACTOR in the computation of SYNAPTIC WEIGHT CHANGE, Dwij, is the ERROR SIGNAL ej(n).

  •  
  •  

  •  

    The above calculation of synaptic weight change applies to a multilayer perception: Dwji is the weight change of connection strengths between a neuron j and all other neurons, i, in the layer to the left of neuron j.
     

  • Now if neuron j is in the output node of given neural network, then the weight change calculation is relatively simple inthat neuron j has its own DESIRED RESPONSE SIGNAL j. But what if the neuron j is not in the output layer, that is, if neuron j is one of the hidden layer neurons?

  •  

     
  • Case I : Neuron j is an output node neuron:

  •  

    = h ej(n) j'(vj(n)) yi(n)
              ­

    dj(n) - yi(n)

    The outstanding problem here is to compute the first-order variation inthe output neuron j with respect to the net internal activity


  • Case II : Neuron j is one of the hidden layer neurons. Here there is no prespecified erorr signal.

  •  

    THE ERROR SIGNAL FOR THE HIDDEN LAYER NEURONS IS TO BE COMPUTED RECURSIVELY IN TERMS OF THE ERROR SIGNALS OF ALL THE NEURONS TO WHICH IT IS CONNECTED!
     

    CASE II : NEURON j IS A HIDDEN NODE
  • ERROR SIGNAL FOR A HIDDEN NEURON HAS TO BE COMPUTED RECURSIVELY IN TERM S OF THE ERROR SIGNALS OF ALL THE NEURONS TO WHCH THE HIDDEN NEURON IS DIRECTLY CONNECTED.

  •  

    Þ Rewrite, the local gradient, defined as,
     


     

    as
     
     






    but 
     
     


  • Recall that the instantaneous sum of squared error is given as

  • NOTE THAT THE INDEX 'k' is used instead of 'j' as 'j' is used to label the hidden layer neuron.

          CASE II : When j is a hidden neuron

    Using the chain rule,



    Remember that


    Also  q = total no. of inputs


     

     

    COMPUTATION OF SYNAPTIC WEIGHT CHANGE


    COMPUTATION OF SYNAPTIC WEIGHT CHANGE
    j = output neuron
    j = hidden neuron
    BOTH INVOLVE COMPUTATION OF 

     



  • Given that

  • \ maximum weight change for neurons whose function signal is in the midrange for a sigmoidal activation function.

    poss. maximum  OUTPUT NEURONS

    IT IS THIS FEATURE, THAT OF MAXIMAL SYNAPTIC WEIGHT CHANGE FOR NEURONS ACTIVE IN THE MIDRANGE (yj(n)@0.5), CONTRIBUTES MUCH TO ITS STABILITY AS A LEARNING ALGORITHM.
     

  • Remember 

  •  

    CASE I: WHEN NEURON 'j' IS IN THE OUTPUT LAYER

    dj(n) = 


     


    CASE II : WHEN NEURON 'j' IS IN AN ARBITRARY HIDDEN LAYER
     
     



     
     


     
     

    RATE OF LEARNING
     
  • BP algorithm provides an 'approximation' to the trajectory in weight space computed by the method of steepest descent.

  •  
  • The learning rate parameter has considerable influence on how smooth the trajectory is in weigh space and (consequently) how quickly the system completes its training.

  •  
  • FOR SMALL h:

  •  

     
     
     


     

  • FOR LARGE h:

  •  

     


     
     

  • Rumelhart, Hinton and Williams1 1(1986) a modification to the synaptic weight change computations, by adding a term such that the learning rate is accelerated without compromising the stability of the network. this is the so-called GENERALISED DELTA RULE

  • Dwji(n) = aDwji(n-1) + hdj(n)yi(n)
                                                                                          ­
    NEW TERM

    a =momentum constant, a small positive number.
     

  • Now Dwji(n) = a(aDwji(n-2) + hdj(n-1)yi(n-1))

  •  

     

                            + hdj(n)yi(n)

                        = a2Dwji(n-2) + h(adj(n-1)yi(n-1)

                           + dj(n)yi(n))

            Dwji(n) = a2(aDwji(n-3) + hdj(n-2)yi(n-2))

                            + h(adj(n-1)yi(n-1) + dj(n)yi(n)


     



     

  • represents the sum of an exponentially weighted time series and for this series to be convergent .

  •  

     

    For a=0 there is no momentum correction.
     

  • If on two consecutive iterations,  hs the same sign, then wji(t) is adjusted by a large amount. The inclusion of momentum rate tends to accelerate the descent.

  •  

     

    Remember



    However, if on two consecutive iterations  has the opposite  shrinks in magnitude and the momentum factor stabilizes the network by small weight changes.