February 17, 1999
Computational Models of Learning:
The book Exercises in Rethinking Innateness by Kim Plunkett and Jeffrey L. Elman was passed around the class. The supervised learning methods discussed in class, mainly back-propagation and some variants of it, are also discussed in this book. The book is a companion to T-learn, and the computational methods it describes can be illustrated and practiced using T-learn.
Hebbian learning is a model of neural long-term potentiation and long-term depotentiation, which were discussed early in the course. There is lots of evidence that neurons do this kind of coincidence learning. When inputs to a neuron coincide with firing, the synaptic connection is strengthened. But this kind of learning is not very powerful computationally because it has no feedback. It can never correct an error. If inputs to a neuron coincide with firing and strengthen a connection but also cause a disaster, there is no way in this kind of learning to then weaken the bad connection based on the outcome.
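The coincidence rule described above can be sketched in a few lines. This is only an illustration: the function name, the learning rate and the 0/1 coding of inputs and firing are assumptions made for the example, not something taken from the lecture.

```python
def hebbian_update(weights, inputs, fired, rate=0.1):
    """Strengthen every connection whose input was active when the neuron fired.

    Nothing about the outcome of firing enters the rule, so a connection
    that led to a bad result can never be weakened by it.
    """
    return [w + rate * x * fired for w, x in zip(weights, inputs)]


weights = [0.0, 0.0, 0.0]
# Inputs 0 and 2 are active while the neuron fires, three times in a row.
for _ in range(3):
    weights = hebbian_update(weights, [1, 0, 1], fired=1)
```

After the loop the first and third weights have grown and the second is unchanged. There is no term in the update through which an error or a reward could act.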
Reinforcement learning can be viewed as supervised learning where instead of being given specific information about what the system should have done, it is simply given an evaluative measure of goodness or badness of an action. The belief is that this is more biologically plausible, but there are no good computational reinforcement learning models which have been mapped to the biological level. Another issue is the delayed reward. There is often a long chain of actions between some decision and the reward or punishment resulting from that decision. The problem is to get the reward/punishment information back down the chain to the decision.
Recruitment learning is good for language related problems. It is one-shot learning. We have used it a lot and will discuss it in the next lecture.
In unsupervised learning, there are still some criteria for learning; otherwise no learning could occur at all. The criteria are implicit in the system. Often there is a measure of simplicity or similarity. You may try to get a system to reproduce the things it saw. This is used for image compression and similar tasks.
Supervised learning does have explicit feedback. Previously in the course, there was a description of the perceptron in which there are numerous inputs added up and there are weights which can be changed. (See notes for February 18 for a refresher.)
******graphs slide********(see readings)
In these graphs the long line, not marked with a 'w', is the boundary of a decision region: if you formulate the problem right, you would like all of the data to be on one side of the long line. If it's a problem of trying to get the unit to fire if x1, x2 or x3 is input and not fire for other things, then you try to find a position for the weight vector, marked with 'w'. You can prove that, for things that can be done by perceptrons, there is a simple learning rule to move the weights, which can be visualized as adding a new vector to the original vector. After enough training, the weight vector settles in a position such that the line perpendicular to it has all the data on one side. Problems for which this is possible are called linearly separable. Some functions, such as X-or, are not linearly separable. This is for a one-layer perceptron.
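As a rough sketch of the learning rule just described, the code below trains a one-layer perceptron by adding the (signed) input vector to the weight vector whenever the output is wrong. The threshold at zero, the bias input, the epoch count and the two toy data sets are assumptions chosen for the illustration.

```python
def predict(w, x):
    # Fire (1) when the weighted sum is positive, otherwise stay silent (0).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def train_perceptron(samples, epochs=20):
    w = [0.0, 0.0, 0.0]  # bias weight, then one weight per input
    for _ in range(epochs):
        for x, target in samples:
            xb = [1.0] + list(x)  # constant bias input
            error = target - predict(w, xb)
            # Move the weight vector by adding (or subtracting) the input vector.
            w = [wi + error * xi for wi, xi in zip(w, xb)]
    return w


def accuracy(w, samples):
    hits = sum(predict(w, [1.0] + list(x)) == t for x, t in samples)
    return hits / len(samples)


OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

or_accuracy = accuracy(train_perceptron(OR_DATA), OR_DATA)
xor_accuracy = accuracy(train_perceptron(XOR_DATA), XOR_DATA)
```

On the linearly separable OR data the rule settles on weights that classify every case correctly. On X-or no single line works, so the accuracy stays below 1 however long the training runs.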
PDP nets are claimed to be universal approximators; they can approximate any function. This is true or false depending on how it is formulated. If they were truly universal approximators, that would be a big deal for cognitive science because they could be used to model any brain function. But they aren't really universal approximators.
In the 50s people had proved that anything a one-layer perceptron could do, it could learn, which is important. However, there are some very simple things which it can't do. They knew that multi-layer perceptrons could do more, but didn't know how to train them.
********multi-layer slide*********(see readings)
Single-layer perceptrons can learn linearly separable functions. The most general boundary one can learn is a line, a single cut through space. If there are overlapping regions, you can't separate them with a single line. If there are two layers, the first layer can have several separate perceptrons, each of which makes a single line cut. The second layer can combine those so that you can get convex regions, can do X-or, etc. With three layers, the first layer can make multiple single slices; the second layer can combine sets of those; and the third layer can combine the concave or convex regions of the second layer and get an approximation for any-shaped regions.
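The two-layer construction for X-or can be written out by hand. In the sketch below the weights and thresholds are picked for illustration (they are not learned and not from the slide): two first-layer units each make one line cut, and the second layer fires only where both cuts agree.

```python
def step(z):
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    # First layer: two separate single-line cuts through the input space.
    at_least_one = step(x1 + x2 - 0.5)   # fires unless both inputs are 0
    not_both = step(1.5 - x1 - x2)       # fires unless both inputs are 1
    # Second layer: AND of the two half-planes, giving the band between the lines.
    return step(at_least_one + not_both - 1.5)


xor_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

The region between the two parallel cuts is convex, which is the kind of region the text says a second layer can build.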
Another way to look at this is that from old mathematics, we know that we can approximate smooth functions with various kinds of bases: polynomials, Fourier transforms, etc. The sigmoid functions of neural nets can be thought of as a kind of universal basis for approximating curves. This is the sense in which feed-forward neural nets can be thought of as universal approximators. For back-propagation networks, the same applies. There are multiple layers and a sigmoid function is used. With enough layers and sigmoid units, you can represent any (sufficiently simple) smooth function. (See sigmoid function slide from a previous lecture.)
However, there are a number of senses in which it's not true that PDP nets are universal approximators. First, they suffer from local minima. The learning rule will not always find the correct approximation even if it's possible to represent the function with a back-prop net. Second, they can't be inverted. As was seen in Regier's thesis, the representations the back-prop net learned were very weak in that they couldn't be used to draw inferences or to create a scene given a label. It was trained only to choose a linguistic label given a particular scene, and the representation it came up with for accomplishing this task cannot be used to choose a scene given a label or for other tasks. This is a problem in general for back-prop networks. For language learning, also, the back-prop nets take thousands of trials, which is not a good model of language learning.
There are two other computational problems for back-prop nets. The first is shift invariance, a problem which was known to the ancient Greeks. If you see an object which you have never seen before and you only see it at one position on the retina, you can still recognize it when it occurs in other positions. A computational version of this is trying to get a network to recognize some pattern in a string of input, where the pattern can occur at any position in the input. Suppose you want it to recognize (fire the "good" output node for) the pattern 101 in a binary string of 5 digits. This is not a smooth function in the sense above even though it's a perfectly reasonable task.
There are three possible good strings in this example (with zeros in the remaining positions): 10100, 01010, 00101. If the training set includes all three of these strings, the back-prop net can learn to recognize them as good. But the point of these nets is to generalize. If the training set only includes 10100 and 00101, then it has only seen "good" examples with 1 in positions 1 and 3 or 3 and 5 and 0 in positions 2 and 4. When it is tested with 01010, it sees 0 in positions 1, 3, and 5 and 1 in positions 2 and 4, so it will say that 01010 is a bad string. The network can't generalize from the training cases to the test case. This is true for this kind of problem even if the training set includes 99.999% of all the good examples. This is because the generalization necessary for this problem is not tied to specific positions.
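One way to see why position-tied weights have nothing to go on here is to count shared active positions. The sketch below is not a back-prop net; it only measures, position by position, how much the held-out good string has in common with the two training strings. The helper name and the extra "bad" string 10001 are assumptions added for the comparison.

```python
def overlap(a, b):
    """Number of positions at which both strings have a 1."""
    return sum(1 for x, y in zip(a, b) if x == "1" and y == "1")


train_good = ["10100", "00101"]
held_out_good = "01010"
bad_example = "10001"  # hypothetical string that does not contain 101

good_overlaps = [overlap(held_out_good, g) for g in train_good]
bad_overlaps = [overlap(bad_example, g) for g in train_good]
```

The held-out good string shares no active position with either training string, while the bad string shares one with each. Viewed through position-specific weights, 01010 looks less like the training examples than a string that lacks the pattern altogether.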
Even though back prop has this limitation and it has not yet been mapped to the biological level, it is still a very powerful computational tool. It is the best current neural net learning algorithm and is widely applied.
The back-prop method for fixing the shift invariance problem is to link the weights. This is a computational solution and has no biological plausibility. The weights are put into groups and the weights in each group always have the same value, so the weights or set of weights in each layer are constrained to be the same. This kind of weight sharing can be done in T-Learn.
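A minimal sketch of the weight-sharing idea: one group of three weights is reused at every position of the input, so whatever is detected at one position is detected at all of them. The weights and threshold below are set by hand for illustration; in a trained net the shared group would be learned, with every copy kept at the same value. This is not T-Learn's configuration syntax.

```python
SHARED_WEIGHTS = [1, -1, 1]  # the same three weights at every window position
THRESHOLD = 1.5              # only the window 1,0,1 sums to more than this


def detects_101(bits):
    xs = [int(c) for c in bits]
    for start in range(len(xs) - 2):
        window = xs[start:start + 3]
        activation = sum(w * x for w, x in zip(SHARED_WEIGHTS, window))
        if activation > THRESHOLD:
            return True  # some copy of the detector fired
    return False


results = {s: detects_101(s) for s in ["10100", "01010", "00101", "00000", "11100"]}
```

Because the three copies cannot differ, a detector that works for 10100 and 00101 necessarily works for 01010 as well.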
In Terry Regier's system, the shift invariance problem had to be addressed somehow. Regier did use weight sharing for the center-surround cells which were used in the average and maximum overlap feature computations. The weights on those cells were constrained to be the same for the entire map. More generally, Regier used the structured part of the model to compute features from the input scenes which were themselves shift invariant. The relational orientation features, the reference orientation features, and the inclusion and contact features are all shift invariant, so the back-prop net can work with position-independent input. The best current theory is that the same thing happens in nature: learning happens on position-independent features.
The second problem for back-prop nets is dynamics. For a lot of things, you need to model things which change over time. This means working with sequential information, essentially shift invariance in time. This is much of what speech recognition is about. Back-prop nets are inherently static, so you can't model anything over time. We can easily draw a finite state machine to do the 101 recognition task. In speech recognition, Hidden Markov Models are ones in which you try to have a program figure out such a machine. The finite state machine uses context or state to choose the next action, which is something you can't do in back-prop feed-forward nets. A feed-forward net can only generalize for functions which are independent of state. A standard trick for converting a back-prop net into a state machine is to add a set of context nodes.
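The finite state machine for the 101 task mentioned above might look like the following. The state numbering and the table layout are choices made for this sketch; the point is only that the next state depends on the current state as well as on the current input digit.

```python
# State meanings: 0 = nothing matched, 1 = seen "1", 2 = seen "10", 3 = seen "101".
TRANSITIONS = {
    (0, "0"): 0, (0, "1"): 1,
    (1, "0"): 2, (1, "1"): 1,
    (2, "0"): 0, (2, "1"): 3,
    (3, "0"): 3, (3, "1"): 3,  # once the pattern is found, stay in the accept state
}


def fsm_accepts_101(bits):
    state = 0
    for digit in bits:
        state = TRANSITIONS[(state, digit)]
    return state == 3


fsm_results = [fsm_accepts_101(s) for s in ["10100", "01010", "00101", "11001", "00000"]]
```

The machine reads one digit at a time, so it handles the pattern at any position and in strings of any length.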
*******context nodes slide*******(see readings)
In the slide above, there are examples of Elman nets. The (a) example has a standard back-prop net with an added feature. After every time step, the hidden units from the previous time step are copied down and treated as extra input. This gives a context. It is done mechanically. T-learn was made for doing this kind of net. Example (b) is a variation called Jordan nets in which you can copy the output units down as extra context. Or you can have the same context feed in with a discount alpha. Examples (c) and (d) show other ways in which context can be worked into the net. The important thing is that with these nets, you can keep the back-prop mathematics, because all of these backward connections are done with fixed weights and no learning. In training these nets, you only train the forward connections.
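The copy-down mechanism of example (a) can be sketched as a forward pass. Everything numeric here is hypothetical (the layer sizes, the weight values, and starting the context at zero); only the structure follows the description: the hidden units see the current input plus a mechanical copy of the previous hidden values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def elman_step(x, context, w_in, w_ctx):
    """One time step: each hidden unit sums the input and the context units."""
    hidden = []
    for j in range(len(w_in)):
        z = sum(w * xi for w, xi in zip(w_in[j], x))
        z += sum(w * c for w, c in zip(w_ctx[j], context))
        hidden.append(sigmoid(z))
    return hidden


def run(sequence, w_in, w_ctx):
    context = [0.0] * len(w_in)  # assumed starting context
    history = []
    for x in sequence:
        hidden = elman_step(x, context, w_in, w_ctx)
        history.append(hidden)
        context = list(hidden)  # fixed one-to-one copy back; nothing is learned here
    return history


# Hypothetical weights: 1 input unit, 2 hidden units, 2 context units.
W_IN = [[1.0], [-1.0]]
W_CTX = [[0.5, -0.5], [0.3, 0.8]]
history = run([[1.0], [1.0]], W_IN, W_CTX)
```

The same input is presented twice, yet the hidden activations differ between the two steps because the context differs. The weights in W_IN and W_CTX are forward connections and would be trained by ordinary back-prop; the copy step is the fixed backward connection.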
********badiiguuu*********(see readings)
The slide above has an example, used in the Plunkett and Elman book, of a predictor which predicts the next letter in a sequence. The Elman net can do this fairly well. This is intended to give something like the flavor of natural languages. The consonants are always followed by the same vowel and the vowels come in singles, pairs or triples. It is possible to train such a net to be a reasonably good predictor: given a 'u', another 'u' is likely next, but given an 'a', no 'a' is next.
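To get a feel for the regularities the predictor has to pick up, the sketch below generates a sequence of this kind and counts which letter follows which. The mapping b to "ba", d to "dii", g to "guuu" is an assumption read off the slide title "badiiguuu"; the sequence length and random seed are arbitrary. No network is trained here.

```python
import random
from collections import Counter, defaultdict

# Assumed syllable table, inferred from the slide title.
SYLLABLES = {"b": "ba", "d": "dii", "g": "guuu"}


def make_sequence(n_syllables, seed=0):
    rng = random.Random(seed)
    return "".join(SYLLABLES[rng.choice("bdg")] for _ in range(n_syllables))


def next_letter_counts(text):
    """For each letter, count the letters that immediately follow it."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


sample = make_sequence(200)
counts = next_letter_counts(sample)
```

In such a sequence 'b' is always followed by 'a', 'a' is never followed by 'a', and 'u' is followed by another 'u' more often than by anything else. The consonants themselves arrive at random, so no predictor can get them right consistently.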
The main reason this is important is that Elman nets are popular in cognitive science for modeling certain cognitive functions. They are often used for studies because some answer is guaranteed. It takes some work to use them in a way which is insightful and makes non-trivial statements about the phenomenon; it is much easier to use an Elman net to get some results quickly, even if the results are not very informative.
In cognitive science, there are two basic styles of connectionist models: the structured ones, which we have talked about in this course, and the functional ones of the PDP style, which this lecture has been about. In the structured style, there are two types. One is directly neural, actually building a model of, e.g., the motor system at the neural computation level. Then there are conceptual models, which are more like Bailey's model; these models use the techniques of connectionist models but are not claimed to map directly to neurons. Regier's model is a hybrid, doing some structured and some PDP-style computation.
In the PDP models, some are existence proofs in which they show that a learning algorithm can learn some phenomenon which has been claimed to be innate. If the algorithm can learn it, then it cannot be assumed to be innate. The point is to show that you don't have to assume everything is innate. The other thing people do is to use the models to do data fitting. These PDP models are rather good at data-fitting. They say the neural models are neural in character and they fit the data, but they don't say that the way they use the neural elements (nodes) is the way the brain does it.
The hybrid models between functionalist and structured assume that there is some structure in the model and combine that with standard PDP nets.
**********Back propagation in iterative nets.**********(see readings)
The T-Learn program can also do iterative nets. These take a kind of arbitrary connectionist net and apply the standard back-prop algorithm to it by unwinding it in time. This is called back-prop through time. It's weight sharing through time. Take the network in the slide above. Each layer is a time step, from t0 at the bottom to t3 at the top. Weight 1 connects b to a, and it does so at each time step. Suppose that you have the values of a, b and c at the beginning and again after some iterations; then you can use back-prop to solve for the weights that you need in order to do this. The trick is that the weights will have to be the same because it really is the same physical weight. You do back-prop on the extended network and take the average change of a weight over the time steps as the final change of that weight. With an arbitrary net and a certain number of time steps, you can unroll it in time and use the back-prop method for learning. This doesn't work very well in practice: the more layers back-prop has to go through, the less well it works, so it degrades as more iterations are unrolled.
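The unrolling and averaging can be sketched on the smallest possible case: a single unit feeding itself through one weight. The linear unit, the squared-error loss, the learning rate and the numbers are all assumptions made to keep the arithmetic visible; the net on the slide has more units and presumably sigmoid activations.

```python
def bptt_update(w, a0, target, steps, rate):
    # Forward pass through the unrolled net: one layer per time step,
    # every layer using the same physical weight w.
    acts = [a0]
    for _ in range(steps):
        acts.append(w * acts[-1])

    # Backward pass: delta is d(loss)/d(activation) at the current layer.
    delta = acts[-1] - target
    grads = []
    for t in reversed(range(steps)):
        grads.append(delta * acts[t])  # gradient for the copy of w used at step t
        delta = delta * w              # carry the error one layer further back

    # The copies must stay identical, so apply the average of their changes.
    average_grad = sum(grads) / len(grads)
    return w - rate * average_grad


w = 1.0
for _ in range(200):
    w = bptt_update(w, a0=1.0, target=8.0, steps=3, rate=0.01)
```

Starting from 1.0, the weight moves toward 2.0, the value for which three applications turn 1 into 8. In the backward pass the error is multiplied by the weight at every layer, which is one way to see why the signal becomes unreliable as more time steps are unrolled.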