February 10, 1999
5 Models of learning: Hebbian, Reinforcement, Recruitment, Supervised, Unsupervised.
This lecture will focus on recruitment and supervised learning (back propagation, which you've used in T-Learn). Recruitment learning is one-trial learning.
There was a demonstration in which the class was asked if they remember where Jerry Feldman was born. Many people remembered that he was born in Pittsburgh. This is an example of recall, the ability to remember something previously learned without being cued. Recognition is the process of remembering when you're told the answer. How do we do recognition and recall? We know that we don't grow new connections immediately because that takes too long, but something happens when we learn arbitrary new facts. The brain has to be wired so we can capture associations very quickly. We don't know how this works, but there's a computational model for it.
There was a slide about finding the connection. If there is a node A which represents a person and a node B which represents a city, could you have enough links to get an association between them? There are about 1000 outputs from a neuron. There is probably not a direct connection between A and B, but if there are intermediate connections between A and B, there is very likely to be a connection. A connects to 1000 neurons which connect to 1000 more (1,000,000 connections from A). These can connect to 1000 more each (1,000,000,000 connections from A). With this many connections, it is very likely that we can find a connection from A to B. We can compute the probability of finding no link from A to B as a function of the number of intermediate layers. With no intermediate layers, there's a good chance of no connection. With one intermediate layer, there is about a 2/3 chance of finding a connection. With two intermediate layers, there is a 10 to the -440 chance of not finding a connection. Computationally, with random or pseudo-random connections, there's plenty of representational capacity to make arbitrary links.
(Note: The "nodes" or "cells" which represent people and cities (or other entities) are a shorthand for a set of neurons which the brain uses to represent an entity, according to the models we are using.)
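The connection estimate above can be sketched numerically. This is only a rough illustration: it assumes each unit makes its 1000 connections to independently chosen random units, and it assumes a network of 1,000,000 units. That size is not stated in these notes; it is chosen here because it roughly reproduces the quoted figures. The function name and constants are hypothetical.

```python
import math

def log10_prob_no_link(n_units, fan_out, layers):
    """log10 of P(no path from A to B) through `layers` intermediate layers.

    Assumes every unit sends `fan_out` connections to independently chosen
    random units, so one unit misses B with probability 1 - fan_out/n_units.
    """
    frontier = fan_out ** layers          # units that could make the final hop to B
    miss_one = 1.0 - fan_out / n_units    # one frontier unit fails to reach B
    return frontier * math.log10(miss_one)

N_UNITS = 10**6   # assumed network size (not given in the notes)
FAN_OUT = 1000    # outputs per neuron, as in the notes

for layers in range(3):
    log_p = log10_prob_no_link(N_UNITS, FAN_OUT, layers)
    print(f"{layers} intermediate layers: P(no link) = 10^{log_p:.2f}")
```

Under these assumptions, one intermediate layer gives a no-link probability of about 0.37 (so roughly a 2/3 chance of a link), and two intermediate layers give a no-link probability on the order of 10 to the -434, in line with the figure quoted from the slide.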
The question, then, is how does the brain find the right connection? A computational story is recruitment learning. A slide was put up which has two nodes A and B on the left, two nodes a and b on the right, and these are linked to four intermediate nodes, Aa, Ab, Ba, Bb. If we want to learn that A goes with b, and B goes with a, what kind of mechanism will lead to activating the right connections? There will be spreading activation, so if A is active, Aa will be active and a will get some activation. The same is true of all the other links, and activation spreads in both directions. There will be mutual inhibition. Aa inhibits Ba, Ab inhibits Bb and vice versa. And Aa inhibits Ab, Ba inhibits Bb and vice versa. Spreading activation and mutual inhibition are computational primitives which we know occur in the brain. The intermediate node has a threshold such that if it is activated from both directions, it will be highly activated. So if A and b are active, they will activate Ab, which will inhibit Aa and Bb. In the model, this causes long term potentiation of the links between Ab and A and between Ab and b. This is recruiting Ab, so that if A alone is activated, activation will spread through Ab and activate b. The change in Ab which allows this kind of association is first electrical (short term memory), then chemical (intermediate term memory). It can become permanent over a long time (see the 10 steps to long-term change in the last lecture). There are parts of the brain, including the hippocampus, which are dedicated to making associations between other nodes.
This kind of structure is similar to triangle nodes which we have seen before. The difference is that the relationship between the three nodes depends on the state of the nodes at the time of activation. A qualitative model was presented of how the recruitment process might go. If the middle node is idle and it gets activation from one end, it will have low activation. If it gets activation from both ends (coincidental activation), it will have high activation and keep some residual activation, so that if it then gets activation from only one end, it will get high activation and in turn activate the node on the other end. This is how it gets recruited to serve as the link between A and b, etc.
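The qualitative model above can be sketched as a tiny state machine. This is a minimal illustration, not the model from the slide: the class name, the activation values 0.3 and 1.0, and the use of a permanent flag for the residual change are all assumptions, and mutual inhibition between competing intermediate nodes is left out.

```python
class LinkNode:
    """Intermediate node between two concept nodes, e.g. Ab between A and b."""

    LOW = 0.3    # illustrative activation when driven from one end only
    HIGH = 1.0   # illustrative activation once the node fires strongly

    def __init__(self):
        self.recruited = False   # stands in for the residual change

    def step(self, left_active, right_active):
        """Return the node's activation for the given inputs."""
        n_inputs = int(left_active) + int(right_active)
        if n_inputs == 2:
            # Coincident activation: fire strongly and keep a residual change.
            self.recruited = True
            return self.HIGH
        if n_inputs == 1:
            # A recruited node fires strongly from one end alone, so it can
            # pass activation on to the node at the other end.
            return self.HIGH if self.recruited else self.LOW
        return 0.0

node = LinkNode()
print(node.step(True, False))   # idle node, one input: low activation
print(node.step(True, True))    # both inputs: high activation, recruited
print(node.step(True, False))   # one input after recruitment: high activation
```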
This model doesn't talk about different strengths of connections, and there is a technical problem in that it has a unit for each combination Aa, Ab, etc. As we have seen before, this kind of representation would require more neurons than the brain has. Solutions to this cross-product problem will be presented later in the course.
***blue/green story slide here ***
There is a related type of recruitment learning which is actually more important for our purposes. If we want to learn concepts, we need more than just random layers. For instance, if you want to learn a concept which involves features such as color and shape, you need links between the notion of color and various colors and between the notion of shape and particular shapes. The slide above illustrates this kind of learning. For instance, if your sister gets a new Frisbee, you can instantly form a concept of it, including its shape and color. These concepts seem to be gathered at a central node. In the middle box, the upper node is the concept node. It is connected to at least one triangle node in boxes X and Y. If we see the Frisbee and it is blue and round, we want to recruit a node in the middle as the concept of your sister's Frisbee, and we want the features blue and round to be connected to that concept. The representational story is that you have lots of these nodes in your brain which aren't representing anything in particular. These are connected in such a way that if you recruit the right one you can get nice properties. A concept node in the middle is connected to a node in X which is connected to has-color and blue, and to a node in Y which is connected to has-shape and round. We want to strengthen the right connections and lose the unhelpful ones. So if the node in X also has a link to green, then we want to lose that connection. Connections do get cut off in the brain. So the representational power is there in a random network. Computationally, if you assume that the nodes that get multiple activations are more excited, then you can have a general signal that says, for learning a new concept, each node with high activation should strengthen its active connections and weaken its inactive connections.
So afterwards, the connections which were activated by blue and round in the Frisbee concept are stronger and better tuned to the middle concept node, and when the Frisbee concept is activated later, it is more likely to activate the triangle nodes for blue/has-color and round/has-shape. This shows how recruitment learning can be used for pairing features and also for learning concepts with multiple features.
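The general learning signal described above (highly activated nodes strengthen their active connections and weaken their inactive ones) can be sketched as follows. This is only an illustration under stated assumptions: the node names, the initial weights of 0.5, the update formulas, and the rule that "highly activated" means at least two active connections are all hypothetical choices, not part of the lecture's model.

```python
def recruit(weights, active, rate=0.5, min_active=2):
    """Apply one 'learn a new concept' signal.

    weights: {node: {neighbor: weight}} with weights in [0, 1]
    active:  set of names that are active while the concept is learned
    Only nodes with at least `min_active` active connections change.
    """
    updated = {}
    for node, links in weights.items():
        n_active = sum(1 for name in links if name in active)
        if n_active < min_active:
            updated[node] = dict(links)   # not excited enough: unchanged
            continue
        updated[node] = {
            name: (w + rate * (1.0 - w)) if name in active else (w * (1.0 - rate))
            for name, w in links.items()
        }
    return updated

# Hypothetical triangle nodes: X1 in box X, Y1 in box Y, Z1 an unrelated node.
weights = {
    "X1": {"concept": 0.5, "has-color": 0.5, "blue": 0.5, "green": 0.5},
    "Y1": {"concept": 0.5, "has-shape": 0.5, "round": 0.5},
    "Z1": {"green": 0.5, "square": 0.5},
}
active = {"concept", "has-color", "blue", "has-shape", "round"}

for node, links in recruit(weights, active).items():
    print(node, links)
```

In this sketch the link from X1 to blue is strengthened, the link from X1 to green is weakened, and Z1, which received no activation, is left alone.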
It's okay to have one node for each concept because we have far fewer than 100 billion concepts (the average educated person knows 100,000 words, for instance). Even a redundant representation with 100 nodes per concept is fine. We don't have the representational capacity to have a node for every pairing of concepts; that's why we need these kinds of systems.
Supervised learning is not used for immediate learning, linking or concept formation; it is used for slight adjustment of connection weights. Supervision is necessary for the system to know how to correct the weights. There was a slide of the original perceptron as analyzed by Minsky and Papert, which is also in the reader (Reading 10, McClelland and Rumelhart, p. 122). The detectors detect features on an image, and a single unit would produce a 1 if the weighted input were greater than some threshold and 0 otherwise. This could learn some patterns and not others. The threshold is the same as the negative of the bias, as in T-learn. The bias is normally used today, because it is a weight that can be changed like the weights on other links. The perceptron has a learning rule which would work for anything the perceptron can compute. The learning rule uses the error or delta, which is the target output (t) minus the actual output (y). You change a weight depending on whether the answer was correct and whether the input node fired. If the input node was zero, it doesn't matter what its weight was. But if an input was active, then its weight does matter. If t-y is positive, the weight of active units should be increased and the threshold should be decreased. If t-y is negative, the threshold should be increased and the weight of active nodes should be decreased. So the change rule is: the change in wj = (t-y)yj, where yj is the activity of input node j and (t-y) is delta. In fancier networks, delta and the change rule get more complicated, but the basic idea is the same. If the output is wrong, look at the error and the units which contributed to the error and change their weights in the opposite direction: higher if the output is too low, lower if the output is too high. A learning rate isn't used for the perceptron because it uses integers.
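The perceptron change rule above can be sketched in a few lines. This is a minimal illustration, assuming binary inputs, integer weights starting at zero, and a bias in place of the threshold (bias = -threshold); the function names and the choice of logical AND as the training task are hypothetical.

```python
def predict(weights, bias, x):
    """Output 1 if the weighted input plus bias is above zero, else 0."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0

def train(samples, n_inputs, max_epochs=100):
    """Perceptron rule: change in w_j = (t - y) * y_j, no learning rate."""
    weights = [0] * n_inputs
    bias = 0                                  # bias = -threshold
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in samples:
            delta = t - predict(weights, bias, x)   # +1, 0 or -1
            if delta != 0:
                mistakes += 1
                # Only active inputs (x_j = 1) have their weights changed.
                weights = [w + delta * xj for w, xj in zip(weights, x)]
                bias += delta                 # same as moving the threshold by -delta
        if mistakes == 0:
            break                             # every sample is classified correctly
    return weights, bias

# Logical AND is one of the patterns a single unit can learn.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_samples, n_inputs=2)
print(weights, bias)
```

Because the inputs, targets and updates are all integers, the weights stay integers throughout, which is why no learning rate is needed in this version.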
The linear version is also too simple to work in the models we'll see, but the math of the learning rule is demonstrated using this version. In this case, we have targets, outputs, errors and weights. How do we change the weights to reduce the error? A slide was put up which had a simple network with two outputs and three inputs. For the continuous case, you need partial derivatives because we're interested in how the error changes with respect to the weights. The error formula (E) is 1/2 times the sum of the squared errors. The partial derivative of E with respect to a particular weight, wij, is the partial derivative of E with respect to the output yi times the partial derivative of yi with respect to wij (by the chain rule). This gives us (1/2)(-2)(ti-yi)yj = -delta*yj, where delta = ti-yi. The change rule involves changing the weight in the opposite direction, so we remove the negative sign and multiply delta*yj by the learning rate epsilon. These formulas are given in the McClelland and Rumelhart reading.
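Written out, the derivation above looks as follows. This is a sketch that assumes the linear output unit computes a plain weighted sum of its inputs, with wij taken as the weight from input j to output i; the notation otherwise follows the notes.

```latex
\begin{align*}
E &= \tfrac{1}{2}\sum_i (t_i - y_i)^2, \qquad y_i = \sum_j w_{ij}\, y_j \\
\frac{\partial E}{\partial w_{ij}}
  &= \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial w_{ij}}
   = \tfrac{1}{2}\,(-2)\,(t_i - y_i)\, y_j
   = -\delta_i\, y_j, \qquad \delta_i = t_i - y_i \\
\Delta w_{ij} &= -\epsilon\,\frac{\partial E}{\partial w_{ij}} = \epsilon\,\delta_i\, y_j
\end{align*}
```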
Back propagation in the sigmoid case with hidden layers is more complicated. A handout with the derivations for a single output node was given out. The basic idea is still the same, and the final formula is basically the same. There are two differences: (1) we have a more complex output function (the sigmoid function), because we need a function which is bounded and with which we can get more computational power by adding layers, which isn't possible with a linear output function; and (2) using multiple layers requires computing the delta for internal nodes. But the formula at the end is actually of the same general form.
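The two differences can be seen in a small sketch for a single output node with one hidden layer. This is not the derivation from the handout; it is a minimal illustration that assumes the squared error E = 0.5*(t - y)^2, sigmoid units throughout, and no bias weights (a constant input of 1 can play that role). All function and variable names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Return (hidden activations, output) for one sigmoid output unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, y

def error(x, t, w_hidden, w_out):
    _, y = forward(x, w_hidden, w_out)
    return 0.5 * (t - y) ** 2

def gradients(x, t, w_hidden, w_out):
    """Back-propagated dE/dw for the hidden and output weights."""
    hidden, y = forward(x, w_hidden, w_out)
    # Output delta: the error times the slope of the sigmoid, y * (1 - y).
    delta_out = (t - y) * y * (1.0 - y)
    grad_out = [-delta_out * h for h in hidden]
    # Hidden deltas: the output delta passed back through each output weight,
    # times the slope of that hidden unit's sigmoid.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[-d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out

x, t = [1.0, 0.5, -0.3], 1.0
w_hidden = [[0.1, -0.2, 0.4], [0.3, 0.25, -0.5]]
w_out = [0.7, -0.6]
grad_hidden, grad_out = gradients(x, t, w_hidden, w_out)
print(grad_hidden, grad_out)
```

Each gradient still has the form -delta times the activity feeding the weight, so the change rule (learning rate times delta times input activity) keeps the same general shape as in the linear case; only the way delta is computed changes.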
Back propagation is a local search, which can get caught in local minima. The current best way to solve this problem is by starting in lots of different places and taking the best result. In T-learn you can also set a momentum for the change rule, which does not help the local minimum problem, but does cause the current change to take into account the previous change to some degree. This keeps the network from jumping up the sides of a trough in weight space. It doesn't work well if it is set too high, such as at 1, but values from 0 to 0.5 are okay to try.
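The effect of momentum can be sketched on a toy error surface. This is an illustration only: the quadratic "trough" E = 0.5*(w0^2 + 25*w1^2), the learning rate of 0.05, and the function names are assumptions, and this is not a description of how T-learn implements momentum.

```python
def trough_gradient(w):
    """Gradient of the assumed error E = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def descend(grad, w, rate, momentum, steps):
    """Gradient descent where each change adds a fraction of the previous one."""
    change = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        change = [-rate * gi + momentum * ci for gi, ci in zip(g, change)]
        w = [wi + ci for wi, ci in zip(w, change)]
    return w

start = [1.0, 1.0]
plain = descend(trough_gradient, start, rate=0.05, momentum=0.0, steps=100)
with_momentum = descend(trough_gradient, start, rate=0.05, momentum=0.5, steps=100)
print("no momentum: ", plain)
print("momentum 0.5:", with_momentum)
```

In this toy case, a momentum of 0.5 brings the weights closer to the bottom of the trough along its shallow direction in the same number of steps. With a momentum of 1 the previous change is never damped, so on this surface the weights keep oscillating instead of settling.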
The point of supervised learning is not necessarily that the brain learns this way, but that a connectionist system like the brain can learn things like grammar and concepts, so these don't need to be innate.
In reality, back prop networks are used for predicting the stock market, currency exchange rates, speech recognition, etc. A slide was put up which shows the flow chart of a speech recognition system which is being developed at ICSI. One part of the system uses a very big connectionist net which is trained over and over to learn how to recognize the most probable phone given some processed speech signal and neighboring inputs. These models are not terribly relevant to cognitive modeling, but they are being used in many practical situations.
Also, Regier's model uses back propagation, and we will discuss that in the next lectures.