Lecture 6. PDP and Structured Connectionist Models. Learning concepts. Recruitment learning.  

February 8, 1999

This lecture will focus on (1) designing networks for various tasks and (2) learning. Less is known about learning than about the other brain processes we have discussed, such as vision and motor control.

There was a slide showing a cartoon of a neuron with several sites for accumulating inputs and three different methods of determining the internal activity of the neuron. When we design connectionist networks, we sometimes need a unit function more complex than the algebraic sum of the inputs, so there are other ways to calculate the input to a neuron. Biologically this makes sense because whether a neuron fires is decided not by the sum over the entire dendritic tree, but by relatively local computations. One equation relates the potential (internal level of activity) of the neuron to the maximum of pairs of inputs. In another equation, the potential is the disjunction of conjunctions of pairs (P <- OR(I1 AND I2, I3 AND I4)). Another form is used particularly for back propagation, when you want this function to be differentiable. It is the sum of the products of sets of inputs (P <- (I1 x I2) + (I3 x I4)). In the PDP literature, these are called Sigma Pi units, sigma for sum and pi for product.
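The three unit functions described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, each "site" is assumed to hold one pair of inputs, and the maximum form is assumed to take the largest site total (one possible reading of "the maximum of pairs of inputs").

```python
import math

def max_of_site_sums(sites):
    """Potential = largest site total. Assumes each site sums its own inputs."""
    return max(sum(site) for site in sites)

def or_of_ands(sites):
    """Potential = OR over sites of the AND of that site's inputs (boolean inputs)."""
    return any(all(site) for site in sites)

def sigma_pi(sites):
    """Potential = sum over sites of the product of that site's inputs."""
    return sum(math.prod(site) for site in sites)

# Two sites with two inputs each, i.e. (I1, I2) and (I3, I4).
print(max_of_site_sums([(0.2, 0.3), (0.9, 0.0)]))
print(or_of_ands([(True, False), (True, True)]))
print(sigma_pi([(1.0, 0.5), (0.2, 0.0)]))
```

Only the sigma-pi form is differentiable in its inputs, which is why the notes single it out for back propagation.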

The next slide has an example of a model which uses these kinds of units, which don't just add up their inputs. The slide has a toy network designed to simulate size constancy in the visual system. Size constancy is the phenomenon that we perceive objects as having the same size even though objects which are far away have a much smaller size on the retina. The three elements of the system are the retinal size, the depth and the computed physical size. There's a three-way functional relationship between these elements. The brain adjusts for the depth in figuring out the physical size. There are mechanisms which estimate the retinal size and the depth. You can also use knowledge of the physical size of objects to figure out the depth; for instance, if there is a person in a scene, your knowledge of the size of people will help you figure out the depth of the person and of objects in the scene which are related to the person. In the model, there are units for depth, retinal size and physical size. The network has two-out-of-three nodes, so that if the perceived depth is one and the retinal size is two, the physical size is two. And if the perceived depth is two and the retinal size is two, the physical size is four. There's mutual inhibition, so if the depth is one, it can't be two. The physical size computation depends on a particular pairing of retinal size and depth, so the connections from retinal size units and depth units are located at particular sites on the physical size units, so that the local computation of inputs as given in the equations above can be performed.

***red circle blue square slide***

Another slide illustrates the problem of distinguishing colored shapes such as a red circle and a blue square without getting confused. The slide has a model which uses the idea of conjunctive connections to solve the problem. There is a unit for "red circle" which has a connection site which corresponds to a particular position in the scene. At one position, such as 7, a unit designed to detect the color red fires. Another unit designed to detect a circle also fires for position 7. These units are linked to the position 7 site on the red circle unit, so the red circle unit fires. A similar story goes on for the blue square. The model doesn't activate blue circle, because the circle unit for position 7 fires and is connected to the position 7 site on blue circle, while the blue unit for position 11 fires and is connected to the position 11 site on blue circle. Because the circle unit and the blue unit are not connected to the same site, blue circle doesn't fire even though they both send activation to it. Also, red at 7 inhibits blue at 7, so again the blue circle node can't fire. The coincidence of features activating the same node at the same position should be thought of as electrical; the inputs don't necessarily have physical proximity.
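A minimal sketch of the conjunctive-site idea follows. It assumes that a conjunction unit has one site per position and fires only when both of its features arrive at the same site. The function name and the scene encoding are hypothetical, and the mutual inhibition between colors at one position is left out.

```python
def conjunction_fires(color, shape, active):
    """active is a set of (feature, position) pairs from the feature detectors.
    The unit fires only if some position site receives both features."""
    positions = {pos for _, pos in active}
    return any((color, p) in active and (shape, p) in active for p in positions)

# Red circle at position 7, blue square at position 11.
scene = {("red", 7), ("circle", 7), ("blue", 11), ("square", 11)}

print(conjunction_fires("red", "circle", scene))   # both features reach site 7
print(conjunction_fires("blue", "circle", scene))  # features reach different sites
```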

Why can't you just have a neuron to represent every combination in every position? In the brain, there are about 10 to the 11th neurons and 10 to the 6th in the optic nerve. So if you wanted to have a neuron for every pair of shape and color, to have all pairs of a million items, you would need 10 to the 12th, which is more neurons than you have.

***triangle node slide***

Triangle nodes are a notational convention for representing the piece of circuitry shown in the slide. The three elements of a triangle node have the property that if any two are activated, the third will also be activated. A lot of computations have this character, in which three things are interdependent. On the right is a model of how neurons might do this. The small units are for the three elements, A, B and C. The large unit has the requirement that if two of the three input nodes are active, it sends input to all three of the small units, A, B and C. This is a common kind of computation, so we have the triangle node convention for it.
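The two-out-of-three behavior can be sketched as a single update step. This is an illustrative simplification with boolean activations; the function name is hypothetical and the central unit is modeled as a simple count threshold.

```python
def triangle_node(a, b, c):
    """One update step: the central unit fires when at least two of A, B, C
    are active, and then sends activation back to all three."""
    center_fires = (a + b + c) >= 2  # booleans count as 0 or 1
    return (a or center_fires, b or center_fires, c or center_fires)

print(triangle_node(True, True, False))   # two active, so the third is filled in
print(triangle_node(True, False, False))  # one active, nothing changes
```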

***ham and peas slide***

A slide was put up which gave an example of the triangle nodes in use, a simplified model of a routine for ordering wine depending on the main course. There is a routine for ordering the wine, which has a winner-take-all network. The model will order red wine when the food is salty. The other part of the network is a memory network which is used for deciding whether or not a particular food is salty. The notion is that foods have properties; for example, ham has a taste, which is salty, and a color, which is pink. The memory network represents a little piece of knowledge. The knowledge is represented using triangle nodes. We would like to be able to say that the taste of ham is salty, so if ham and 'has taste' are activated, salty is activated. In the process of ordering, we choose ham, so a binding mechanism binds ham to main course and ham is activated. The ordering wine routine requires knowing the taste of the main course, so that activates 'has taste'. When ham and 'has taste' are activated, salty is activated. Salty is connected to the winner-take-all network, which leads to the action of ordering red wine.

Because this is a spreading activation system, a lot of things are activated. How does a decision get made? There may be some activation of sweet from other things, but if a triangle node activates salty, it will have stronger and quicker activation than sweet, so it will win in the winner-take-all network and inhibit sweet, so that the decision can be made. In the routine, where a decision has to be made, there is competition, but in the memory network there will be residual activation of things that are related. We've seen the priming effects which are the result of this spreading activation. The weights on the connections may also differ. For instance, the link from ham to salty is probably much stronger than the link from salty to ham.
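The competition between salty and sweet can be sketched as a small winner-take-all loop. The starting activations, the inhibition strength and the number of steps are all hypothetical values chosen for illustration; they are not from the slide.

```python
def winner_take_all(activations, inhibition=0.2, steps=20):
    """Each unit is suppressed in proportion to the total activity of its rivals.
    Activations are clipped at zero, so weaker units are eventually silenced."""
    acts = dict(activations)
    for _ in range(steps):
        total = sum(acts.values())
        acts = {name: max(0.0, a - inhibition * (total - a))
                for name, a in acts.items()}
    return acts

# Salty gets strong input from the triangle node; sweet only has residual activation.
result = winner_take_all({"salty": 0.9, "sweet": 0.4})
winner = max(result, key=result.get)
print(winner, result)
```

In this sketch the stronger unit loses a little activation while it suppresses its rival, then holds steady once the rival is silent.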

This network is simplified, so salty gives enough evidence to activate ham, but in more realistic, more complicated networks, salty would only give evidence that the food was ham. The architecture for such a node would still use triangle nodes, but it would be much more complicated.

The triangle nodes could have more than three elements, but in mathematics, there is a concept of binary relations R(x, y), where the relation is for example 'has taste' and the x, y are ham, salty. We know very little about how relational information like this is represented in the brain. There's no reason why you couldn't have relations with more elements, but, from a psychological point of view, an awful lot of knowledge does seem to be primarily represented in binary relations, such as hierarchies. There're also lots of binary relations in language.

The issue of local and distributed representations comes up a lot in connectionist modeling. A slide was shown in which the names of the Beatles are given local and distributed representations. In the local representation, there are four units, one unit for each Beatle. In the distributed representation, there are still four units, but the names are represented as a pattern of activation over the four units. With only four items to encode, each representation seems to work well, but if there were 1000 items, the localist representation would require 1000 units. A distributed representation requires about log base 2 of 1000 units, or 10 units. So the distributed representation is much more efficient. If we had a million items to represent, the localist version would need 1 million units and the distributed would need 20. The distributed representation also degrades more gracefully. But the distributed representation can't represent a set of items, because two different sets might result in the same pattern of activation, so there would be cross-talk and confusion.
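The unit counts and the cross-talk problem can be checked with a short sketch. It assumes the densest possible distributed code, a plain binary code, and represents a set of items by switching on every unit that any member uses. The function names are hypothetical.

```python
def units_needed(n_items):
    """Return (localist units, binary distributed units) for n_items."""
    local = n_items
    distributed = (n_items - 1).bit_length()  # smallest k with 2**k >= n_items
    return local, distributed

def superimpose(items):
    """Pattern for a set of items: the union (bitwise OR) of their binary codes."""
    pattern = 0
    for item in items:
        pattern |= item
    return pattern

print(units_needed(1000))
print(units_needed(10**6))

# Cross-talk: items 1 (binary 01) and 2 (binary 10) together look like item 3 (binary 11).
print(superimpose({1, 2}) == superimpose({3}))
```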

The brain uses a kind of coding which is a compromise between localist (punctate) and fully distributed (holographic) representation. This is called coarse coding. A slide was put up which showed a cross-section of the retina. The information collected by receptors is passed to the brain only through the ganglion cells, of which there are 100 times fewer than receptors. Each ganglion cell sends an aggregate signal from more than 100 receptors, so the sets of receptors giving input to different ganglion cells overlap. The result of this is that information is coarse coded. This is explained computationally in the next slide.

*** coarse coding slide 1st half***

If you wanted to represent a visual map with a resolution of one square, you could have a coarse coding representation, using cells which respond to areas nine times the size of the small square. The receptive areas of the large cells overlap, so that if three large squares are active, they represent the small area at which they all overlap. This allows you to encode the information with fewer cells. They are larger, but you only need three of them. The coarse-coding approach is now the approach taken by some biologists and psychologists in their research. Feldman's theory is that finer-grained discrimination develops when task demands cause more overlap in the coding. In the brain, redundancy is built in; there are more neurons than minimally needed, so if one is lost, there is no serious loss of information. If five neurons are used where three are minimally adequate, there is only a very small chance that more than 2 will be lost naturally (not due to lesions).
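The overlap argument can be sketched with sets of grid squares. This assumes 3-by-3 receptive fields on a hypothetical grid; the field positions are chosen for illustration so that exactly one small square lies in all three fields.

```python
def field(x0, y0, size=3):
    """Set of small squares covered by a size-by-size receptive field
    whose corner square is (x0, y0)."""
    return {(x, y) for x in range(x0, x0 + size) for y in range(y0, y0 + size)}

def decode(active_fields):
    """The encoded location is where all active receptive fields overlap."""
    return set.intersection(*active_fields)

# Three coarse cells, each covering nine small squares.
active = [field(0, 0), field(2, 0), field(0, 2)]
print(decode(active))
```

Each cell alone is ambiguous over nine squares, but the three together pin down the single square they share.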

More than coarse coding, the brain uses coarse/fine coding. There was a slide which showed that different parts of the brain are more and less sensitive to different kinds of stimulus. For instance, the dorsal medial area is most sensitive to orientation tuning and not very sensitive to dimensional selectivity or directionality index. The dorsal lateral area is more responsive to dimensional selectivity than orientation tuning. The visual system does this overlapping, but the individual cells may have a finer-grained response to one stimulus and a coarser-grained response to another.

If the visual system had a unit to encode every combination of 10 values of orientation, direction of motion, speed, size, color and depth (which is considerably less than what the visual system can do), you would need a million units, and if you did that at each of the million locations in the optic nerve, you would again need 10 to the 12th units, which is more than we have in the brain.
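The count in this paragraph can be worked through directly. The figures are the ones given in these notes (10 values on each of six dimensions, a million locations, about 10 to the 11th neurons); only the variable names are introduced here.

```python
dimensions = ["orientation", "direction of motion", "speed", "size", "color", "depth"]
values_per_dimension = 10

units_per_location = values_per_dimension ** len(dimensions)  # 10**6 combinations
locations = 10**6                                             # locations in the optic nerve
total_units = units_per_location * locations                  # 10**12

neurons_in_brain = 10**11
print(total_units, total_units > neurons_in_brain)
```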

*** coarse-fine coding slide***

Some cells are finely tuned for motion direction but broadly responsive to orientation, and some cells are finely tuned to orientation and broadly responsive to motion direction. As the graph shows, the overlapping of cells codes the fine-grained discrimination in both dimensions. In the calculation with K dimensions and N steps/dimension, instead of needing N to the K units, you can reduce by a factor of D (how much coarser the response of the cells is in one dimension). So if one unit is D times coarser you only need N to the (K-1) units. So if you have 100 steps and five dimensions and D = 5, then coarse/fine coding only requires 10 to the 7th units per position, which is a feasible number, whereas localist representations require 10 to the 10th, which is unfeasible.

A problem with coarse-fine coding is that if there are two sets of overlapping cells, you can get ghosts. There are some inputs which will activate extra representations. For instance, if the representation for the size and orientation of Y is close to the representation of the size and orientation of X, a ghost which is the representation for the size of X and the orientation of Y might also be activated. Psychologically, there is a phenomenon like this called illusory conjunctions. If you quickly flash a lot of Qs and Ps in random positions, people watching will report seeing Qs and Ps and Rs. People don't have time to sort out the different letters, so the ghost representation of an R (a P with the tail of the Q) is perceived.
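A rough sketch of how ghosts arise: it assumes, as a simplification, that size and orientation are carried by separate populations and that nothing records which size goes with which orientation. The feature values and the function name are hypothetical.

```python
from itertools import product

def readout(objects):
    """objects is a list of (size, orientation) pairs actually present.
    Returns every conjunction consistent with the active features,
    plus the subset that are ghosts (not actually present)."""
    sizes = {size for size, _ in objects}
    orientations = {orientation for _, orientation in objects}
    candidates = set(product(sizes, orientations))
    ghosts = candidates - set(objects)
    return candidates, ghosts

# X is small and vertical, Y is large and horizontal.
candidates, ghosts = readout([("small", "vertical"), ("large", "horizontal")])
print(sorted(ghosts))
```

The ghosts here play the role of the illusory R, which combines a part of one letter with a part of another.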

For most purposes, you can ignore the overlapping in models and have units which refer to the collection of activation which represents a particular feature, but that is a decision based on what you want to study in the model.

Learning
Much less is known about learning in the brain than about the visual and auditory systems. There are three different kinds of learning and memory. Short term memory is actually electrical: activation, as seen in the Necker cube, etc. We know this because it's so fast that it can't be anything else but an activation circuit with mutually excitatory connections. This is non-controversial. Permanent or long-term memory is a structural change in your brain. There are changes in synaptic strength. It requires protein synthesis and takes hours to occur. Retrograde amnesia is a kind of evidence which supports the difference between short-term and permanent memory. In between short-term and long-term, there are some changes which are internal to the neuron but don't require protein synthesis. This is an intermediate memory. Long term potentiation (LTP) is an example of this. In LTP a neuron is activated in such a way that it becomes more sensitive to the same kind of firing for a long while (weeks). This is not the same as permanent learning, but these types of memories occur in sequences. The opposite of LTP is LTD, long term depression. These three types should be considered stages of learning and memory which occur all over the brain. They don't occur only in particular parts of the brain.

There was a slide showing a diagram of a neuron and explaining 10 stages which the neuron goes through from short term to permanent memory. The stages are (1) changes in the permeability of ion channels, which happen relatively quickly; (2) modification of the vesicular transmitter release, an internal chemical change; (3) autocatalytic formation of further second messenger molecules, in which chemical reactions release molecules which cause other reactions, even involving cell RNA; (4) activation of regulatory factors (5) which intervene in gene regulation, so that the way the genes are expressed is controlled by learning; (6) synthesis of new effector and regulator proteins, initiating more long-term processes, including mobilization of synaptic vesicles; (7) formation of further, new active zones; (8) insertion of new ion channels into the membrane; (9) morphological expansion of the synaptic terminal; (10) long-term regulation of differentiation genes, which affects the behavior of the cell. This doesn't include short term memory, which doesn't usually count as learning if nothing other than the electrical processes goes on.

There are five kinds of memory modeling. Hebbian learning is a favorite of biologists. Hebb's idea is coincidence learning: if a cell fires and the postsynaptic cell is also active, the synapse gets stronger. This is a biological fact, but it isn't the whole story, because if the coincidence is bad for the organism, there needs to be some sort of feedback to keep the same firing from occurring again. This kind of feedback is involved in reinforcement learning. This is an interest of conventional AI modelers, but the interesting case is delayed reward. For example, in a game of chess, the reward is not given until the end of a long chain of moves; how do you decide which moves were right? Recruitment learning is intended to model one-trial learning. For instance, if Jerry tells us where he was born, we all know it suddenly after being told just once. This is a kind of learning which we all do, and it seems instantaneous. Skill learning, however, takes a very long time. Back prop is a kind of supervised learning; for each step there is a coach giving feedback. This kind of learning should be faster than reinforcement learning, but it's not very biological. A final kind of learning is unsupervised learning, which concerns how the brain categorizes information when it is not clear how that information may be learned. Data is organized based on similarity. For instance, if you have to organize pieces of carpet, you may do it by similarity in color, size, texture, etc.

***ltp slide***

LTP and LTD occur mainly in the hippocampus, which is essential for forming new memories, but not necessarily used for older memories. The upper graph on the slide shows first the control amount of activity a cell has for a given amount of input, then the amount of increased activity after tetanus is introduced to the cell. The amount of firing for the same stimulus goes up and stays up for a couple of hours. This is long term potentiation. The lower graph is a model for long term potentiation. The connections with smaller triangles are weaker than the connections with larger triangles. When the weaker connections are active, it is not enough to make the cell go into a tetanic state, but when the stronger connections are active, the cell goes into the tetanic state, and afterwards either the strong or the weak connections can make the cell give the higher response, so LTP affects all the connections of the cell.

There was another slide which showed that if the presynaptic cell is active and the postsynaptic cell is inactive, there is no change. If the presynaptic cell is inactive and the postsynaptic cell is active, the connection is weakened. This is the depression. If the pre- and postsynaptic cells are both active, the connection is strengthened. But there are also coincidences of firing which can get encoded. If inputs 1 and 2 are both active and the postsynaptic cell is active, then both synapses are strengthened. If one input is active and the other is not, there is competition, so that one gets more active and the other gets less active. So things which coincide mutually strengthen each other, but if two things don't occur together, one gets weakened. This is how the fine-tuning of development occurs. There still has to be some feedback mechanism as well.
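The three cases on the slide can be written as a simple weight-update rule. This is a minimal sketch: the step size `rate` and the starting weights are hypothetical, activity is treated as boolean, and the feedback mechanism mentioned above is not modeled.

```python
def update_weight(w, pre_active, post_active, rate=0.1):
    """Apply the three cases: pre and post both active strengthens,
    pre inactive with post active weakens, otherwise no change."""
    if pre_active and post_active:
        return w + rate
    if post_active and not pre_active:
        return w - rate
    return w

def update_synapses(weights, inputs_active, post_active, rate=0.1):
    """Update every synapse onto one postsynaptic cell."""
    return [update_weight(w, pre, post_active, rate)
            for w, pre in zip(weights, inputs_active)]

# Inputs 1 and 2 fire together with the postsynaptic cell: both synapses strengthen.
print(update_synapses([0.5, 0.5], [True, True], post_active=True))
# Only input 1 fires with the postsynaptic cell: synapse 1 strengthens, synapse 2 weakens.
print(update_synapses([0.5, 0.5], [True, False], post_active=True))
```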