Lectures 14-15. Regier's model for word learning. 
 

March 8, 10, 1999

Terry Regier, the author of the text The Human Semantic Potential (HSP), was a student at Berkeley. While taking CogSci 101, he thought it would be possible to compute image schemas neurally. In the container schema, for instance, the boundary can be of any shape. How can we recognize these all as containers? Ramachandran of San Diego had done some experiments in which subjects were shown a green disk with a yellow dot. When the dot was placed so that it fell in the subject's blind spot, the subject saw an entirely green disk. The experimenters hypothesized that this is possible because we have a filling-in algorithm which goes from outside to inside. If we have a filling-in algorithm, then we can take a boundary and project it onto another map to do the filling in. With this mechanism, you can separate the interior from the exterior of a bounded region. This is part of Regier's model and part of the process of computing image schemas.
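
The filling-in idea can be sketched as a flood fill on a grid. This is only an illustration, not Regier's implementation: the grid representation, the function name interior_mask, and the choice to spread through 4-connected neighbours starting from the edge of the map are all assumptions made for the example.

```python
from collections import deque

def interior_mask(boundary):
    """Return a grid marking the cells enclosed by the boundary.

    boundary: list of equal-length lists of 0/1, where 1 marks a boundary cell.
    The fill starts at the edge of the map (the 'outside') and spreads inward;
    whatever it cannot reach, and is not boundary, is interior.
    """
    rows, cols = len(boundary), len(boundary[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the fill with every non-boundary cell on the edge of the grid.
    for r in range(rows):
        for c in range(cols):
            on_edge = r in (0, rows - 1) or c in (0, cols - 1)
            if on_edge and boundary[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    # Spread through 4-connected neighbours, stopping at boundary cells.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                if boundary[nr][nc] == 0 and not outside[nr][nc]:
                    outside[nr][nc] = True
                    queue.append((nr, nc))
    return [[boundary[r][c] == 0 and not outside[r][c] for c in range(cols)]
            for r in range(rows)]

# A 5x5 scene whose square boundary encloses a single cell.
scene = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
mask = interior_mask(scene)
```

Nothing in the fill depends on the boundary being square, so any closed shape is handled the same way, which is the point of the container schema.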

How do you do an orientation schema? There are orientation sensitive cells in the visual system which respond to particular orientations of edges and lines on maps. If there is a scene with a circle directly over a square, we have a very good example of 'above', but if the circle is somewhat to the side of the square, we have a less good example of 'above'. Also, there are difficulties if the shape of the landmark changes; for instance, if the landmark is a triangle, the shape of the triangle has an effect on the possible positions of the trajector which can be labeled 'above'. Scenes which illustrate this problem are on p. 84 in HSP. The problem is: how can we get a system to learn this? For his system Regier used a computation involving two orientations: that of the closest perpendicular between the landmark and the trajector, and that of the line connecting the centers of mass of the landmark and the trajector. The closer these two orientations come to vertical, the better an example of 'above' we get. The details of the computation are explained in Ch. 5 of HSP.
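
A rough sketch of the two-orientation idea follows. It is not the computation from Ch. 5 of HSP: the point trajector, the landmark given as a list of boundary points, the scoring function (1 + cos(angle)) / 2, and the plain average of the two scores are all assumptions chosen to keep the example small.

```python
import math

def angle_from_vertical(src, dst):
    """Unsigned angle (radians) between the vector src->dst and straight up."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    return abs(math.atan2(dx, dy))  # 0 = straight up, pi = straight down

def above_score(landmark_points, trajector):
    """Hypothetical goodness of 'above' in [0, 1]; y is taken to increase upward."""
    # Closest-point orientation: from the nearest landmark point to the trajector.
    nearest = min(
        landmark_points,
        key=lambda p: math.hypot(p[0] - trajector[0], p[1] - trajector[1]),
    )
    # Center-of-mass orientation: from the landmark centroid to the trajector.
    n = len(landmark_points)
    centroid = (sum(p[0] for p in landmark_points) / n,
                sum(p[1] for p in landmark_points) / n)

    # Assumed scoring: 1 when an orientation is vertical, 0 when it points down.
    def alignment(angle):
        return (1 + math.cos(angle)) / 2

    return (alignment(angle_from_vertical(nearest, trajector))
            + alignment(angle_from_vertical(centroid, trajector))) / 2

# Boundary points of a small square landmark.
square = [(0, 0), (1, 0), (2, 0), (0, 1), (2, 1), (0, 2), (1, 2), (2, 2)]
directly_over = above_score(square, (1, 4))   # best example of 'above'
off_to_side = above_score(square, (4, 4))     # weaker example
beside = above_score(square, (4, 1))          # level with the landmark
```

With these assumptions the three scores decrease in the order listed, matching the intuition that a trajector directly over the landmark is the best 'above'.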

Anne Herskowitz later noted that even this computation would fail. If the trajector were moved away from the landmark on a path that kept the closest perpendicular and center of mass angles the same, it would eventually be too far away to be 'above' the landmark, but the system would still classify it as 'above'. Regier's latest answer to this problem involves vector sums. Experimentation involving the arm movements of monkeys has shown that there are topographic maps in the brain which indicate orientation, but there is no one neuron in these maps which would indicate which way the monkey was going to move its arm. Instead, many cells fired for each arm movement. The activity of these orientation and direction cells was recorded, and when the activations were summed, the result told which way the monkey would move its arm. So a distributed property of the map corresponded to the action of the monkey. Regier has a new model that treats 'above' as a distributed property, but it has not yet been extended to the rest of HSP.
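
The vector-sum idea can be illustrated with a toy population of direction cells. The eight preferred directions and the rectified cosine tuning curve below are assumptions made for the sketch; they are not taken from the monkey experiments or from Regier's newer model.

```python
import math

def population_vector(preferred_angles, activations):
    """Direction (radians) of the activation-weighted sum of preferred directions."""
    x = sum(a * math.cos(t) for t, a in zip(preferred_angles, activations))
    y = sum(a * math.sin(t) for t, a in zip(preferred_angles, activations))
    return math.atan2(y, x)

# Eight hypothetical cells with preferred directions every 45 degrees.
preferred = [math.radians(d) for d in range(0, 360, 45)]

# Assumed tuning: each cell fires in proportion to the cosine of the angle
# between its preferred direction and the actual movement, never below zero.
movement = math.radians(60)
activity = [max(0.0, math.cos(t - movement)) for t in preferred]

decoded_degrees = math.degrees(population_vector(preferred, activity))
```

No cell prefers 60 degrees and several cells are active at once, yet the summed vector points at 60 degrees. The direction is a property of the whole map rather than of any one cell.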

Regier began his project with the mini-language acquisition problem. Given a scene with objects and a description of the object relations, build a system which can associate the scenes and the descriptions and, for any new scene-description pair, tell if the description accurately describes the scene. Also, the same model should do this for any language. Regier's model focuses on a more specific problem in that it doesn't use whole sentences as the descriptions, only key words. When children learn language they begin with single words, so there is a motivation for building models that learn single words. But it doesn't address the problem of learning syntax.

Regier's model does work cross-linguistically. The model learned spatial terms for English, German, Bengali, Mixtec and Japanese. This is only a few languages, but it covers a significant range of language families in the world, so it supports the claim that the model will work for spatial terms for any language, needing only slight modifications.

This can be done for languages which are very different. For instance, in Mixtec, the spatial terms are based on bodily projections to the landmark. A comparison of Mixtec and English terms is given on pp. 23-24 of HSP. Also, languages as similar to English as German have different spatial terms; German 'auf' is like English 'on' for horizontal surfaces, while 'an' is for vertical surfaces. The model was able to learn these highly varied systems because it uses spatial primitives, such as contact, interior, etc., and combines them in various ways to match the terms of each different language.

If you can model the components of image schemas, then you can learn how they're put together in various languages. How can these associations be learned? Regier's model is partially structured and partially PDP. PDP alone will not work for this problem because back propagation nets are position dependent: they can't learn to detect a feature if it is in different places in different examples. Obviously, this is a problem for learning spatial features. The structured section is largely responsible for picking out elementary features of a scene such as containment, contact, interior, exterior, etc. The PDP section learns how the features combine in the lexical items of a language. It is trained in a way similar to the standard back propagation method, in which a training run consists of giving some input to the system and turning on one of the output nodes at the top.

The pictures used for the system are admittedly abstract and artificial. When people do this, they may project real world scenes onto the abstract scenes, which might influence their interpretations. The goal of models such as this is to provide an explanation which is neurally plausible and which may be part of the ultimate solution, but there is no claim that the brain learns spatial terms the way the model does. Even this simplified situation is very difficult to model. There are limitations in our knowledge about the brain, so further modifications await work being done in cognitive psychology, neuroscience, etc.

****cii slide*****

A slide was shown of the set of training examples used to train the system for Mixtec 'cii'. Each dot represents a separate data item for training. This model will train up on a couple hundred examples rather than thousands like some PDP nets. The structured part of the network in part makes this possible. The lower pictures are the output. The size of the dots on the output diagram indicates the degree of activation of the 'cii' node if the trajector is in that location. A large dot indicates a high degree of activation; a small dot or no dot indicates a low or zero degree of activation. The diagrams indicate the receptive fields of the 'cii' node. The diagram also shows that there is still some error. The left pair of diagrams represents the training and output for 'cii' with a landmark which has a longer vertical than horizontal axis. In this case, it also has an orientation (indicated by the blue dots). The model was trained on examples with the trajector on the left side of the landmark and tested on examples on the right side of the landmark. This illustrates that the model has learned that 'cii' doesn't mean left or right (which is based on the orientation of the observer), but means 'along the long axis relative to the orientation of the landmark'. So the model has to learn terms which depend on different features, not just features of English, such as 'left' and 'right'.

Another interesting part of the model is that it learns with no explicit negatives. In language acquisition, children learn with very few corrections from adults. Figure 4.4, p. 64 in HSP shows two types of training sets for 'outside'. The first set (a) is ideal; the model is told that all the circles are outside and all the X's are not outside. But this includes explicit negatives, and children don't use those to learn. The second set (b) is more realistic; the model is told that the circles are examples of 'outside' and the other examples are given other labels. There is a theory in child language development that children go through a period in which they assume that a situation described in one way can't be described in any other way. This is not true, of course, but for spatial relations it can be a helpful assumption. For this example, the model would assume that any examples labeled 'outside' were actually outside and all other examples were not outside. This could cause some problems because training examples labeled 'below' are also outside, even if they're not labeled 'outside'. So strict exclusivity won't work, but weak exclusivity will.

********Fixing the obvious solution slide************

The formulas for the error using implicit negatives are in section 4.5 of the text (pp. 66-69). The main idea is that the error for negatives is discounted by some amount Beta. Beta usually corresponds to the number of elements in the contrast set, but the model isn't very sensitive to changes in Beta. In the brain, there isn't back propagation, but there is some kind of error correction or feedback. It would make a lot of sense if it had a weak negativity principle. In any language learning, if we learn the name for something, that gives us some evidence that other objects are not also called by that name. The contrast set itself is in general determined by frame semantics. This model, for instance, works in the spatial domain. On p. 74 of HSP, figure 4.9 gives the results of training on 'outside' with all positives, strong implicit negatives and weak implicit negatives. It is clear that training on weak implicit negatives is the only one which works pretty well.
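
The discounting idea can be sketched as follows. The exact formulas are the ones in section 4.5 of HSP; this simplified version assumes a plain squared error per output node, and the function name, the example output values and the value Beta = 0.1 are all made up for illustration.

```python
def implicit_negative_error(outputs, positive_label, beta):
    """Squared error for one scene that carries a single positive label.

    outputs: dict mapping each spatial term to its output activation in [0, 1].
    The labeled term is a positive (target 1). Every other term is treated as
    an implicit negative (target 0) whose error is scaled by beta.
    """
    total = 0.0
    for label, out in outputs.items():
        if label == positive_label:
            total += (1.0 - out) ** 2
        else:
            total += beta * (0.0 - out) ** 2
    return total

# A scene labeled 'below'. The trajector is also outside the landmark, so a
# high 'outside' output is actually correct even though it was not the label.
outputs = {"outside": 0.9, "below": 0.8, "inside": 0.1}

strong = implicit_negative_error(outputs, "below", beta=1.0)  # strict exclusivity
weak = implicit_negative_error(outputs, "below", beta=0.1)    # weak exclusivity
```

Under strict exclusivity most of the error comes from punishing the correct 'outside' response. With the discount, that penalty shrinks and the explicit positive dominates.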

It is important to remember that the network is trained on lots of examples using different sizes and shapes of landmarks and trajectors. It has to learn the general properties of the spatial relations. In other words, the model doesn't learn a concept for 'outside' for a triangle, square and rectangle separately. It learns the concept in general and is tested on various shapes, some of the results of which are printed in the text.

****structured subnets slide*******

The basic architecture of the system is given in the figure on p. 12 in HSP. The inputs to the system are scenes with figures in some spatial relation. Each figure is previously labeled as a trajector or a landmark. One interesting project would be to modify the model to learn which figure is the trajector and which is the landmark given a linguistic cue with syntax which makes those roles explicit, for instance "circle above square". Another problem which hasn't been solved is how to go from individual word meanings to syntax.

The model has two major sections which do preprocessing. One is directional and the other non-directional. When preprocessing is involved, the model has to be designed. For this model's design, Regier had three constraints. The structured section had to (1) be biologically plausible, (2) reflect linguistic constraints and (3) work within computational constraints. Some features of spatial terms, such as 'contact', are not dependent on direction or angle, but others, such as 'above', are. The model takes scenes as inputs, does the computations of the structured parts, and feeds the results of those computations into the back prop net. In training, the weights of the back prop net are changed. The input to the back prop net should be thought of as features. The net learns which features are correlated with which spatial terms in each language. Actually, there is some training in the structured part as well. The theta-node layer is trained to choose which features are even sent to the upper layer.

The non-directional part of the model computes such features as 'inclusion'. See figure 5.17 in HSP, p. 99. A scene with a landmark and a trajector is given as input, and from that a landmark interior map and a trajector boundary map are computed. (There are neural mechanisms that do something like this.) There is another neural map above these which computes features through its connectivity with the other two maps. In the model, this is all hard wired. The feature map has cells, one for every point in the map, with center-surround receptive fields. There are cells in the brain with this kind of receptive field. The cells in the feature map receive input from the cells in the lower maps.

One cell in the feature map receives input from a set of cells in the landmark interior map which are in a center-surround configuration. The center cell fires if it is in the interior of the landmark and sends excitatory activation to the feature map cell. The surround cells fire if they are in the exterior of the landmark and send inhibitory activation to the feature map cell. The surround cells are weighted less than the center cell. The feature map cell also receives excitatory activation from a cell in the trajector boundary map. (These cells correspond to the same point on the input scene.) If the feature map cell receives activation from the trajector boundary map and the right pattern from the landmark interior map (on center, off surround), it will fire, indicating inclusion. Because the feature map cell is gated by the trajector boundary cell, it will detect the part of the trajector boundary which is inside the landmark interior.
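
A minimal sketch of this gated center-surround computation on small grids follows. The 3x3 surround, the weight values 1.0 and 0.1, the rectification at zero, and ignoring neighbours that fall off the edge of the map are assumptions for the example, not details taken from HSP.

```python
def inclusion_feature_map(interior, traj_boundary, w_center=1.0, w_surround=0.1):
    """Gated on-center, off-surround response at every map location.

    interior, traj_boundary: equal-sized grids (lists of lists) of 0/1.
    w_center and w_surround are illustrative values used at every location.
    """
    rows, cols = len(interior), len(interior[0])
    feature = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not traj_boundary[r][c]:
                continue  # gate closed: no trajector boundary at this point
            # Count the surround cells that lie in the landmark exterior.
            exterior = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                        exterior += 1 - interior[nr][nc]
            drive = w_center * interior[r][c] - w_surround * exterior
            feature[r][c] = max(0.0, drive)  # fire only on net excitation
    return feature

# Landmark interior: a 3x3 block in the middle of a 5x5 map.
interior = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
# Trajector boundary: three points running from outside the landmark to its center.
traj_boundary = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
feature = inclusion_feature_map(interior, traj_boundary)
```

The trajector point at the landmark's center responds fully, the point in the landmark's corner responds weakly because much of its surround is exterior, and the point outside the landmark does not respond at all.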

On top of the feature map is a head node, which gathers information from the feature map and takes either the maximum activation value or the average. If it were taking the average for the example illustrated in figure 5.17, it would give you a measure of how good an 'inside' the example is. So for 'inside' we need an averaging grandmother node. For a concept like contact, one point of contact is all you need, so if you know that the maximum activation in the feature map is more than 0, you have contact. (Detecting contact also requires a different kind of center-surround organization; see HSP, p. 100.)
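
The difference between the two kinds of head node shows up on a pair of toy feature maps. The activation values are invented, and averaging only over the listed cells (rather than over the whole map) is an assumption of this sketch.

```python
def head_nodes(activations):
    """Return (maximum, average) over a list of feature map activations."""
    return max(activations), sum(activations) / len(activations)

# Hypothetical feature map activations at four trajector boundary points.
fully_inside = [1.0, 1.0, 1.0, 1.0]    # every boundary point is in the landmark
partly_inside = [1.0, 0.5, 0.0, 0.0]   # the trajector straddles the landmark edge

max_full, avg_full = head_nodes(fully_inside)
max_part, avg_part = head_nodes(partly_inside)

# A contact-style feature only needs one active point, so the maximum suffices.
has_active_point = max_part > 0
```

The maximum cannot tell the two scenes apart, while the average grades the second scene as a poorer 'inside'.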

The model learns how much to weight the average and the maximum for a given feature. In the beginning, the model uses learning from the error in the PDP net to increase the weights on the type of head node which works best for each feature and decrease the weights of the other head node. The system also learns which weights on the center-surround cells are best. The formulas for learning in this part of the system are given below. The way to propagate the learning below the PDP part of the system is to have it go through the nodes which do the average or maximum calculation to the weights below which feed into those nodes. So for error passing through the average node, the formula is divided by n, the number of nodes. For the maximum node, the function isn't differentiable, so Regier chose to change the weight which was the max, since most of the error would come from that.
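
How error might be routed through the two kinds of head node can be sketched like this. It illustrates only the two rules stated in the notes (divide by n for the average, send everything to the winner for the maximum); the function names and numbers are hypothetical, and the full learning formulas are not reproduced here.

```python
def average_backward(upstream_error, n):
    """Each of the n inputs to an average node receives 1/n of the error."""
    return [upstream_error / n] * n

def max_backward(upstream_error, activations):
    """Only the input that produced the maximum receives the error."""
    winner = max(range(len(activations)), key=activations.__getitem__)
    errors = [0.0] * len(activations)
    errors[winner] = upstream_error
    return errors

avg_errors = average_backward(0.8, 4)
max_errors = max_backward(0.8, [0.1, 0.7, 0.3])
```

Sending the whole error to the winning input is the usual way of getting around the non-differentiable maximum, and it matches the choice described for the model.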

The center-surround fields correspond to every location on the map, so there are many more than are shown in the figures. In the system, Regier used weight sharing for these cells, so that the weight pattern for all center-surround fields is always the same. This has no biological motivation; it's a modeling convention which gets around the problem of position dependence in back prop nets. If the weights are the same for all locations, the feature will be detected wherever it shows up.
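
Position independence through weight sharing can be shown in one dimension. The kernel values and the two toy scenes are made up for the illustration; the only point is that one weight pattern is reused at every location.

```python
def shared_detector(signal, kernel):
    """Apply one weight pattern (kernel) at every position of the signal."""
    k = len(kernel)
    return [sum(w * signal[i + j] for j, w in enumerate(kernel))
            for i in range(len(signal) - k + 1)]

# Hypothetical on-center, off-surround weights, identical at all locations.
kernel = [-0.5, 1.0, -0.5]

# The same feature (a single active point) at two different positions.
left_scene = [0, 1, 0, 0, 0, 0, 0]
right_scene = [0, 0, 0, 0, 0, 1, 0]

resp_left = shared_detector(left_scene, kernel)
resp_right = shared_detector(right_scene, kernel)
```

The peak response is the same in both scenes; only its position in the response list moves with the feature.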

The directional part of the structured section computes relational orientations and reference orientations. The idea is that for certain calculations it is useful to compute things like the proximal orientation and the center of mass orientation used for 'above'. These features will differ for different terms and languages. In 'above', for instance, the reference orientation is upright vertical, while in Mixtec 'cii' the reference orientation is the major axis of the landmark. Regier's insight was that a lot of spatial relations have to do with computing the closeness of angle between what you're seeing and some reference. For instance, 'across' indicates movement in the direction of the minor axis, while 'along' indicates movement in the direction of the major axis.

The angles are computed from the same landmark and trajector boundary maps. The theta nodes compare the relational orientation to the reference orientation to see how close they are. This is treated as a feature represented by a theta node. The theta nodes have three parameters: sine theta, cosine theta and the slop (sigma), how much room for movement there is. All of these are learned. Regier assumes that there is an innate notion of computing and comparing angles, but languages differ in what they use as a reference. In Mixtec, for instance, the body is the reference. As the system learns, it reduces error by changing its notion of what the appropriate reference is. The formulas used for these computations are given and discussed in HSP, section 5.2, pp. 89-93. Again, these parameters are learned by pushing the error from the PDP section back through the structured section.
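
One plausible form for a theta node is a bell-shaped tuning curve around the reference orientation. This is a guess at the general shape only; the actual formulas are the ones in HSP section 5.2, and the Gaussian falloff, the function name and the sigma values below are assumptions.

```python
import math

def theta_node(angle, sin_ref, cos_ref, sigma):
    """Response in (0, 1] that peaks when angle matches the reference orientation.

    sin_ref, cos_ref: sine and cosine of the reference orientation.
    sigma: the slop, i.e. how much deviation from the reference is tolerated.
    """
    # Cosine of the difference between the two angles, via the angle
    # subtraction identity; clamped to guard against rounding error.
    cos_diff = math.cos(angle) * cos_ref + math.sin(angle) * sin_ref
    diff = math.acos(max(-1.0, min(1.0, cos_diff)))
    return math.exp(-(diff ** 2) / (2 * sigma ** 2))

# Reference orientation: upright vertical (90 degrees), as for 'above'.
up = math.pi / 2
aligned = theta_node(up, math.sin(up), math.cos(up), sigma=0.5)
sideways_narrow = theta_node(0.0, math.sin(up), math.cos(up), sigma=0.5)
sideways_wide = theta_node(0.0, math.sin(up), math.cos(up), sigma=2.0)
```

Changing sin_ref and cos_ref moves the peak to a different reference orientation, and a larger sigma makes the node more tolerant of deviation, which is the sense in which all three parameters can be adjusted by learning.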

One problem not addressed by this system is the effect of gravity. Figure 5.2, p. 84 in HSP shows two scenes in which the exact same orientation of trajector and landmark has different values for 'above'. In the first scene, the trajector is above the landmark because of the effect of gravity. Verticality has to be computed relative to gravity, but it isn't in this model.

There is an extra section of the model used for motion, such as 'through' and 'into'. A figure of this section is given on p. 109 in HSP. Input for this kind of learning requires 5 input scenes. In this section the features of the first and last scene are saved and it computes the maximum, minimum and average of the features over the 5 input scenes. These calculations are put into another back prop net and are used to learn 'through', 'out of', etc. The max and min computations are useful for detecting the existence and degree of certain features over time. For 'through' the trajector has to start outside the landmark, go inside it and then go outside it. So the max value for the inclusion feature has to be high and also the min value for inclusion has to be 0. Any other results for these computations would not be a good example of 'through'.
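
The summary statistics over the 5 scenes can be sketched as follows. The inclusion values are invented, and the hand-written rule looks_like_through only stands in for what the second back prop net would have to learn; it is not part of the model.

```python
def summarize(frames):
    """frames: values of one feature (e.g. inclusion) over the scenes, in order."""
    return {
        "first": frames[0],
        "last": frames[-1],
        "max": max(frames),
        "min": min(frames),
        "avg": sum(frames) / len(frames),
    }

def looks_like_through(summary, high=0.8):
    # Hypothetical rule: start outside, get well inside, end outside again.
    return (summary["first"] == 0 and summary["last"] == 0
            and summary["max"] >= high and summary["min"] == 0)

# Inclusion of the trajector in the landmark over 5 scenes (made-up values).
through_movie = [0.0, 0.5, 1.0, 0.5, 0.0]
into_movie = [0.0, 0.25, 0.5, 0.75, 1.0]

through_summary = summarize(through_movie)
into_summary = summarize(into_movie)
```

Both movies reach a high maximum for inclusion, so the maximum alone cannot separate 'through' from 'into'. The saved first and last scenes do.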

The model has some limitations. (1) Its scale is very small. It only works with a few terms in a limited domain. The question is whether or not it would scale up. (2) Uniqueness/Plausibility: just because this model works, does that mean that it is the way the brain does it? (3) Grammar isn't addressed. (4) Abstract concepts aren't addressed. (5) It doesn't do inference. In fact, the learning it does is rather weak. It only labels a scene; it can't generate anything because it is not set up to do that. (6) The representation is a back propagation net, which is a very impoverished representation for our kind of computational models. We need a very different kind of connectionist model to model language more realistically.