Lectures 16-17. Bailey Model for Learning Verbs of Hand-Actions 
 

March 15, 17, 1999

There was an overview of the NTL task, which is to model concepts as embodied and grounded. Regier's model of the last two lectures provides an embodied model for basic spatial concepts. His model ties the conceptual, linguistic level to the structured connectionist level. The next two systems which will be discussed introduce a new computational level. This is an abstraction from the connectionist level, but the abstraction is constrained by the requirement that the computational model be mappable to the structured connectionist level and ultimately to the neural level. The model discussed in these two lectures is David Bailey's model of hand-action verbs.

Bailey's model is done at the computational level because hand-action verbs are obviously more complex than basic spatial concepts; in addition, Bailey's model learns concepts and is not only able to recognize new examples, but is also able to execute the action associated with the verbs.

For the computational level, two basic constructs are used: feature structures and X-schemas. Feature structures are similar to the constructs used in computer science and linguistics; they are feature-value pairings.

The main claim of Bailey's model is that the meaning of an action verb, such as push, is the action itself. A slide was shown of a study of the first words of 1.5- to 2.5-year-old children. The words include mostly names of objects but also some general action verbs such as sit, go, open, hit. There are also some of the spatial relations of Regier's model, such as up, in, out, down, on. This shows that models of learning which use single verbs have a general correlate in children's language acquisition; they are not completely artificial. Also, children usually learn their first verbs as labels of their own actions. Like Regier's model, Bailey's model is also designed to work cross-linguistically. As expected, different languages will break up the domain of hand actions differently. Tamil 'pudi' refers to clutching, holding, restraining, and catching with high force. Spanish 'pulsar' refers to pressing with the index finger and 'presionar' refers to pressing with the palm. In Chinese and other languages, there is no word for 'drop'. There is a word for 'release', but it doesn't distinguish whether or not the objects are supported at the time of release.

****model architecture slide*****

The model has three basic kinds of structure. In the illustrations, the green hexagons represent actions; the red squares represent linking feature structures; and the blue circles represent actual word senses, which can be thought of as dictionary entries. For instance, the action (green, hexagon) structure could represent the action of pushing; the lexical (blue, circle) structure could represent the lexical sense of push; and the linking feature structure (red, square) links the other substructures. The features are extracted from the X-schemas and represented in feature structures; when labeling happens, word senses are linked to properties of the feature structures. For instance, for 'shove' the feature of force will have a higher value and the feature of continued contact will have a lower value because in shoving, the hand stays on the object for a shorter amount of time than for just pushing. The values of features are generally graded, rather than all or none. The structure of the model allows the features and values used for learning the senses of words to be extracted from the features and values used for executing actions. These features aren't in some separate language system.

To build this model, first, a simple model of motor control is needed. Animal control systems have synergies, fixed routines for basic movements, such as grasping or walking. These are actually complicated muscular and neural systems. For the model, these subroutines are taken as motor control primitives. They are parameterized. For instance, they have a particular force, etc. Such parameters are perhaps not experimentally separable, but they are neurally separable and available to conscious attention. These are the parameters used for the language learning model, and the claim is that any parameters which are not accessible to conscious attention are not available to language.

Motor actions are computationally complex. They require concurrency, for instance. If you want to grab something, you have to move your arm and, at the same time, preshape your hand for the object you want to grab. Walking also requires several concurrent movements. Motor actions also involve asynchrony: they can be interrupted by other systems. For instance, if you are walking and you lose your balance, the walking process is interrupted until you can regain your balance or fall down. Executing schemas, or X-schemas, are computational formalisms which can model motor actions with concurrency and asynchrony in a way that will also map to the connectionist and neural levels. This is the formalism used in the green, lower part of the model.

*******"slide" X-schema slide*******

The X-schema for the action 'slide' is given above. An X-schema is similar to a flow chart or a finite state machine in that activation flows through it, but it is also able to model parallel actions and asynchronous control. X-schemas are an extension of Petri nets (Carl Adam Petri), which are designed to allow coordination of actions which is not dependent on time. In the real world, complex events are made of other events which happen at different times. The complex event isn't dependent on the exact time that the other events execute, only on the fact that they execute at all. Motor control events have this same character. In Petri nets and X-schemas, this property is modeled with tokens, which represent the flow of activation. Neurally, this can be thought of as the flow of neural activity. When a token, which looks like a dot, appears in a state (drawn as a circle) on the X-schema, it enables the next transition (drawn as a square and labeled with an action) to fire. The transition uses up the tokens in the input states and gives tokens to the states which follow it (output states). A transition may require more than one token to fire. It may have more than one input state. Or the connection from an input state may have a number, such as 2, which indicates that the state has to have two tokens in it before the transition can fire. On the slide schema, the 'apply hand' transition requires two tokens. Also, the transition can deposit more than one token if the connection to an output state requires it. On the slide schema the connection from 'apply hand' indicates that two tokens will be deposited in the following state.
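
The firing rule just described can be sketched in a few lines of Python. This is only an illustration of ordinary Petri-net token passing, not Bailey's implementation; the class names, state names, and the small net fragment below are all assumptions made for the example and only loosely follow the 'slide' schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Transition:
    """A transition with weighted arcs (state name -> number of tokens)."""
    name: str
    inputs: Dict[str, int]
    outputs: Dict[str, int]


@dataclass
class PetriNet:
    marking: Dict[str, int] = field(default_factory=dict)  # tokens per state

    def enabled(self, t: Transition) -> bool:
        # A transition may fire only when every input state holds enough tokens.
        return all(self.marking.get(s, 0) >= n for s, n in t.inputs.items())

    def fire(self, t: Transition) -> bool:
        if not self.enabled(t):
            return False
        for s, n in t.inputs.items():    # use up tokens in the input states
            self.marking[s] = self.marking.get(s, 0) - n
        for s, n in t.outputs.items():   # deposit tokens in the output states
            self.marking[s] = self.marking.get(s, 0) + n
        return True


# Hypothetical fragment: two parallel branches feed one shared state, and
# 'apply hand' needs two tokens there before it can fire.
start = Transition("start", {"begin": 1}, {"arm_pending": 1, "hand_pending": 1})
reach = Transition("move arm to object", {"arm_pending": 1}, {"ready": 1})
preshape = Transition("preshape hand", {"hand_pending": 1}, {"ready": 1})
apply_hand = Transition("apply hand", {"ready": 2}, {"in_contact": 1})

net = PetriNet({"begin": 1})
net.fire(start)
net.fire(preshape)
blocked = not net.fire(apply_hand)  # only one branch has finished
net.fire(reach)
fired = net.fire(apply_hand)        # both branches done, so it fires
```

Note that nothing in the sketch refers to clock time: 'apply hand' waits for both branches to finish, in whichever order they happen to complete.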

In the 'slide' example, activation moves from the start to a transition which marks the beginning of two parallel actions (noted on the slide with the || symbol). One action is moving the arm, the other is preshaping the hand. Notice that preshaping the hand depends on world knowledge about the object. If the object is small, you preshape your grasp, because you would grasp the object to slide it. If the object is large, you would preshape your palm, because you wouldn't be able to grasp it. When both of these actions are finished, there will be two tokens deposited in the following state, which allows the 'apply hand' transition to fire. Tokens are deposited in the next state and the 'move arm' transition fires. Notice that in the 'move arm' transition, the feature structure information such as direction, force, etc. is involved. As activation moves through the next states and transitions, world information is used to see if the object is at the goal location. If so, a token in the 'at goal' state inhibits 'move arm' and causes the final transition to fire into the 'done' state. If there is instead a token in the 'not at goal' state, the transition which fires causes the 'apply hand' and 'move arm' sequence to continue until the goal is reached. On the top of the model, there is a test for slippage. If slippage is detected, one response may be to tighten the grip. This is an example of the asynchrony in the model.

X-schemas are linked to feature structures (the red section of the model). When the model is obeying a command, for instance 'push', it activates the slide schema and chooses a hand posture (palm or grasp), direction, acceleration, etc. The world state activates such features as goal. These are the parameters of the action. The form of the feature structure includes the names of the features (or slot names) and the possible values of each feature. Given some linguistic input like "push", a parser, which is not part of this model, would fill in the appropriate features and the world state would fill in other features, so that the parameters for the appropriate action are known and the action can be carried out.

In learning, children's language learning mechanisms have access to the parameters and record their values for the instances labeled by the mother. For instance, if the child performs an action which pushes away a small object, s/he will connect the force, direction, hand shape, etc. of that action with the linguistic label of the action given by the mother. Eventually, after some more of this kind of experience, the child will have a sense of 'push' which includes these values for these features. Thus, the blue section of the model, which illustrates the senses of words, looks a lot like the feature structures. The meaning of the word 'push' is given in the values it sets for the parameters. The values are not exact. They have probabilities. For instance, for one sense of 'push', which is for pushing against a wall, there is a 60% probability that you will use the palm posture and 30% that you will use the index finger posture, but a very low probability of using the grasp, as expected.
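
As a rough illustration of what a word sense with probabilistic values might look like, the sketch below stores one probability table per feature and scores an observed action against it. The 0.6 and 0.3 posture values come from the example above; the 0.1 for grasp, the 'direction' table, the function name, and the assumption that features are independent are all made up for the illustration.

```python
# Hypothetical sense of 'push' (pushing against a wall). Only the palm and
# index values are from the lecture; everything else here is assumed.
push_wall = {
    "posture": {"palm": 0.6, "index": 0.3, "grasp": 0.1},
    "direction": {"away": 0.9, "left": 0.05, "right": 0.05},
}


def likelihood(sense, observed):
    """Probability of the observed feature values under one sense,
    treating the features as independent (a simplifying assumption)."""
    p = 1.0
    for feature, value in observed.items():
        if feature in sense:  # features the sense does not mention are ignored
            p *= sense[feature].get(value, 0.0)
    return p


score = likelihood(push_wall, {"posture": "palm", "direction": "away"})
low_score = likelihood(push_wall, {"posture": "grasp", "direction": "away"})
```

An action performed with the palm scores higher under this sense than the same action performed with a grasp, which is the graded behavior the probabilities are meant to capture.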

In the full view of the model, there are different values for the different features of 'push' and 'shove'. In particular, 'shove' shows no feature 'depressable'. The reason is that the model has learned that 'depressable' is irrelevant for shove. That action isn't used on depressable objects, while the action for 'push' is used sometimes on depressable objects, so the senses of 'push' have to include that feature. For each sense of 'push', though, 'depressable' has a different value because one of the senses is for pushing a button, and the other is for pushing against the wall. These parameters can be overridden by other linguistic or world information. For instance, 'push' usually means the direction is away, but 'push left' can change that feature.
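
The overriding of a sense's usual values can be pictured as layered defaults. The sketch below is a guess at the idea rather than a description of Bailey's code: the feature names, the function name, and in particular the order in which linguistic and world information are applied are assumptions.

```python
# Hypothetical most-likely values stored in one sense of 'push'.
push_defaults = {"schema": "slide", "direction": "away", "posture": "palm"}


def fill_parameters(sense_defaults, utterance_features, world_state):
    """Later sources override earlier ones: the sense's usual values, then
    explicit words in the command, then the perceived world state.
    (The relative order of the last two is an assumption.)"""
    params = dict(sense_defaults)
    params.update(utterance_features)
    params.update(world_state)
    return params


# 'push left': the word 'left' replaces the usual direction 'away'.
params = fill_parameters(push_defaults,
                         {"direction": "left"},
                         {"object_size": "small"})
```

The sense itself is left unchanged by this; only the parameters handed to the X-schema for this one action differ.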

After a command is given, the information in the dictionary is sent to the red feature structure which activates the X-schema in the appropriate ways and the action is carried out.

For learning, the child carries out an action and the mother labels it. These features are accessible to the language learning mechanism which uses them to build the dictionary senses of the linguistic label, as shown in the model. The key is that the features used by the language learning mechanism to make dictionary senses come from the motor system and perception of the world state. The features do not come from the language, but from action itself. So the features of the linking feature structure are built in, in a sense. They will include things needed to drive the motor system, such as force, direction, duration and features which come from the perceived world, such as object size. The perceived world may be different from the actual world, but because the perceived world is the only thing accessible to any learning mechanism, that is the only way world state information can be obtained.

The learning cannot be done with back propagation because (1) this model needs to be able to have some sense for a word after just one example (we know that children can do this), and (2) the model needs to not only learn, but also obey commands, which back prop systems can't do. The key idea for learning in this model is that X-schemas are too complex to learn, but the features which govern their parameters are learnable. This is a scientific claim crucial to the model.

The model for hand-action verbs assumes a general process of learning, which goes roughly as follows. We know that children learn actions before they learn language to label those actions. The features used to drive the schemas are present when they execute and are further available to the learning mechanism. So we can assume that the feature values associated with an action are available during language learning. When the child performs the action, the mother labels it, and the child then knows that the label correlates with that action in that situation. The child has further experiences in which s/he performs some similar action in a similar situation and the mother uses the same label. The child eventually learns to associate the label with some range of feature values and types of situations. For instance, the first time a child pushes a block, s/he will use a particular handshape and will hear "push" from the mother. At another time, the child may push a cup and will use a different handshape. The mother will also label this action as push. So the child has to decide whether there are two different meanings for 'push', one used for blocks and the other for cups, or whether there is one meaning for 'push' in which the value of the feature 'hand shape' can vary. For this case, the child chooses the second option. But in some cases, s/he may decide that there are two senses for 'push'. For instance, if the child presses a button, s/he is using a different motor program (the depress schema) than that used for pushing a block (the slide schema). The mother still labels this action as 'push', but the actions are very different. So the child is likely to decide that there are two different senses of 'push', one used for objects which can be handled and one for depressable buttons. The child further learns that some features are irrelevant to some senses. So 'elbow extension' is important for pushing blocks, but not for pushing buttons. The child learns this by having several experiences with pushing buttons, which are all labeled 'push', but which have widely varying values for 'elbow extension'. So the child learns that pressing a button is 'push' whether the elbow is fully extended or completely bent.

Bailey's model does this kind of learning using a Bayesian learning algorithm, model merging. For each instance of an action and label, the model makes a new word sense with that label. The new word sense uses the feature values of the action as the basis for the possible ranges of its feature values. It assumes that each parameter in the word sense has a range of possible values which is centered on the particular value from the one labeled action. If the model performs another action and that action gets the same label as the previous action, the model determines whether the new word sense should be merged with the pre-existing word sense, or whether it ought to be kept as a separate sense for the word. In general, it chooses to merge word senses which have similar sets of feature values. This approach allows one-trial learning, in which after only one instance there is some sense for the word.

When similar word senses are merged, the ranges of parameter values for the resulting sense are broadened to include the values from both 'experiences'. Having fewer senses for the same word is desirable because the model generalizes better that way. In other words, if there are many narrow senses for the word 'push' which encode exactly the feature values of each push example, then the model is not likely to recognize a new push action as 'push'. So the senses should be able to generalize well enough to recognize these new instances. However, if the sense for 'push' is too general, it will label as 'push' instances which aren't. There is a trade-off between fit to the data and generalizability of the model. This problem is parallel to the fitting of a curve or a number of curves to a set of data points on a graph. The model uses Bayes' Rule to measure this trade-off and determine when two word senses should be merged. Each separate word sense can be viewed as a curve which fits a set of parameter values.

The basic learning algorithm for Bayesian model merging is: (1) Create a new word sense for each training example, generalizing each feature slightly. (2) While there exist good candidate merge pairs: (a) choose the best merge pair; (b) replace that pair with a new merged sense. During merging, probability distributions are combined to form more general word senses.
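
The control flow of this algorithm can be sketched as a greedy loop. The sketch is a simplification under stated assumptions: senses are stored as raw value counts per feature, the slight generalization in step (1) is omitted, and the merge score is a simple distribution-overlap measure with a fixed threshold standing in for the Bayesian criterion the model actually uses. All function names, feature names, and example data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def new_sense(example):
    """Step 1: one sense per example, stored as value counts per feature."""
    return {f: Counter({v: 1}) for f, v in example.items()}


def merge(a, b):
    """Merging adds the counts, which broadens each feature's distribution."""
    return {f: a.get(f, Counter()) + b.get(f, Counter()) for f in set(a) | set(b)}


def similarity(a, b):
    """Stand-in merge score (NOT the Bayesian criterion): average overlap
    between the two senses' normalized value distributions."""
    feats = set(a) | set(b)
    total = 0.0
    for f in feats:
        ca, cb = a.get(f, Counter()), b.get(f, Counter())
        na, nb = sum(ca.values()) or 1, sum(cb.values()) or 1
        total += sum(min(ca[v] / na, cb[v] / nb) for v in set(ca) | set(cb))
    return total / len(feats)


def merge_senses(examples, threshold=0.5):
    senses = [new_sense(e) for e in examples]
    while len(senses) > 1:
        # Step 2a: choose the best candidate pair.
        i, j = max(combinations(range(len(senses)), 2),
                   key=lambda p: similarity(senses[p[0]], senses[p[1]]))
        if similarity(senses[i], senses[j]) < threshold:
            break  # no good candidate pairs remain
        # Step 2b: replace the pair with the merged sense.
        merged = merge(senses[i], senses[j])
        senses = [s for k, s in enumerate(senses) if k not in (i, j)] + [merged]
    return senses


# Three hypothetical examples, all labeled 'push'.
examples = [
    {"schema": "slide", "posture": "palm", "depressable": "no"},
    {"schema": "slide", "posture": "grasp", "depressable": "no"},
    {"schema": "depress", "posture": "index", "depressable": "yes"},
]
senses = merge_senses(examples)
```

With this data the two slide examples merge into one sense whose posture distribution covers both palm and grasp, while the button press stays a separate sense.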

***********a learning illustration slide***********

The slide above illustrates the learning algorithm. In Ex1, the model does 'push' and records the feature structure for that action. That feature structure is used to build the first sense for push. Ex2 is another action which is labeled push and which has a very similar feature structure to Ex1. The model decides that the senses for Ex1 and Ex2 should be merged. You can see that the value for duration is generalized to match both examples. Ex3 has a feature structure which is fairly different from the previous two. The model decides that it should be a separate sense. Ex4 again has a feature structure similar to Ex1 and Ex2, so the model merges Ex4 with the previous sense. The value probabilities for 'posture' are generalized for all three instances. The 'duration' feature has been dropped because it has had a different value in each instance. Thus, the model determines that 'duration' is an irrelevant feature for this sense of 'push'.

*********Rating of a model slide*******

The above slide shows how Bayes' Rule is used by the model to determine the best set of word senses. The best model is the one with the maximum probability given the training data. This probability is proportional to the product of the probability of the model and the probability of the training data given the model. These terms indicate how good the model is in simplicity and how well the data fits the model. The goodness of the model is biased towards simpler models. The term P(l|v, m) is determined by the probability of getting the data for a particular word sense weighted by the frequency of that sense. In other words, this term refers to the probability of a particular set of data given the various models (Bayesian fitting).
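
In symbols, writing m for a candidate set of word senses and D for the training data (notation chosen here, not taken from the slide), the rating can be written as follows. The denominator P(D) is the same for every candidate m, which is why only the product in the numerator has to be compared.

```latex
\hat{m} \;=\; \arg\max_{m} P(m \mid D)
        \;=\; \arg\max_{m} \frac{P(D \mid m)\, P(m)}{P(D)}
        \;=\; \arg\max_{m} P(D \mid m)\, P(m)
```

Here P(m) is the prior that favors simpler models (fewer, broader senses) and P(D | m) measures how well the senses fit the labeled examples.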

This model was trained on 165 examples with English labels. (There were 41 instances of push, for which the model developed three separate senses.) This is small compared to the experience a child would have, but it is similar to a child's experience in that there are no iterations. The model arrived near the optimal number of word senses, 21. The optimal number of word senses is the point at which the model makes the fewest mistakes. Mistakes increase if there are more or fewer word senses. The testing results were also very good. The model was correct about 80% of the time. The missed responses were for situations which could be labeled with two words. For instance, 'push' is given when 'press' is expected.

The obeying mode was tested by giving the model a command, seeing which action it produced, and then asking it to label that action. It should give the same label for the action as the command.

The model has been tested on Farsi, Hebrew and Russian using the same parameters. These languages require the model to make distinctions different from those required in English. For instance, in Farsi, there are different verbs for applying pressure and for pushing away. The model performed well on Farsi and Hebrew, but not on Russian, which includes information in the verb that the model is not equipped to handle. Thus, the model needs more structure to work for all languages.