Lecture 25. Extending the NTL approach.  
 

April 21, 1999

This lecture will focus on how far the NTL approach to language is likely to get us. From the class evaluations, one comment was that there has been no discussion of other theories about how language is realized by the brain. There really aren't any at the same level of specificity. There was a quote by Alan Turing (1948) put on the screen:

"Of the above possible fields the learning of languages would be the most impressive, since it is the most human of these activities. This field seems however to depend rather too much on sense organs and locomotion to be feasible."

That was the earliest story about the need for embodiment in a scientific theory of how the brain handles language. There has been very little since. The field of linguistics is dominated by Noam Chomsky, whose view on the matter is given in the quote below from 1993:

"In fact the belief that neurophysiology is even relevant to the functioning of the mind is just a hypothesis. Who knows if we're looking at the right aspects of the brain at all. Maybe there are other aspects of the brain that nobody has even dreamt of looking at yet. That's often happened in the history of science. When people say the mental is the neurophysiological at a higher level, they're being radically unscientific. We know a lot about the mental from a scientific point of view. We have explanatory theories that account for a lot of things. The belief that neurophysiology is implicated in these things could be true, but we have very little evidence for it. So, it's just a kind of hope; look around and you see neurons; maybe they're implicated."

The next two lectures will be driven by students' questions, so we can discuss why most linguists don't study language in an embodied context, or any other questions which have come up during the year.

Today's lecture is about extending the NTL approach as far as it currently can go. The quest is to build a system which can learn all of a child's early vocabulary, the first 50 words or so. What kind of a model would be needed to do all of that stuff and how could it be built computationally? A slide was put up which showed the results of a study by Katherine Nelson (1973). It had three sets of words which one set of children learned. The data is given for three stages, an 8-word, 30-word and 50-word stage. There are lots of these kinds of studies and they show that the actual vocabularies vary greatly from child to child, even if the children in the study spend a lot of time together. Still, the early vocabularies seem to include similar kinds of words. For this particular study, the 50-word stage includes the following words: Ma, Patty, Daddy, I, Tommy, Baby, Bob, Me, Lassie, Kitty, Lion, Bear, Gun, Eye, Mouth, Nose, Cup, Milk, Snow, Car, Fire, Zauhn, Here, See, On, Throw, Go, Go home, Sit, Bump, Eat, Boom, Hurt, Hi, Bye bye, No, Yeah, Don't know, My, Meow, Moo, Bow wow, Hot.

Given the kinds of systems which have been introduced in this class, which of these words could currently be learned by such systems? Which words are different enough to need new computational or linguistic ideas to build models which would learn these words from examples? One problem with a lot of the early child vocabulary data is that the contexts in which the words were used are rarely included in the write-up of the study. Often children use words in contexts which are very different from the contexts in which adults use the same words. There can't be a learning model without context, so this is a problem for the enterprise.

Another study by Lois Bloom is given above. This study was done with a group of 14 children in the same play room. The words listed are those used by 7-14 of the children in the play room. For instance, all 14 kids used "baby", only 7 used "box." How would we build a system like the ones discussed so far which could learn all of these words from the appropriate examples in an appropriate context? Getting the appropriate contexts and examples is a project in itself, but beyond that, what would a system which could learn these words look like?

From the list, the nouns or thing-words are eliminated. Learning nouns is relatively simple in that the child only needs to see an object and then relate the label to it. But computationally, the problem involves computer vision, which hasn't been solved and is a separate project from the work done at the NTL group. People at Berkeley who do computer vision are Forsyth, Malik and Feldman (to some extent). It's possible that given the context, we would find that these thing-words are actually used by the kids to mean actions or communicative acts. For instance, the child may be using "ball" for "throw," but without context, we will assume that the words simply denote things. Also, words which simply label sounds will be eliminated from the list because they are also just a matter of pairing a word with a perception, in this case a sound perception.

Further, the words which might plausibly have been handled by the Bailey system for verb labeling or the spatial relations labeling system of Regier's can be eliminated. The Bailey system could probably learn words such as go, sit, (maybe) get, (maybe) open. There is always the possibility that these words are used by children in much more complex ways than we expect, but in the simplest sense, the Bailey system could handle these words. The Regier system could handle in, out, on, up and down, etc.

The perspective problem comes up at this point. The child uses words such as 'up' and 'down' as requests to be picked up and put down. Regier's system doesn't handle this. There are three perspectives which have to be accounted for even in the first 50 words. The Agent perspective is the experience of the doer; the child pushes something or puts something in his/her mouth. The Experiencer has the perspective of having things done to him/her. So the child is pushed or is put in the bath. The Observer has the perspective of perceiving things done to some other thing or person. The child sees someone get pushed or sees the milk put in the refrigerator. The experiences of pushing, being pushed and seeing someone else get pushed are very different, but the child has to learn that these are all instances of pushing. Likewise, the child has to learn the general concept of in from the three different perspectives. This is a complicated problem which has to be solved. The embodiment of each of these perspectives is quite different, but has to be put together at some point. This has to be addressed in order to build a system which can learn a child's early vocabulary.
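The three perspectives can be made concrete with a small sketch. This is only an illustration under strong assumptions: the Event record, the perspective function and learn_labels are hypothetical names, not part of Regier's or Bailey's systems, and the sketch takes for granted that the learner already recognizes the same action across perspectives, which is exactly the hard part described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical record of one experienced event."""
    action: str   # e.g. "push"
    agent: str    # who does it
    patient: str  # who or what it is done to


def perspective(event: Event, self_name: str) -> str:
    """Return the perspective from which self_name experiences the event."""
    if event.agent == self_name:
        return "agent"        # the doer
    if event.patient == self_name:
        return "experiencer"  # the one it is done to
    return "observer"         # watching it happen to someone else


def learn_labels(examples, self_name):
    """Map each heard word to the (action, perspective) pairs it labeled."""
    lexicon = {}
    for word, event in examples:
        pair = (event.action, perspective(event, self_name))
        lexicon.setdefault(word, set()).add(pair)
    return lexicon


# Three very different experiences, all labeled "push" by the adult.
examples = [
    ("push", Event("push", "child", "ball")),      # child pushes
    ("push", Event("push", "mother", "child")),    # child is pushed
    ("push", Event("push", "mother", "brother")),  # child watches
]
lexicon = learn_labels(examples, "child")
perspectives_for_push = sorted(p for _, p in lexicon["push"])
actions_for_push = {a for a, _ in lexicon["push"]}
```

In this toy version one word ends up attached to a single action seen from all three perspectives. A real model would have to build that shared action representation from very different sensory and motor experiences.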

The words in the list which haven't been eliminated yet are very complicated. The simpler of these are those which simply label emotions, such as uhoh, whee, yum and oh. If there was some way to have the internal state of a model represent emotions and convey that to some labeler, then these words simply label those internal states. The claim is that all of these early words label something in the kid's early experience, so relating 'yum' to the experience of eating something delicious is not very complicated. Doing this for words such as 'there' or 'more' seems much more difficult.

To some extent you could argue that 'hi', 'bye', 'yes' and 'no' are direct labels of a kid's experience. Children learn to say 'hi' and 'bye' through gestures long before they can talk, so the move from learning the communication act to learning the label may be relatively straightforward. 'No' is minimally (and probably first learned as) a refusal act, but it may develop other purposes as the child develops. However, modeling the concept of refusal is not so easy.

The word 'two' is another issue. There is some evidence that there is a neural ability to automatically discriminate numbers of items up to about 6 or 7. Animals can do this, too. This is called subitizing. So 'two' may just be a label for this kind of existing ability.

The other words, such as 'here', 'there', 'this' and 'that' do not obviously label such innate abilities. They involve spatial concepts relative to the self and they are also communication terms. This is also true of 'hi', 'bye', 'more', 'no more', etc. At an early age, the child is using words to try to affect the actions of the adult, to get something that they want. In other words, if the child says 'there', s/he is trying to tell the hearer where to direct attention. In some sense, a lot of these words on the list can be used this way, but the words above are minimally used this way. In order to handle these words, the child has to have some kind of model of the other person's actions or mind. You need some internal model of the other person. When a child says 'more', s/he makes a speech act because s/he believes it will cause an adult to do some action which will satisfy the child's desire.

In order to handle this we need an internal model in the child's head for what other people do; we need some model of the child's theory of mind. Animals don't necessarily have a theory of mind, but people absolutely do and there is some evidence that it starts before two years old. It is possible that some of the actions associated with these words could simply be associations without an accompanying theory of mind. For instance, if a child says 'up' as a request to be picked up, then the kid may have simply correlated the word with the action, but words such as 'here' and 'there' which request the other person to direct attention in a particular way are not well explained as simple correlations. Some theory of the other person and the other person's mind seems to be required.
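As a rough illustration of choosing a speech act by consulting an internal model of the other person, consider the sketch below. Everything in it is an assumption made for the example: ADULT_MODEL, predict_adult_action and choose_speech_act are hypothetical names, and a lookup table stands in for the child's model of the adult. A table like this cannot distinguish a genuine theory of mind from a learned correlation; it only shows where such a model would be consulted.

```python
# The child's assumed internal model of the adult: which action the
# adult is expected to take in response to each utterance.
# (Hypothetical entries, chosen to match the examples in the lecture.)
ADULT_MODEL = {
    "more": "give_food",
    "no more": "stop_feeding",
    "up": "pick_up_child",
    "there": "look_where_child_looks",
}


def predict_adult_action(utterance):
    """Run the internal model: what would the adult do on hearing this?"""
    return ADULT_MODEL.get(utterance, "no_response")


def choose_speech_act(desired_action, vocabulary):
    """Pick the first known word whose predicted effect matches the desire."""
    for utterance in vocabulary:
        if predict_adult_action(utterance) == desired_action:
            return utterance
    return None  # no word in the vocabulary is expected to work


vocabulary = ["up", "more", "there"]
chosen = choose_speech_act("give_food", vocabulary)
```

The point of the design is that the utterance is selected by its predicted effect on the hearer, not by a direct association between the word and the child's own state.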

The proposal for doing this kind of internal simulation involves X-schemas. Think about the Narayanan system in which there was a source domain which had a model of walking and stumbling. The source domain is a real-time inferential structure which is claimed to be used to understand abstract stories. The X-schemas could drive the muscles or could just be used to simulate what would happen in some event. To understand the embodied meaning of abstract terms you need the ability to simulate your own or others' actions. In Narayanan's system, the f-struct was updated as a result of the X-schema executing.

If X-schemas can simulate your own actions, then it seems likely that they may be used to simulate the actions of others. The drop schema above shows how X-schemas might be used to model the actions of the world. We are able to anticipate the physical results of actions. The idea is that there is a world state, as there was in Narayanan's system, and there is the X-schema for dropping. If you drop something, you personally only let go, then gravity takes over. To model gravity, you have to understand the world's actions. An X-schema such as Fall would model the action of gravity. And the anticipation that a dropped object will fall can be done through X-schema simulation. This shows up in language everywhere: "the pull of gravity," the "blowing of the wind," etc. So this ability to model the actions of the world is basic and probably an ability kids have even at the early age of first words.
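The drop example can be sketched as two small state-update functions, one for the agent's part and one for the world's part. This is a minimal sketch under assumptions, not Narayanan's implementation: the world state is a plain dictionary with hypothetical keys "held" and "height", and drop_schema, fall_schema and simulate are illustrative names.

```python
def drop_schema(world):
    """The agent's part of dropping: just let go."""
    new = dict(world)  # copy so the original state is not modified
    new["held"] = False
    return new


def fall_schema(world):
    """The world's part: an unheld object above the ground falls to it."""
    new = dict(world)
    if not new["held"] and new["height"] > 0:
        new["height"] = 0
    return new


def simulate(world, schemas):
    """Run the schemas in order on a copy of the world state."""
    for schema in schemas:
        world = schema(world)
    return world


start = {"held": True, "height": 1}
predicted = simulate(start, [drop_schema, fall_schema])
```

Running the simulation predicts that the object ends up on the ground without anything actually being dropped, which is the kind of anticipation described above. A held object passed to fall_schema alone is left where it is.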

If X-schemas can be used to model your own actions and the actions of the physical world, it isn't such a big leap to have the ability to model other people, other things which are like you. In this slide above, there is an f-struct representing the child's inner state and an X-schema representing that the child is eating and is full and wants to stop eating. The child may also have a belief that the mother has a general controller schema just like the child and that the mother will continue feeding until she gets a signal to stop. Much earlier than language, kids are able to reject food using certain gestures. Saying 'no more' is a verbal refusal gesture, but in order to use it, the child has to have the model of self, the model of the mother and a speech act model which represents the knowledge that the words 'no more' will send the right signal to the mother and get her to stop feeding. A speech act is not qualitatively different than any other act, but for speech acts you need some model of communicating with another being which is like you. An important part of this is that you basically think of other people as projections of yourself. Also there is the notion that the X-schemas can execute with respect to different feature structures, for instance, with respect to the muscles, an abstract domain (such as economics), the physical world, or another person and their beliefs. One of the nice things about X-schemas is that they can be used either in action or recognition mode. If an action is represented as an X-schema, then, given an action, a program can match it to the best-fitting stored X-schema.
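The two modes can be illustrated with a toy sketch. It assumes, purely for the example, that each stored schema is a flat list of step names; real X-schemas are richer than this, and STORED_SCHEMAS, execute and recognize are hypothetical names. The similarity score comes from Python's standard difflib module and is a stand-in for whatever matching a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical stored schemas, each reduced to an ordered list of steps.
STORED_SCHEMAS = {
    "drop": ["grasp", "release"],
    "throw": ["grasp", "swing_arm", "release"],
    "push": ["contact", "apply_force"],
}


def execute(name, schemas=STORED_SCHEMAS):
    """Action mode: produce the steps of the named schema."""
    return list(schemas[name])


def recognize(observed, schemas=STORED_SCHEMAS):
    """Recognition mode: return the stored schema that best fits the steps."""
    def score(steps):
        # Ratio in [0, 1]; 1.0 means the step sequences are identical.
        return SequenceMatcher(None, observed, steps).ratio()
    return max(schemas, key=lambda name: score(schemas[name]))


seen = ["grasp", "swing_arm", "release"]
label = recognize(seen)
```

The same stored structure serves both directions: execute reads it out as a plan, and recognize compares an observed sequence against it.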

How many of the words from the list could be done if you have the notion of simulating the other person and developing communication acts which can become speech acts by the time the child is about 2 years old? The claim is that all of the words could be done this way.

For the word 'there' the child tries to get the mother's attention and get her to look at what the child is looking at. Pointing is one way to do this and children learn to do this. Saying 'there' is a speech act which is intended to do the same thing. All of these first words start with some primitive purpose which the kid wants to accomplish and the kid learns some gestural or physical way of doing that. And the kid eventually finds a verbal way to do it or help do it.

The slide above shows how it may be possible to combine Regier's story with the Bailey and Narayanan story. The left of the slide has the components of Bailey's network. Regier's network detects external features, such as seeing things in contact. Such features as inclusion, crossing, etc. would be detected by the visual system and provide additional features to verb-learn. The critical point is that you need to do imagery, to imagine what something would be like. That would have to fit in the X-schema situation in such a way that it would build the mental image and then the visual system perceives the mental image. That is in fact what happens based on fMRI studies, etc. The next step in the research is to build a system which has all of these things integrated.

Any action requires a model of the object being acted on. For instance, when you pick up a child, you have to have a good model of the child in order to pick it up without hurting or dropping it or throwing it in the air. You always have to have a model of the other, whether the other is a thing, a person, etc. The claim is that the same X-schemas which can give you a model of moving your arm, etc. can be used to be a model of the other that's interactive. If you have this you can explain such things as why all languages have passive sentences. "Harry picked up the child" and "the child was picked up by Harry" have different meanings. In the passive sense you focus on the model of the child. You can see this in American Sign Language which has one sign for the active and one sign for the passive which takes the perspective of the child. In Nicaraguan Sign Language, every transitive verb has to portray both the action of the actor and the experience of the patient. The idea is that you are always modeling the other interactively. The beauty of the X-schemas is that we can begin to see what such a model might be.

Question about temporal binding from 4-29 lecture: How does temporal binding create the perception that objects have color, for instance that a chair is blue? The answer is that it doesn't. When you see a blue chair, one of the things which is happening is that certain neurons are firing in sync, but the theory doesn't tell you how you get the conscious experience that these features of chair and blue belong together. A theory of consciousness is needed for this, and there is no current neural theory for consciousness.