Book Recollection: Speech and Vision: Different Roles

Speech and vision are the two principal ways we have used to interact with other people and the world around us for thousands of years. And since vision occupies so much more of the human brain than speech, we may be tempted to declare it the queen of human-machine communication. That would be an easy – but deceptive – conclusion. Vision and speech do not serve the same natural roles in human communication.

Being Greek, I can still hold a “conversation” in Athens through a car window, using only gestures and grimaces – one clockwise rotation of the wrist means “how are you,” while an oscillating motion of the right hand around the index finger with a palm extended and sides of mouth drawn downward means “so, so.” A sign language like American Sign Language works even better. But when speech is possible, it invariably takes over as the preferred mode.

If we take a closer look, we see a puzzling asymmetry: We use speech equally for two-way communication. But vision is used mostly one-way – for taking in information – and only secondarily for generating visual cues that reinforce spoken communication. (Visual communication would have been a two-way proposition, too, if we were born with built-in display monitors on our chests.) Why this difference? Perhaps the one-way power of vision was nature or God’s way of ensuring survival in a world of man-eating animals, useful and useless objects, lush valleys and dangerous ravines, where maximum “information in” was essential.

But then, why didn’t nature or God make speech just as powerfully a one-way capability as vision? I’ll venture that speaking and listening were meant for a different purpose – intercommunication, where, unlike survival, a two-way capability was essential. And since survival was more important than chatting, the lion’s share of the human brain was dedicated to seeing.

These conclusions run against the common wisdom, especially among technologists, that for human-machine communication, “vision is just like speech, only more powerful.” Not so. These two serve different roles in our natural selves, which we should imitate in human-centric computing. Spoken dialogue should be the primary approach for exchanges between people and machines, and vision should be the primary approach for human perception of information from the machine.

We can imagine situations where a visual human-machine dialogue would be preferable; for example, in learning by machine to ski or juggle. But in human-centered systems, we are interested in human-machine intercommunication across the full gamut of human interests, where, as the telephone has demonstrated, speech-only exchanges go a long way. (Might these basic differences between speech and vision have contributed to the lack of success of various “video-phones”?)

If we can combine speech and vision in communicating with our machines, as we do in our interactions with other people, we’ll be even better off. Such a blending is beginning to happen in the research laboratories. But it’s not easy to do, since the technologies for speech and vision are in different stages of development. Nor is the obvious and natural wish to combine them in human-centered computers reason enough to ignore their different roles.

Excerpt from “The Unfinished Revolution” by Michael L. Dertouzos.

12 October 2007