Leonhard Fessler / Version 0
Machine Listening
The engineer’s secret of success is to empathize with the machine. In artificial intelligence, the machine is supposed to do something that counts as “intelligent” as long as it is humans who do it. Empathizing with such a machine comes down to pretending to be stupid.
Suppose your machine is a speech recognition system: a listening machine that tries to convert spoken language into written text, mimicking an audio typist. To understand such a machine, you have to get rid of the smart tricks your perceptual apparatus plays on you and focus on the pure audio input the machine gets.
Humans perceive an utterance immediately as a string of words, attach meaning to it, and grasp its emotional content; they just understand. To a machine, of course, there are no such things: no meaning, no emotion, no words. There is just a stream of sound, interrupted by some noise when the speaker takes a breath or hesitates. Imagine listening to a language you do not understand at all: you have no clue where the word boundaries are.
Interesting things happen phonetically when words are strung together: if you hear the words “cheap part” spoken as a phrase, the sound will likely contain only one [p], although you think you hear two. What happened? One of the colliding [p]s is omitted. However, you can still audibly distinguish the phrase from “cheap art”. The delicate difference in sound turns out to be a short pause between “chea(p)” and “part”. A few milliseconds of silence may convey meaning.
In fact, this plosive consonant [p] is nothing but a burst of noise energy: air pressure built up in your mouth is released by opening the lips, reverberates through the vocal tract, and is filtered by the nasal and oral cavities. Its sound depends heavily on the following vowel: the tongue is already in the right place to produce that vowel, and the lips open accordingly; both movements affect the spectral filtering of this [p]. The [p] in “part” sounds different from the [p] in “piece”, but humans will just perceive a [p], as long as their language does not require further distinction. This is why speakers of a language without an [l]/[r] contrast, such as Japanese, tend to confuse the two sounds: they do not hear a difference, because they do not need to.
So the engineer tries to avoid these and other phenomena of human speech perception: he takes sound out of its context, focuses on its texture only, and stops thinking about what it means, why it is there, or what made it. It gives him a narcotic calmness, a relaxing stupidity, a freedom of mind – and it helps him understand why the machine does not work.