This installment continues the exploration of the development of speech synthesis. So far I've investigated the invention of the Vocoder and how it was used in the SIGSALY program in WWII. In this episode I explore the other side of the speech synthesis coin, speech recognition. Without the ability for machines to recognize speech on the one hand and the ability to synthesize it on the other, the wunderkinder of today's consumer electronics, Siri, Dragon and Alexa, would not be possible. With both in place humans can now speak, and sometimes yell with exasperation, to a wide range of interconnected devices, and our smartphones and Echo Dots will speak back to us. As developments in Artificial Intelligence take off, the little computer in your pocket may soon speak up for itself and yell back.
In a way it could be said that speech recognition systems began in the 19th century when sound waves were first converted into electrical signals. By 1932 Harvey Fletcher was researching the science of speech perception at that temple of telecommunications, Bell Laboratories. His contributions in this area showed that the features of speech are spread over a wide frequency range. He also developed the articulation index to quantify the quality of a speech channel. Articulation indexes are used in measuring the effectiveness of hearing aids and in industrial settings. Fletcher is credited with the invention of an early electronic hearing aid, and is notable for overseeing the creation of the first stereophonic recordings and live stereo sound transmissions, for which he was dubbed the "father of stereophonic sound".
Interest in speech recognition didn't end with Fletcher. In 1952, over half a century before Siri or Alexa could respond to a voiced question of where to find the best noodle shop in town (or when the end of the world will be), AUDREY was on the scene. She derived her name from her special power: Automatic Digit Recognition. She was a collection of circuits capable of perceiving numbers spoken into an ordinary telephone. Due to the technological limits of the time she could only recognize the spoken digits "0" through "9". When the digits were uttered into a mic on the handset, AUDREY would respond by illuminating a corresponding bulb on the front panel of the device. It sounds simple, but this marvel was only achieved after overcoming steep technical hurdles.
S. Balashek, R. Biddulph, and K. H. Davis were the creators of AUDREY. One of the obstacles they faced was to craft a system capable of recognizing the same word when it is said with subtle variations. The spoken digit "7", for example, when said multiple times by even one person, is subject to slight differences. Duration, intonation, quality, volume and timing all change the sound of the word with each individual utterance. To recognize speech amidst all these variables AUDREY focused on the sound parts within the words that have the least variation. In this way the machine did not need to have an exactly spoken match. Roberto Pieraccini put it this way, saying there is less variety "across different repetitions of the same sounds and words than across different repetitions of different sounds and words."
The matches came from the features of speech known as formants. A formant is a harmonic of a note that is augmented by the resonance of the vocal tract when speaking or singing. The information that humans require to distinguish speech sounds can be represented in a spectrogram by peaks in the amplitude/frequency spectrum. AUDREY could locate the formants in the spectrum of each utterance and use them to make a match.
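To make the idea of matching on formants concrete, here is a minimal sketch in Python. It is not how AUDREY worked internally (she was an analog machine built from vacuum-tube circuits); it only illustrates the general notion of comparing an utterance's formant peaks against stored reference patterns and picking the closest one. The function name `closest_digit` and the frequency values in `REFERENCE_PATTERNS` are hypothetical placeholders chosen for illustration, not measurements.

```python
import math

# Hypothetical reference patterns: (first formant, second formant) in Hz.
# These numbers are illustrative placeholders, not data from AUDREY.
REFERENCE_PATTERNS = {
    "one": (500.0, 1100.0),
    "three": (300.0, 2300.0),
    "four": (600.0, 900.0),
}


def closest_digit(formants, references=REFERENCE_PATTERNS):
    """Return the label whose stored formant pair lies nearest to the input."""
    return min(
        references,
        key=lambda label: math.dist(formants, references[label]),
    )


# An utterance whose peaks land near, but not exactly on, a stored pattern
# is still assigned to the nearest reference.
print(closest_digit((320.0, 2250.0)))
```

The point of the sketch is that the input never has to equal a stored pattern; it only has to fall closer to the right one than to any of the others.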
AUDREY also required that there be pauses between words. She couldn't isolate or separate individual words when said in a string. In addition, designated talkers had to be assigned, talkers who could produce the specific formants; otherwise she might not recognize a digit. For each speaker the reference patterns of the formants, drawn electronically and stored within her memory, had to be fine-tuned. Yet despite all the limitations around her use, the researchers proved that building a machine capable of recognizing human speech wasn't a pipe dream.
AUDREY was expensive because she was state of the art and all analog. The six-foot high relay rack she kept occupied with all her vacuum-tube circuitry required a lot of upkeep. And she drew a lot of power that really hiked up the electric bill. The invention never really went anywhere in terms of being used as a tool in Ma Bell's vast monopoly. It could have been used by toll operators or wealthy customers of the telephone to voice dial, but manual dialing was simple, fast, and cheap.
Creating a system that had uniform recognition of words as uttered by multiple people was a dream that had to be fulfilled by other researchers down the line. They built on the sweat equity and foundation of those who went before. The fact that a machine can be made to decipher strange human vocalizations at all is sheer wonder. While others may be fond of Siri, Dragon and Alexa it is AUDREY who will always remain in my heart.
The Voice in the Machine: Building Computers That Understand Speech by Roberto Pieraccini, MIT Press, 2012