Linear Predictive Coding
The elements of Linear Predictive Coding (LPC) were built on some of Norbert Wiener’s work from the 1940s, when he developed a mathematical theory for calculating the optimal filters for finding signals in noise. Claude Shannon quickly followed Wiener with his breakthrough work A Mathematical Theory of Communication, which included a general theory of coding. [For more on Wiener and Shannon see Chapter 3.] With new mathematical tools in hand, researchers started exploring predictive coding. Linear prediction is a form of signal estimation, and it was soon applied to speech analysis.
In signal processing, communications and related fields the term “coding” generally means putting a signal into a format that makes a given task easier to handle. In a coding scheme, Morse code for instance, an encoder takes the signal and puts it into a new format. The decoder takes it out of the new format and puts it back into the old one.
The “predictive” aspect of coding has been used in numerous scientific theories and engineering techniques. What they have in common is that they predict future observations based on past observations. The joined term “predictive coding” was coined by information theorist Peter Elias in 1955 in his two papers on the subject.
In LPC, each sample of a signal is predicted using a linear function of the previous samples. In math, a linear function is one whose variables appear without exponents, so that it graphs as a straight line; here it means the prediction is a weighted sum of past samples. This works with speech because nearby samples correlate with each other to a high degree. The error between a predicted sample and the actual sample is transmitted along with the coefficients, the weights of that sum. If the prediction is good the error will be small and take up less bandwidth than the signal itself. In this sense, LPC becomes a type of compression based on source coding.
Towards the end of the 1960s, Fumitada Itakura on one side and Bishnu S. Atal and Manfred Schroeder on the other independently discovered the elements of LPC, a case of simultaneous invention like those of the telegraph and the telephone. Later, Paul Lansky applied it to make delightful music exploring the spectrum between music and speech.
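The idea of predicting each sample from a weighted sum of earlier ones can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the test frame is a made-up pure tone rather than real speech, the function names are hypothetical, and the Levinson-Durbin recursion used to find the weights is one common textbook method, not necessarily the one any of the researchers discussed here used.

```python
import math


def autocorrelation(x, max_lag):
    """r[k] = sum of x[n] * x[n - k] over the frame, for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Find weights a[1..order] so that x[n] is approximated by
    a[1]*x[n-1] + ... + a[order]*x[n-order] with least squared error."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k                 # model's remaining error energy
    return a[1:], err


def prediction_error(x, coeffs):
    """Difference between each actual sample and its prediction."""
    p = len(coeffs)
    return [x[n] - sum(coeffs[k] * x[n - 1 - k] for k in range(p))
            for n in range(p, len(x))]


# Hypothetical test frame (an assumption): a pure tone, 20 samples per cycle.
frame = [math.sin(2 * math.pi * n / 20) for n in range(400)]
coeffs, _ = levinson_durbin(autocorrelation(frame, 2), order=2)
residual = prediction_error(frame, coeffs)

signal_energy = sum(s * s for s in frame)
residual_energy = sum(e * e for e in residual)
print("coefficients:", [round(c, 3) for c in coeffs])
print("residual / signal energy:", residual_energy / signal_energy)
```

For a smooth, highly correlated frame like this one the error carries only a tiny fraction of the signal's energy, which is why sending the weights plus the error can be cheaper than sending the samples themselves.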
Fumitada Itakura was interested in math and radio from an early age, and he had been an amateur radio operator in his youth. His elementary school happened to be just a mile from the radio laboratory at Nagoya University where his father knew some of the professors, so he had occasion to visit it and ask questions.
As an undergraduate he became interested in the theoretical side of math and started to learn about stochastic processes. As he extended his ability ever further, he eventually became involved in the mathematical aspect of signal processing. His research paper for his bachelor's degree in electrical communication was on the statistical analysis of whistlers, very low frequency electromagnetic radio waves produced by lightning and capable of being heard as audio on radio receivers. To study them he built a bank of analog filters to do the signal processing, and made digital circuits to try to find patterns in the time-frequency structure of the whistlers. It wasn’t easy work, but he persevered. In analyzing the whistler signal he had to work on filtering out a lot of the other noisy material that comes in from the magneto-ionosphere. The work required him to use band-pass filters and the sound spectrograph that had originally been designed for speech analysis.
This eventually led to further work with statistics and audio. When he went to graduate school he studied applied mathematics under Professor Kanehisa Udagawa. At Udagawa’s lab he became a part of a group studying pattern recognition, and in 1963 he started a project to recognize handwritten characters. When Professor Udagawa died of a heart attack, Itakura had to find someone else to study under to continue his course. This led him to work at NTT.
Dr. Shuzo Saito had been a graduate of Nagoya University and was looking for someone to work with in speech research. Saito’s friend Professor Teruo Fukumura suggested Itakura. Saito had an interest in speech recognition and encouraged Itakura to get involved. Fukumura began teaching him the basic principles of speech using Gunnar Fant's Acoustic Theory of Speech Production. Itakura started making sound spectrograms of his voice speaking vowels. His voice was high and husky, so it didn’t make as clean a spectrogram as someone with a regular voice would have. In this there was a hidden gift. He realized that if they could do good analysis on a signal that had more random characteristics, they could do even better when analyzing regular speech. From this point, he went on to apply statistics to speech classification, based on a paper he had read by J. Hajek. Reading math papers had been a hobby of his, and it led to his work on Linear Predictive Coding.
Dr. Saito suggested to Itakura that he look for practical results based on his theory, so he started working with a vocoder and got some initial results on his idea, and wanted to go further. Dr. Saito suggested he look at pitch detection, as vocoders often had trouble recognizing voices because of their poor ability in this area. He conceived of a new method of pitch detection that used an inverse filter and oscillation. From this he proposed integrating the linear predictive analysis with his new pitch detection method to create a new vocoder system. In late 1967 he succeeded in synthesizing speech from the vocoder and brought the results to Dr. Saito. From then on Itakura has worked on vocoding.
Of the many ways in which speech sounds are produced, the voicing of vowels is especially important. It relies on the periodic opening and closing of the vocal cords, which convert air from the lungs into a wideband signal rich in harmonics. This signal resonates in the vocal cavities before leaving the mouth, where the final sounds are shaped.
This speech signal gets analyzed: the formants, the resonances of the vocal tract, are estimated and their effect removed in a process called inverse filtering. What is left over, called the buzz, is also estimated. The signal that remains after the buzz is subtracted is called the residue. Numbers which represent the formants, the buzz and the residue can be stored or transmitted elsewhere. The speech is then synthesized through a reversal of the original stripping process. The parameters of the buzz and residue are used to create a source signal, and the formant information is used to build a filter that the source signal is run through. The process is done in short chunks of time.
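The stripping and rebuilding described above can be sketched as a pair of mirror-image filters. This is a minimal illustration under assumptions, not a working vocoder: the filter weights and the test frame below are hypothetical values chosen for the example, the function names are invented, and a real system would estimate fresh weights for every short chunk of speech.

```python
import math


def inverse_filter(signal, coeffs):
    """Strip the predictable part: e[n] = x[n] - sum(a[k] * x[n-k])."""
    out = []
    for n, x in enumerate(signal):
        predicted = sum(a * signal[n - k]
                        for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(x - predicted)
    return out


def synthesis_filter(excitation, coeffs):
    """Reverse the stripping: y[n] = e[n] + sum(a[k] * y[n-k])."""
    out = []
    for n, e in enumerate(excitation):
        predicted = sum(a * out[n - k]
                        for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + predicted)
    return out


# Hypothetical filter weights standing in for one chunk's formant estimate.
coeffs = [1.3, -0.8]
# Hypothetical "speech" chunk: a decaying tone.
frame = [math.exp(-n / 60) * math.sin(2 * math.pi * n / 25) for n in range(200)]

residue = inverse_filter(frame, coeffs)        # what is left after stripping
rebuilt = synthesis_filter(residue, coeffs)    # the stripping run in reverse
worst = max(abs(a - b) for a, b in zip(frame, rebuilt))
print("largest reconstruction error:", worst)
```

When the leftover signal is kept in full, the rebuilt chunk matches the original up to floating-point rounding. The bandwidth savings come from replacing that leftover signal with a handful of numbers describing it.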
Taking speech apart and putting it together on the other end was a huge technical feat that saves tons of bandwidth. Speech synthesis could fit five calls onto the same channel that regular voice took up with one.
Manfred Schroeder and Bishnu S. Atal
At Bell Labs he met up with Manfred Schroeder who had come from Germany. Schroeder was born in 1926 and came of age during WWII. During the war Schroeder had built a secret radio transmitter that spooked his parents. Transmitting radio was risky business because it was the province of spies and people who wanted to communicate outside the country. When Schroeder saw members of the army or SS outside his house with radio direction finding equipment, he shut off the transmitter for a month. He also listened to the BBC for news, and to the American Forces Network transmitting from England, both of which were then illegal to listen to. Many people had been sent to concentration camps just for listening to foreign stations and spreading news to others. The Nazi powers attempted to keep tight control on all information going in and out of the country. A special radio was even manufactured by the state, the People's Radio or Volksempfänger, that was built in such a way that it could only receive approved German stations whose programs were under the directorship of Joseph Goebbels.
He excelled at school and was often ahead of even the teachers, and during the war was drafted to a radar team to track incoming aircraft flights and do other work, where he gained extensive experience with the technology.
Schroeder was also a math fanatic, like Itakura was, and when he did go to university, always took extra math classes on the side of his physics work. He had been fascinated by crypto math and he loaded up on function theory and probability classes. Eventually Schroeder got a job offer from Bell Labs in 1954, based on previous work he had done experimenting with microwaves and he emigrated to the United States.
Bell Labs wanted him to continue his research with microwaves, but he thought he’d switch gears and get into the study of speech instead. For two years he worked on speech synthesizers, and didn’t have much luck in getting them to sound good, so then turned his attention to speakers and room acoustics. Many researchers who were following the dictates of their own curiosity and inclination were left alone to pursue their studies, and see what came out of them and where it took them.
John Pierce at Bell Labs wanted Schroeder to use Dudley’s vocoding principles to send high fidelity voice calls over the phone system. This caused Schroeder to hit up against the same issue as Itakura had, the problem of pitch. Part of the issue was extracting the fundamental frequencies from telephone lines not known for superb sound quality. As Schroeder investigated he realized he could take the baseband signal, or those frequencies that have not been modulated, and distort it non-linearly to generate frequencies that the vocoder would then give the right amplitude. This ended up being a success. It became known as voice-excited vocoding, and the speech that came out of the other end was the most human sounding of any speech synthesis up to that point.
In 1961 Schroeder hired Dr. Bishnu S. Atal to work with him at Bell Labs. Atal was born in 1933 in Kanpur, Uttar Pradesh, India. He studied physics at the University of Lucknow and received his degree in electrical communications engineering from the Institute of Science in Bangalore, India in 1955, before coming to America to study for his Ph.D at the Brooklyn Polytechnic Institute. He returned to his home country to lecture on acoustics from 1957 to 1960 before he was lured back to the U.S. by Schroeder to join him in his investigations in speech and acoustics.
In 1967 Schroeder was pacing around the Lab with Atal, and they were conversing about the need to do more with vocoder speech quality. Schroeder’s work on pitch had improved the quality of vocoding, but it wasn’t yet what it could be. What they needed to do, they realized as they talked, was to code speech so that no errors were present. As they talked the idea of predictive coding came up.
They realized that as speech was encoded they could predict the next samples of speech based on what had just come before. The prediction would be compared with the actual speech, and the errors, or residuals, would be transmitted alongside it. In decoding, the same algorithm would be used to reconstruct the speech on the other end of the transmission. Schroeder and Atal called this adaptive predictive coding, with the name later changed to linear predictive coding. The quality of speech was as good as that which came out of Schroeder’s voice-excited vocoder. They wrote a paper on the subject for the Bell System Technical Journal and presented on it at a conference in 1967, the same year Itakura succeeded with his technique.
Since the 1970s most of the technology around speech synthesis and coding has been focused on LPC, and it is now the most widely used form. When it first came out the NSA were among the first to get their paws on it, because LPC can be used for secure wireless, with a digitized and encrypted voice sent over a narrow channel. An early example of this is Navajo I, a telephone built into a briefcase to be used by government agents. About 110 of these were produced in the early 1980s. Several other vocoder systems were used by the NSA for the purpose of encryption.
LPC has become essential for cellphones, and is part of the Global System for Mobile Communications (GSM) standard for cellular networks. GSM uses a variety of voice codecs that implement the technology to put 3.1 kHz of audio into between 6.5 and 13 kbit/s of transmission. LPC is also used in Voice over IP, or VoIP, such as is used on Skype and Zoom calls and meetings.
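To put those bit rates in perspective, here is a rough back-of-the-envelope comparison. The 64 kbit/s baseline (8,000 samples per second at 8 bits each, the usual figure for an uncompressed digital telephone channel) is an assumption added for illustration and does not come from the text; the 13 and 6.5 kbit/s rates are the GSM figures quoted above.

```python
# Assumed baseline (not from the text): uncompressed telephone speech,
# 8,000 samples per second at 8 bits per sample.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
uncompressed_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE   # 64,000 bit/s

# Codec rates quoted in the text, in bit/s.
gsm_rates_bps = {"13 kbit/s": 13000, "6.5 kbit/s": 6500}

# How many coded calls fit in the space of one uncompressed call.
ratios = {name: uncompressed_bps / rate for name, rate in gsm_rates_bps.items()}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.1f} coded calls per uncompressed channel")
```

Under that assumption the 13 kbit/s rate works out to a little under five coded calls in the space of one uncompressed call, roughly in line with the five-calls-per-channel figure mentioned earlier.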
A 10th order derivative of LPC was used in the popular 1980s Speak & Spell educational toy. The toys became popular with experimental musicians to hack in a process known as circuit bending, where the toy is taken apart and the connections re-soldered to make sounds not originally intended by the manufacturers. [For more on Ghazala and circuit bending, see chapter 7.]
Vocoding technology is also utilized in the Digital Mobile Radio (DMR) units that are currently gaining popularity among hams around the world. DMR is an open digital mobile radio standard. DMR radios use a proprietary AMBE+2 vocoder that works with multi-band excitation for its speech coding and compression to achieve a 6.2 kHz bandwidth. Again the compression and the digital codecs often produce sound artifacts and glitches while talking. Besides its use in DMR, the AMBE+2 is also used in D-Star, Iridium satellite telephone systems, and OpenSky trunked radio systems.
Paul Lansky: notjustmoreidlechatter
Since LPC separates pitch from speed, so that the pitch contours of the speech can be altered independently of how fast it is spoken, it can also be used by the creative thinker for musical composition. Paul Lansky was one such thinker, and he used LPC to great effect in a series of compositions exploring synthesis and the qualities of speech.
Paul Lansky was born in 1944 in New York and counted George Perle and Milton Babbitt among his teachers. Lansky got his Ph.D. in music from Princeton in 1973. Like many others of his generation, Lansky started off being trained in the school of serialism. His teacher Perle had developed an iconoclastic twelve tone modal system, and Lansky used this to write a piece. For his dissertation he continued to explore Perle’s methodology and used linear algebra as a way to create a model of his teacher’s system. His interest then extended to take in electronics and computers as a way of exploring the mathematical possibilities inherent within serialism.
His first foray into electronic composition was Mild und Leise from 1973. Proper old school, it was composed using a series of punch cards. Learning the mechanics of the system to achieve his desired outcome was as much a part of the procedure as the composition. For it he used the Music360 computer language written by Barry Vercoe on an IBM 360/91. The output from the computer went to a 1600 BPI digital tape, which had to be carried over to a basement lab in the engineering quadrangle at Princeton to be heard. It used FM synthesis, which had just been worked out at Stanford. [For more on FM synthesis see Chapter 4.] The harmonic language came from Perle’s system. The result is very emotionally resonant pure electronic music. Lansky has always been keen to foreground the music in front of the technology used to make it, and that is true here. The piece was later sampled by Radiohead in their song Idioteque on their Kid A album.
1979 saw Lansky beginning to work with LPC as a part of his computer music programming practice, and it was put to use in a series of compositions starting with Six Fantasies on a Poem by Thomas Campion. Musical techniques derived from Linear Predictive Coding had been pioneered by James Moorer at Stanford University in the 1970s.
Lansky’s wife Hannah McKay reads the poem, and LPC techniques and a variety of processing and filtering methods are used to alter and transform the reading in fabulous ways.
In his notes to the recording of Six Fantasies, he writes about how it has become common to view speech and song as distinct categories. Lansky thought that “they are more usefully thought of as occupying opposite ends of a spectrum, encompassing a wealth of musical potential. This fact has certainly not been lost on musicians: sprechstimme, melodrama, recitative, rap, blues, etc., are all evidence that it is a lively domain.”
Thomas Campion as composer and poet became an archetype emblematic of the “musical spectrum spanned by speech and song.” The poem Lansky used was Campion’s Rose cheekt Lawra, which was embedded within his 1602 treatise Observations in the Art of English Poesie. Here Campion offered his attempt at a quantitative model for English poetry, where meter is determined by the quantity of vowels rather than by rhythm, as was done in ancient Latin and Greek poetry. Lansky describes the poem as a “wonderful, free-wheeling spin about the vowel box. It is almost as if he is playing vowels the way one would play a musical instrument, jumping here and there, dancing around with dazzling invention and brilliance, carefully balancing repetition and variation. The poem itself is about Petrarch's beloved Laura, whose beauty expresses an implicit and heavenly music, in contrast to the imperfect, all too explicit earthly music we must resign ourselves to make. This seemed to be an appropriate metaphor for the piece.”
Lansky continued to explore the continuum between speech and song with his pieces Idle Chatter, just_more_idle_chatter, and Notjustmoreidlechatter. Though clearly connected by theme, they are not a suite, but independent works. Idle Chatter from 1985 again uses his wife as vocalist and the IBM 3081 as the means of transforming her voice, with a mix of LPC, stochastic mixing, and granular synthesis and a bit of help from the computer music language Cmix. If you like glossolalia, and if you ever wanted to try to hear what it sounded like at the Tower of Babel, these recordings are an opportunity.
Of Idle Chatter, Lansky wrote, “The incoherent babble of Idle Chatter is really a pretext to create a complicated piece in which you think you can ‘parse the data’, but are constantly surprised and confused. The texture is designed to make it seem as if the words, rhythms and harmonies are understandable, but what results, I think, is a musical surface with a lot of places around which your ear can dance while you vainly try to figure out what is going on. In the end I hope a good time is had by all (and that your ears learn to enjoy dancing).”
People had a strong reaction to the piece, and in response to their reaction, Lansky wrote just_more_idle_chatter in 1987. He gave the digital background singers more of a role in the piece, but the words still only approach intelligibility and never really reach a stage where the listener can comprehend what is being said, only that something is being spoken. The next year saw his “stubborn refusal to let a good idea alone” with the realization of Notjustmoreidlechatter. Here again the chatter almost becomes something that can be discerned as a word before slipping back down into the primordial soup of linguistic babble. The last two of these pieces were made using the DEC MicroVAX II computer.
Over time, though Lansky wrote many more computer music pieces, and settings for traditional instrumentation, he couldn’t let the words just be. For the pieces on his Alphabet Book album he conducted further investigations in a magisterial reflection on the building blocks of thought: the alphanumerics, the letters and numbers, that allow for communication, the building up of knowledge, and contemplation.
.:. .:. .:.
Read the rest of the Radiophonic Laboratory: Telecommunications, Electronic Music, and the Voice of the Ether.
Fumitada Itakura, an oral history conducted in 1997 by Frederik Nebeker, IEEE History Center, Piscataway, NJ, USA.
Manfred Schroeder, an oral history conducted in 1994 by Frederik Nebeker, IEEE History Center, Piscataway, NJ, USA.
Charles Dodge: Speech Songs
Charles Dodge was another early computer musician who got in on the speech synthesis game. Born in Iowa in 1942, he was in his early twenties when he first became interested in the possibilities of computer music. As a graduate student at Columbia University he studied composition under Richard Hervig, Chou Wen-chung, and the electronic musician Otto Luening. When he met Godfrey Winham at Princeton University, he began to think seriously about composing his own works with computers. Winham was an influential music theorist whose wife, the singer Bethany Beardslee, was the voice for much new music, including Milton Babbitt’s Philomel.
In the sixties Bell Labs was one of the very few places where computer music was being made, and one of the few places to go to hear how it sounded. Max Mathews encouraged musicians who were making music on university computers to come to Bell Labs to convert it into sound, in the evening after the primary work at the Labs was finished. Charles Dodge was one of these composers, and when he came to listen to his work he became mesmerized by the fascinating sounds of the speech research going on down the hall, and often thought they were more interesting than the sounds he’d created using the computer.
In the early 70s he had the opportunity to create some new works at Bell Labs with access to programs written by Dr. Joseph Olive for speech synthesis. Olive was a leading researcher in the area of text-to-speech. Olive was one of those people who had an intense mathematical mind. He had received a physics PhD from the University of Chicago, but he was also interested in music.
With help from Olive and some poems written and given to him by his friend Mark Strand, Dodge went about creating Speech Songs. He writes, “I'd never been able to write very effective vocal music and here was an opportunity to make music with words. I was really attracted to that. It wasn't singing in the usual sense. It was making music out of the nature of speech itself. With the early speech-synthesis computers, you could do two things: you could make the voice go faster or slower than the speed in which it was recorded at the same pitch or you could shift the pitch independent of the speech rhythm. That was a kind of transformation that you couldn't make in the usual way of making tape music. It was fascinating to put my hands on two ways of modifying sound that were completely, newly available.”
To synthesize the electronic voices for the poems he used a technique called speech-by-analysis. Only words that had first been put into the computer using an analog-to-digital converter could be synthesized. The recorded speech is analyzed by the computer to pull out the various parameters from the spoken word in short segments. Then speech can be recreated by the artificial voice using the same parameters as had been analyzed. For musical purposes, though, those parameters can be altered to change aspects of the sound, such as shifting the pitch contour of a phrase or word into a melodic line. Changing the speed without altering the pitch is another possibility. Formants and resonance are other aspects that can be changed by the programmer-composer.
The poems themselves are humorous and surrealistic, and the way the artificial voice reads them adds to the effect. Dodge was specifically interested in humor, because as he wrote in the liner notes, “Laughter at new music concerts, especially in New York, is rare these days.” He was delighted when audience members laughed at his creation. For a type of music that is so often cerebral and conceptual, it’s good when some belly laughs can be had.
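Why pitch and speed come apart so easily can be seen in a toy resynthesizer. Once a stretch of speech has been reduced to per-segment parameters, the pitch is just the spacing of the pulses fed into the filter, and the speed is just how many samples each segment is allowed to last. The sketch below is an illustration under assumptions, not a reconstruction of the Bell Labs programs: the parameter values and function names are hypothetical, every segment is treated as voiced, and the filter memory and pulse timing are reset at each segment boundary for simplicity.

```python
def buzz(num_samples, period):
    """Impulse train: one pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def all_pole(excitation, coeffs):
    """Shape the pulses with a resonant filter: y[n] = e[n] + sum(a[k]*y[n-k])."""
    out = []
    for n, e in enumerate(excitation):
        fed_back = sum(a * out[n - k]
                       for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + fed_back)
    return out


def resynthesize(segments, segment_len, pitch_scale=1.0, speed=1.0):
    """segments: list of (pitch_period_in_samples, filter_coeffs) pairs.
    pitch_scale changes pulse spacing only; speed changes duration only."""
    out_len = max(1, round(segment_len / speed))
    out = []
    for period, coeffs in segments:
        new_period = max(1, round(period / pitch_scale))
        out.extend(all_pole(buzz(out_len, new_period), coeffs))
    return out


# Hypothetical analysis data: three identical segments of 240 samples each.
segments = [(80, [1.3, -0.8])] * 3
normal = resynthesize(segments, 240)
faster = resynthesize(segments, 240, speed=2.0)         # half as long, same pitch
higher = resynthesize(segments, 240, pitch_scale=2.0)   # same length, pulses twice as often
print(len(normal), len(faster), len(higher))
```

With tape, speeding up a recording raises its pitch as well; here the two controls touch different parameters, which is the kind of transformation Dodge describes as newly available.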
Another piece on the album, The Story of Our Lives, also used techniques of speech synthesis. In this case instead of replacing the recorded human with an artificial voice, they changed the program so that it took from a bank of 64 sine tones that glissandoed at different rates. To create the effect of more than one voice being heard at a time, the different voices were mixed together on the digital computer.
Speech Songs came out in 1972, and in 1978 he made a recording of the radio play Cascando by Samuel Beckett, where the musical aspect was two computer synthesized audio channels. This was also when he founded the Center for Computer Music at CUNY’s Brooklyn College and began teaching in their graduate program. His 1970 composition Earth’s Magnetic Field will be explored in chapter 8 of this book.
.:. .:. .:.
Justin Patrick Moore
Husband. Father/Grandfather. Writer. Green wizard. Ham radio operator (KE8COY). Electronic musician. Library cataloger.