From ms20@u.washington.edu Sat Sep  3 22:48:53 1994
Date: Sat, 3 Sep 1994 19:19:44 -0700 (PDT)
From: HIgH TeCH <ms20@u.washington.edu>
To: analogue <analogue@magnus.acs.ohio-state.edu>
Subject: Talking Machines (Long!)

This is an excerpt taken from J.L.Flanagan's Speech Analysis, Synthesis,
and Perception, Second Edition  Pages 204-211

The text is reproduced as it is in the book, except where references to
illustrations were made.  

I mainly wanted to expose the readers to the history of speech synthesis 
preceding the Vocoder, so anything actually involving the Vocoder is not 
included in this text.  Don't let that discourage you.  This is good reading!

Enjoy,

Romeo Fahl
++++++++++
ms20@u.washington.edu

-------------------------------------------------------------------------


SPEECH SYNTHESIS	
----------------


	Ancient man often took his ability of speech as a symbol of divine
origin.  Not unnaturally, he sometimes ascribed the same ability to his gods. 
Pagan priests, eager to fulfill great expectations, frequently tried to make
their idols speak directly to the people.  Talking statues, miraculous
voices and oracles were well known in the Greek and Roman civilizations -
the voice usually coming to the artificial mouth via cleverly concealed
speaking tubes.  Throughout early times the capacity of "artificial speech"
to amaze, amuse and influence its listeners was remarkably well appreciated
and exploited.
	As the civilized world entered the Renaissance, scientific
curiosity developed and expanded.  Man began to inquire more seriously
into the nature of things.  Human life and physiological functions were
fair targets of study, and the physiological mechanism of speech belonged
in this sphere.  Not surprisingly, the relatively complex vocal mechanism
was often considered in terms of more tractable models.  These early models
were invariably mechanical contrivances, and some were exceedingly clever
in design.

MECHANICAL SPEAKING MACHINES: HISTORICAL EFFORTS
------------------------------------------------

	One of the earliest documented efforts at speech synthesis was by
Kratzenstein in 1779.  The Imperial Academy of St. Petersburg offered its
annual prize for explaining the physiological differences between five
vowels, and for making apparatus to produce them artificially.  As the
winning solution, Kratzenstein constructed acoustic resonators with
vibrating reeds which, in a manner analogous to the human vocal cords,
interrupted an air stream.
	A few years later (1791), Von Kempelen constructed and demonstrated
a more elaborate machine for generating connected utterances.  [Apparently
Von Kempelen's efforts antedate Kratzenstein's, since Von Kempelen
purportedly began work on his device in 1769 (Von Kempelen; Dudley and
Tarnoczy).] Although his machine received considerable publicity, it was
not taken as seriously as it should have been.  Von Kempelen had earlier
perpetrated a deception in the form of a mechanical chess-playing machine.
The main "mechanism" of the machine was a concealed, legless man - an
expert chess player.
	The speaking machine, however, was a completely legitimate device.
It used a bellows to supply air to a reed which, in turn, excited a single,
hand-varied resonator for producing voiced sounds.  Consonants, including
nasals, were simulated by four separate constricted passages, controlled by
the fingers of the other hand.  An improved version of the machine was
built from Von Kempelen's description by Sir Charles Wheatstone (of the
Wheatstone Bridge, and who is credited in Britain with the invention of the
telegraph).  
	Briefly, the device was operated in the following manner.  The
right arm rested on the main bellows and expelled air through a vibrating
reed to produce voiced sounds.  The fingers of the right hand controlled
the air passages for the fricatives /ʃ/ and /s/, as well as the "nostril"
openings and the reed on-off control.  For vowel sounds, all the passages
were closed and the reed turned on.  Control of vowel resonances was
effected with the left hand by suitably deforming the leather resonator at
the front of the device.  Unvoiced sounds were produced with the reed off,
and by a turbulent flow through a suitable passage.  In the original work,
Von Kempelen claimed that approximately 19 consonant sounds could be made
passably well.
	Von Kempelen's efforts probably had a more far-reaching influence
than is generally appreciated.  During Alexander Graham Bell's boyhood in
Edinburgh, Scotland (latter 1800's), Bell had an opportunity to see the
reproduction of Von Kempelen's machine which had been constructed by
Wheatstone.  He was greatly impressed with the device.  With stimulation
from his father (Alexander Melville Bell, an elocutionist like his own
father), and his brother Melville's assistance, Bell set out to construct a
speaking automaton of his own.
	Following their father's advice, the boys attempted to copy the
vocal organs by making a cast from a human skull and molding the vocal
parts in gutta-percha.  The lips, tongue, palate, teeth, pharynx, and
velum were represented.  The lips were a framework of wire, covered with
rubber which had been stuffed with cotton batting.  Rubber cheeks were
enclosed in the mouth cavity, and the tongue was simulated by
wooden sections - likewise covered by a rubber skin and stuffed with
batting.  The parts were actuated by levers controlled from a keyboard.  A
larynx "box" was constructed of tin and had a flexible tube for a windpipe.
A vocal cord orifice was made by stretching a slotted rubber sheet over tin
supports.
	Bell says the device could be made to say vowels and nasals and
could be manipulated to produce a few simple utterances (apparently well
enough to attract the neighbors).  It is tempting to speculate how this
boyhood interest may have been decisive in leading to U.S. patent No.
174,465, dated February 14, 1876 - describing the telephone, and which has
been perhaps one of the most valuable patents in history.
	Bell's youthful interest in speech production also led him to
experiment with his pet Skye terrier.  He taught the dog to sit up on his
hind legs and growl continuously.  At the same time, Bell manipulated the
dog's vocal tract by hand.  The dog's repertoire of sounds finally
consisted of the vowels /a/ and /u/, the diphthong /ou/ and the syllables
/ma/ and /ga/. His greatest linguistic accomplishment consisted of the
sentence, "How are you Grandmamma?" The dog apparently started taking a
"bread and butter" interest in the project and would try to talk by
himself. But on his own, he could never do better than the usual growl.
This, according to Bell, is the only foundation to the rumor that he once
taught a dog to speak.
	Interest in mechanical analogs of the vocal system continued to the
twentieth century.  Among those who developed a penetrating understanding
of the nature of human speech  was Sir Richard Paget. Besides making
accurate plaster tube models of the vocal tract, he was also adept at
simulating vocal configurations with his hands.  He could literally "talk
with his hands" by cupping them and exciting the cavities either with a
reed, or with the lips made to vibrate after the fashion of blowing a
trumpet.
	Around the same time, a different approach to artificial speech was
taken by people like Helmholtz, D.C. Miller, Stumpf, and Koenig.  Their
view was more from the point of perception than from production.  Helmholtz
synthesized vowel sounds by causing a sufficient number of tuning forks to
vibrate at selected frequencies and with prescribed amplitudes.  Miller and
Stumpf, on the other hand, accomplished the same thing by sounding organ
pipes.  Still different, Koenig synthesized vowel spectra from a siren in
which air jets were directed at rotating, toothed wheels.
	At least one more-recent design for a mechanical talker has been
put forward (Riesz, unpublished, 1937). Air under pressure is brought from
a reservoir at the right.  Two valves control the flow.  The first valve
admits air into a chamber in which a reed is fixed. The reed vibrates and
interrupts the air flow much like the vocal cords.  A spring-loaded slider
varies the effective length of the reed and changes its fundamental
frequency.  Unvoiced sounds are produced by admitting air through the
second valve.  The configuration of the vocal tract is varied by means of
nine movable members representing the lips, teeth, tongue, pharynx, and
velar coupling.
	To simplify the control, Riesz constructed the mechanical talker
with finger keys to control the configuration, but with only one control
each for lips and teeth (which worked in opposition to each other). The
different members were covered with a soft rubber lining to accomplish
realistic closures and dampings.  Two keys (4 and 5) operate excitation
valves (V4 and V5), arranged somewhat differently than the first two.
Valve V4 admits air through a hole forward in the tract for producing
unvoiced sounds.  Valve V5 supplies air to the reed chamber for voiced
excitation.  In this case pitch is controlled by the amount of air passed
by the valve V5.  When operated by a skilled person, the machine could be
made to simulate connected speech.  One of its particularly good utterances
was reported to be "cigarette".


ELECTRICAL METHODS FOR SPEECH SYNTHESIS
---------------------------------------


	With the evolution of electrical technology, interest in speech
synthesis assumed a broader basis.  Academic interest in the physiology and
acoustics of the signal-producing mechanism was supplemented by the
potential for communicating at a distance.  Although "facsimile waveform"
transmission of speech was the first method to be applied successfully
(i.e. in the telephone), many early inventors appreciated the resonance
nature of the vocal system and the importance to intelligibility of
preserving the short-time amplitude spectrum *.  Analytical formulation and
practical application of this knowledge were longer in coming.
	
SPECTRUM RECONSTRUCTION TECHNIQUES
----------------------------------

	Investigators such as Helmholtz, D.C. Miller, R. Koenig and Stumpf
had earlier noted that speech-like sounds could be generated by producing
an harmonic spectrum with the correct fundamental frequency and relative
amplitudes.  In other words, the signal could be synthesized with no
compelling effort at duplicating the vocal system, but mainly with the
objective of producing the desired percept.  Among the first to demonstrate
the principle electrically was Stewart, who excited two coupled resonant
electrical circuits by a current interrupted at a rate analogous to the
voice fundamental.  By adjusting the circuit tuning, sustained vowels could
be simulated.  The apparatus was not elaborate enough to produce connected
utterances.  Somewhat later, Wagner devised a similar set of four
electrical resonators, connected in parallel, and excited by a buzz-like
source.  The outputs of the four resonators were combined in the proper
amplitudes to produce vowel spectra.
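
	The spectrum-reconstruction idea described above (a harmonic
spectrum with a chosen fundamental frequency and relative amplitudes) can
be illustrated with a few lines of Python.  This is only a minimal sketch
and not a model of any of the historical instruments: the function name,
the sample rate and the amplitude values are all assumptions made up for
the illustration.

```python
import math


def harmonic_vowel(f0, amplitudes, sample_rate=8000, num_samples=400):
    """Sum sinusoidal harmonics of f0 with the given relative amplitudes.

    amplitudes[0] scales the fundamental, amplitudes[1] the second
    harmonic, and so on.  All names and defaults here are illustrative.
    """
    samples = []
    for i in range(num_samples):
        t = i / sample_rate
        value = sum(
            a * math.sin(2.0 * math.pi * (k + 1) * f0 * t)
            for k, a in enumerate(amplitudes)
        )
        samples.append(value)
    return samples


# Hypothetical amplitude pattern; not measured from any real vowel.
wave = harmonic_vowel(100.0, [1.0, 0.5, 0.25])
```

	Changing the list of amplitudes changes the timbre while the pitch
stays fixed by f0, which is the sense in which the percept is produced
without imitating the vocal system itself.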
	Probably the first electrical synthesizer which attempted to
produce connected speech was the Voder (Dudley, Riesz, and Watkins). It was
basically a spectrum-synthesis device operated from a finger keyboard.  It
did, however, duplicate one important physiological characteristic of the
vocal system, namely, that the excitation can be voiced or unvoiced.  
	The "resonance control" box of the device contains 10 contiguous
band-pass filters which span the speech frequency range and are connected
in parallel. All the filters receive excitation from either the noise
source or the buzz (relaxation) oscillator.  The wrist bar selects the
excitation source, and a foot pedal controls the pitch of the buzz
oscillator.  The outputs of the band-pass filters pass through
potentiometer gain controls and are added.  Ten finger keys operate the
potentiometers.  Three additional keys provide a transient excitation of
selected filters to simulate stop-consonant sounds.
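
	The signal path just described (one excitation source, ten parallel
band-pass filters, a gain per key, and a final sum) can be sketched in
Python as follows.  This is a rough illustration under stated assumptions,
not a reconstruction of the Voder: the sample rate, the band centre
frequencies, the bandwidth, and the use of a simple two-pole resonator in
place of each band-pass filter are all hypothetical choices, and the
stop-consonant keys are left out.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz; assumed for this sketch

# Ten hypothetical band centres (Hz); the real filter bands are not
# given in the text.
BAND_CENTRES = [300 + 600 * k for k in range(10)]
BANDWIDTH = 600.0  # Hz; also an assumption


def buzz(pitch_hz, n):
    """Pulse train standing in for the buzz (relaxation) oscillator."""
    period = max(1, round(SAMPLE_RATE / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def noise(n, seed=0):
    """Uniform random samples standing in for the noise source."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def resonator(signal, centre_hz, bandwidth_hz):
    """Two-pole resonator used as a simple stand-in for a band-pass filter."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * centre_hz / SAMPLE_RATE)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def voder_frame(key_gains, voiced, pitch_hz=120.0, n=800):
    """Mix the ten filter outputs, each scaled by its key gain."""
    if len(key_gains) != len(BAND_CENTRES):
        raise ValueError("expected one gain per band")
    # The wrist bar chooses the source; the pedal sets the buzz pitch.
    source = buzz(pitch_hz, n) if voiced else noise(n)
    out = [0.0] * n
    for gain, centre in zip(key_gains, BAND_CENTRES):
        if gain == 0.0:
            continue  # key not pressed: this band contributes nothing
        band = resonator(source, centre, BANDWIDTH)
        for i, b in enumerate(band):
            out[i] += gain * b
    return out


# Example: press the two lowest keys with voiced excitation.
frame = voder_frame([1.0, 0.5] + [0.0] * 8, voiced=True)
```

	Every filter sees the same excitation, so the operator shapes the
spectrum only through the ten gains, while voicing and pitch are set
independently by the source selection and the pitch value.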
	This speaking machine was demonstrated by trained operators at the
World's Fairs of 1939 (New York) and 1940 (San Francisco).  Although the
training required was quite long (on the order of a year or more), the
operators were able to "play" the machines - literally as though they were
organs or pianos - and to produce intelligible speech **.  More recently,
further research studies based upon the Voder principle have been carried
out (Oizumi and Kubo).

----

	



	* Prominent among this group was Alexander Graham Bell. The events - in
connection with experiments on the "harmonic telegraph" - that led Bell, in
March of 1876, to apply the facsimile waveform principle are familiar to
most students of communication.  Less known, perhaps, is Bell's conception
of a spectral transmission method remarkably similar to the channel
vocoder.
	Bell called the idea the "harp telephone".  It consisted of an
elongated electromagnet with a row of steel reeds in the magnetic circuit.
The reeds were to be arranged to vibrate in proximity to the pole of the
magnet, and were to be tuned successively to different frequencies.  Bell
suggested that "-they might be considered analogous to the rods in the harp
of Corti in the human ear".  Sound uttered near the reeds would cause to
vibrate those reeds corresponding to the spectral structure of the sound.
Each reed would induce in the magnet an electrical current which would
combine with the currents produced by other reeds into a resultant complex
wave.  The total current passing through a similar instrument at the
receiver would, Bell thought, set identical reeds into motion and reproduce
the original sound (Watson).
	The device was never constructed.  The reason, Watson says, was the
prohibitive expense!  Also, because of the lack of means for amplification,
Bell thought the currents generated by such a device might be too feeble to
be practicable.  (Bell found with his harmonic telegraph, however, that a
magnetic transducer with a diaphragm attached to the armature could, in
fact, produce audible sound from such feeble currents.)
	The principle of the "harp telephone" carries the implication that
speech intelligibility is retained by preserving the short-time amplitude
spectrum.  Each reed of the device might be considered a combined
electro-acoustic transducer and bandpass filter.  Except for the mixing of
the "filter" signals in a common conductor, and the absence of rectifying
and smoothing means, the spectrum reconstruction principle bears a striking
resemblance to that of the channel Vocoder.


	** H.W. Dudley retired from Bell Laboratories in October 1961.  On the
completion of his more than 40 years in speech research, one of the Voder
machines was retrieved from storage and refurbished.  In addition, one of
the original operators was invited to return and perform for the occasion.
Amazingly, after an interlude of twenty years, the lady was able to sit
down to the console and make the machine speak.