Roberto Pieraccini 2012. The Voice in the Machine. Building computers that understand speech. Cambridge, Massachusetts: The MIT Press. 325 pp.


The author of this book, Roberto Pieraccini, has worked in speech recognition for most of his career and has plenty to say about the history of the field, from both theoretical and commercial points of view. It was thought-provoking to read this in parallel with Tracy Kidder’s The Soul of a New Machine: one written by a key researcher in the field, the other by a mighty good journalist and writer. In both cases technology of great commercial value is the outcome. In both cases there are tricky technological concepts to explain. In both cases the technology has marched far on. A pity that both books were not written by Tracy Kidder.

Nevertheless, a rewarding and interesting book, at least in part. The initial chapters are about language and how humans speak languages, and about how humans have attempted to program machines to reproduce human speech. There is a lot of interesting history, much of which goes back to the 1950s. Reproducing speech is a lot harder than learning to understand speech, and to me not so interesting right now, so I skipped much of this. Human babies understand what is being said about a year earlier than they can talk themselves. Dogs are still working on it. A corollary would seem to be that when AI matures, it will have understood what is going on long before it is able to articulate it, and so we will be late to become aware of that, or perhaps may never be aware of it.

Chapter 3, Artificial Intelligence versus brute force, is again historical, about the details and development of expert systems. Chapter 4 is the one I read most closely, for the clear explanation of hidden Markov models in speech recognition. The gist of the chapter is in a quote from an early researcher in machine learning and statistical approaches to speech recognition:

Every time I fire a linguist, our system performance improves.

It seems that (leaving Andrey Markov aside) the breakthrough came from Leonard Baum and Lloyd Welch at the Institute for Defense Analyses in the 1960s and 1970s, starting with Leonard E. Baum and Ted Petrie (1966), Statistical Inference for Probabilistic Functions of Finite State Markov Chains, The Annals of Mathematical Statistics 37: 1554-63. Since then, of course, there have been many more applications in many different fields.

Chapters 5, 6 and 7 are about building the machines that can understand, interpret and respond to speech. Chapters 8 and 9 are about making these efforts commercially viable and useful. Chapter 10 is speculation about the future.

I think I came to this book through a timely discovery in a second-hand bookstore, following a reference in Ray Kurzweil’s How to Create a Mind. I had been under the impression that Kurzweil and Roberto Pieraccini had worked together on the technology that was eventually bought by Apple and became Siri. However, reading this book and doing some web searches suggested that perhaps they were in competing startups, at least some of the time? I haven’t got to the bottom of that, and it doesn’t seem to matter much right now. Anyway, Roberto Pieraccini seems to be a bit coy about his role in all this, and he does nothing to clarify it in his epilogue, evidently added just before the book was published in 2012. The epilogue makes brief mention of how some researchers founded the company Siri in 2007 to build speech recognition software for smartphones, how 3 years later Apple bought them out ($200 million?), and the announcement in 2011, coincidentally a day before the death of Steve Jobs, that the next iPhone would include Siri.

Roberto Pieraccini has a suitably impressive web page too. In one of his blog posts, he notes:

A friend of mine used to say “A good general is a lazy general”, meaning that when you have to take an important decision, the more you delay it to gather more data, the better the chance is to take a good decision, eventually. The same concept applies to speech recognition. In fact modern state-of-the art speech recognizer … do not take any decision until they have gained enough evidence …

Just what I sometimes think I can feel the Hidden Markov Models in my mind doing during conversation.
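
For the curious, the “lazy general” idea in the quote above can be sketched with Viterbi decoding, a standard way of finding the most likely state sequence in a hidden Markov model. This is only a toy illustration, not anything taken from the book: the two states "b" and "p", the observation labels "lo" and "hi", and all the probabilities are made-up assumptions. The point is that the decoder keeps every hypothesis alive and commits to a path only after the last observation has been scored.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # Each column maps state -> (log score of best path ending there, previous state).
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Best predecessor for state s; nothing is discarded or decided yet.
            score, prev = max(
                (columns[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)
    # Only now, with all the evidence in, commit to a decision and trace back.
    last = max(states, key=lambda s: columns[-1][s][0])
    path = [last]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Hypothetical toy model: all labels and numbers are invented for illustration.
states = ["b", "p"]
start_p = {"b": 0.5, "p": 0.5}
trans_p = {"b": {"b": 0.9, "p": 0.1}, "p": {"b": 0.1, "p": 0.9}}
emit_p = {"b": {"lo": 0.6, "hi": 0.4}, "p": {"lo": 0.4, "hi": 0.6}}
observations = ["lo", "hi", "hi"]

# An impatient general decides on the first observation alone.
first_guess = max(states, key=lambda s: start_p[s] * emit_p[s][observations[0]])
# The lazy general waits for everything before deciding.
path = viterbi(observations, states, start_p, trans_p, emit_p)

print("first-frame guess:", first_guess)
print("decoded path:", path)
```

In this toy case the first observation on its own favours "b", but once the later observations are in, the decoder settles on "p" for the whole sequence, including the first frame.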