The robot spoke, but what did it say? (Part 1) – An article by Dr. Patrice Caire

This is the first of the 3-part series “The Robot Spoke” by Dr. Patrice Caire, AI & Social Robotics Consulting Scientist.

The Robot Spoke: But What Did It Say?
Her name was Julie, and she was from a classy family of humanoid robots called Nao. It was 2015: Julie—and her cohort of robots—had been trying to understand human speech since, well, forever. Most robots, of course, were still stuck in factories, hauling boxes and tightening nuts and bolts. But Julie was different. She had a future. She had a life.


Dr. Caire tests Julie’s ability to switch between languages—so she can interact with museum visitors.

How do I know so much about Julie’s life—her hopes and dreams, her speech patterns and how much of life she really understood? I was her handler, her agent, if you will. By 2014, I had created the Social Robotics Lab at the University of Luxembourg. And it was my sole purpose in life to help Julie fulfill her dreams: I had to teach Julie to speak, and, crucially, to be understood. By hordes of visitors at Luxembourg’s Museum of Modern Art, no less, and in twenty-seven languages.

It was a herculean task. But, armed with a computer-science PhD, and a team of twenty other experts and scientists, I was up for it. 

The Fantasy of Humans and Robots Kibitzing
We know that Artificial Intelligence—and speech recognition in particular—has had its ups and downs. Traditionally, when it's on the upswing, speech recognition sees huge investments flow into labs like the Stanford Research Institute's Artificial Intelligence Center. When AI optimists can secure funding, they dangle the hope of harnessing speech recognition for everything from winning wars (military intelligence) to hand-feeding your great-aunt Angelica in her nursing home (AI for healthcare).


Star Wars’ C-3PO

The Holy Grail of speech recognition has always been the ability of humans and machines—like our robot-friend Julie—to carry on a flawless conversation about anything from the new Beyoncé video to the price of tea in China. If you think having small talk with Miss Julie sounds easy, then think again: the bar for speech recognition is sky-high. Remember the “Star Wars” robot C-3PO, R2-D2’s chatty sidekick? Well, C-3PO was fluent in over six million forms of communication—that’s how grandiose the Hollywood version of speech recognition has always been.

Metropolis’ Maria

The coveted ability to understand human language—natural language—is what puts the “Intelligence” in AI. You could even argue that the search for speech-recognition systems is older than Artificial Intelligence itself. After all, AI is said to have begun in earnest in the 1950s; but wasn’t it already “a thing” way back in 1925, say, when Thea von Harbou wrote her novel Metropolis? Von Harbou and her husband, Fritz Lang, co-wrote the screenplay for the 1927 silent film “Metropolis,” the futuristic story of Maria, the poor worker from the underworld. In the film, human-Maria gets cloned into hyper-verbal robot-Maria, who trash-talks practically everyone, as part of a diabolical plan to take over the world.

Talk about predicting the future—von Harbou’s speech-spewing robot arrived decades before mid-century scientists had even started their search for mouthy machines.

Now that we know the fantasy of robots talking, let’s look at the reality.

Who Decided Machines Could Even Think?
Arguably, the first brainiac to figure out that machines could think—let alone speak—was Alan Turing. You know this guy, the brilliant English mathematician and subject of the 2014 hit movie “The Imitation Game.” Turing formulated the concept of a universal computing machine in the 1930s, and in 1950 he asked the ground-breaking question, “Can machines think?”—and the world was never the same.

Enter Betty and Claude Shannon. This dynamic duo noticed that machines were expected to use the same alphabet as English-speaking humans—the same twenty-six letters. No dice, they said. The Shannons showed that the blank space—which they counted as a twenty-seventh letter—carried real information: treating it as a symbol in its own right captured the statistical structure of written English, making language far easier for machines to model.
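To see why the space deserves its own seat at the table, here is a minimal, illustrative sketch (not the Shannons' actual method) that estimates the per-symbol entropy of a text over the 27-symbol alphabet—the twenty-six letters plus the blank space:

```python
import math
from collections import Counter

def symbol_entropy(text):
    """Estimate per-symbol entropy (in bits) of a text over the
    27-symbol alphabet: a-z plus the blank space."""
    # Normalize: lowercase, keep only ASCII letters and spaces.
    cleaned = [c for c in text.lower()
               if (c.isalpha() and c.isascii()) or c == " "]
    counts = Counter(cleaned)
    total = sum(counts.values())
    # Standard Shannon entropy: -sum p * log2(p) over observed symbols.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = "the space between words carries information too"
h = symbol_entropy(sample)
# A uniform 27-symbol alphabet would give log2(27) ~ 4.75 bits per symbol;
# real English comes in lower, because symbols are not equally likely.
print(f"{h:.2f} bits per symbol")
```

Drop the spaces from the alphabet and the word boundaries vanish—exactly the structure the twenty-seventh letter preserves.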

Sound important? It was. But the Shannons were just at the beginning of the First Golden Age of Artificial Intelligence—and the push to get robots to talk.

An Introduction to Speech Recognition, Starring Noam Chomsky
John McCarthy, a mathematician, was one of the first people to bring together a gaggle of researchers to force computers to simulate human intelligence—including the ability to engage in meaningful conversation. McCarthy et al. called their initiative the search for “Artificial Intelligence.” That was 1956; two years later, McCarthy produced the first AI programming language, LISP, which is still in use today. He went on to found the Stanford University AI Lab.

Another quantum leap occurred in 1957, when the American linguist and super-sphinx Noam Chomsky proposed a set of grammatical rules—a formal grammar—that could both generate sentences and parse them (break them down into meaningful chunks). Chomsky’s seminal book, Syntactic Structures, paved the way for machine translation.

With Chomsky’s theory of language as a foundation, computer scientists thought that they, too, had the keys to the conversational castle. Scores of conversation programs, dialog systems, and query systems followed, many of which had suspiciously human names, like ELIZA, GUS, and SAD SAM.
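A formal grammar of the kind Chomsky described is just a set of rewrite rules. The toy grammar below—my own illustrative example, not Chomsky's—expands a sentence symbol step by step into actual words, the same mechanism those early conversation programs leaned on:

```python
import random

# A toy context-free grammar: each nonterminal (left) rewrites
# into one of the alternatives on the right.
GRAMMAR = {
    "S":   [["NP", "VP"]],          # sentence = noun phrase + verb phrase
    "NP":  [["Det", "N"]],          # noun phrase = determiner + noun
    "VP":  [["V", "NP"]],           # verb phrase = verb + noun phrase
    "Det": [["the"], ["a"]],
    "N":   [["robot"], ["visitor"]],
    "V":   [["greets"], ["sees"]],
}

def generate(symbol="S"):
    """Expand a symbol top-down until only terminal words remain."""
    if symbol not in GRAMMAR:        # a terminal word: nothing to expand
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    return [word for part in expansion for word in generate(part)]

print(" ".join(generate()))  # e.g. "the robot greets a visitor"
```

Run the rules forward and the grammar generates sentences; run them in reverse over an input sentence and you have a parser—the two sides of Chomsky's coin.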


Two children play with Seymour Papert’s Logo Turtle – cover photo from Papert’s book “Mindstorms”, 1980.

Getting Robots to Listen (Though Not Yet Speak)
For as much as we love Chomsky and his contribution to speech recognition, we can’t credit him with singlehandedly teaching robots to talk. Nobody walked that path alone. During this First Golden Age of AI, many thinkers took Chomsky’s foundational language-theory and ran with it. Seymour Papert, of MIT, is a prime example. Around 1967, Papert pioneered talking machines by producing the “Logo Turtle,” a little mobile robot. Children could tell the robot what to do by typing in commands on a typewriter connected to a computer (a teletype machine). The robot couldn’t answer back (yet), but it could understand the request—and trace the designs kids told it to, right there on the floor. Papert later collaborated with the toy manufacturer Lego on robotics kits that took their name from—you guessed it—his 1980 book Mindstorms.

Papert’s contribution to the talking robot went beyond his own inventions: Papert’s doctoral student, Terry Winograd, wrote a pivotal program called SHRDLU for his thesis. By typing your request, you could ask SHRDLU to move colored cubes and triangles on a computer screen, and SHRDLU would display a message like, “The red cube or the blue cube?” It was phenomenal: SHRDLU let users interact with the computer in human language (English)—rather than having to learn a programming language.

SHRDLU sure was a breakthrough—but its language comprised a mere 50 words. Fortunately, though, SHRDLU also had a good memory; when asked, it could tell you all of your previous requests. SHRDLU could even extract meaning from context: when you referred to an object as “it,” SHRDLU recognized “it” as the last object that had been moved. And there was more: when SHRDLU didn’t understand what you meant, SHRDLU complained, saying bluntly, “I don’t understand what you mean.” You could get a conversation going with SHRDLU—as long as that conversation was about cubes.

So, for as exciting as SHRDLU was, scientists knew that there was more to life than pushing around geometric shapes on an iridescent screen—even if some people predicted robots as useful mainly for schlepping boxes (cubes) in warehouses (computer screens). The hunt—for machines that understood natural language—was on.  

The Robot Gets a Body (But the Mouth Can’t Talk)
Even by the late 1980s, robots and humans still couldn’t converse casually. Something drastic had to happen, learning-wise. MIT prof Rodney Brooks identified exactly what robots—and scientists—needed to learn. Brooks stipulated that, like babies, robots had to interact with the world around them if they were to learn to talk with humans. To some people, Brooks’ premise sounded too generic to be useful. But those people were wrong: Brooks’ approach was revolutionary, and he went on to show that “artificial” speech could only emerge organically, by interacting with real people in the real world (and not from the sterile environment of some canned computer program). For Brooks, human intelligence had to be “embodied” intelligence.

Cog on display at the MIT Museum

Enter Cog, arguably the world’s first humanoid robot. Born in 1993, Cog had a human-sized metal torso, which was plunked onto a metal stand; Cog’s arms spanned six-and-a-half feet. Despite being an unwieldy pile of metal, Cog was designed to think. As such, he/she/it/they were equipped with rudimentary speech-recognition software, an artificial voice, and microphones for ears. Cog could learn new things—and we’re not just talking about learning to play with a Slinky like a pro. Cog’s built-in cameras allowed it to check you out the moment you walked into the lab. Then Cog, ever the good host, would try to engage you in conversation.

 There was one tiny drawback, however: while Cog’s physical abilities were rather good, Cog’s cognitive abilities were not all there. In other words, Cog didn’t understand much of what you said.

Did we say tiny drawback? Actually, it was a dealbreaker. But Cog was also a meaningful start at talking robots.


Cynthia Breazeal’s Kismet social robot. Photo: Sam Ogden, 1998

Meet Kismet, the Robo-baby
Fortunately, Cog eventually had a kissing cousin: Kismet. In the great tradition of geniuses teaching other geniuses (remember Papert mentoring Winograd?), Brooks mentored Cynthia Breazeal when she was still a grad student at MIT. In 1996, Breazeal created Kismet, Cog’s infinitely more social relation. Kismet inherited Cog’s sponge-like mind. What’s more, Breazeal endowed Kismet with the ability to relate to people and form friendships—which is essential to all human communication. No computer scientist before Breazeal had focused on relatability as a key criterion for making robots interact. Breazeal’s innovation was epic: obsessed with how robots learn, she created the field of Social Robotics.

In face-to-face interactions, Kismet could win you over in the blink of an (artificial) eye. To understand your side of the conversation, Kismet read your lips. Kismet also responded to the modulation of your voice, by triggering its speech-processing and speech-recognition software. If Kismet liked what you said, Kismet cooed, leaned forward, and batted its eyelashes. If Kismet didn’t like what you said, Kismet drew back; lowered its pink, paper ears; and replied with a distinctly human note of sadness in its voice. (Aw, Kismet doesn’t like it.)

Breazeal built Kismet according to the developmental model of a six-month-old child: infants this age possess communication skills based almost entirely on responding to social cues. As such, Kismet could recognize a six-month-old’s vocabulary—including the phrase “I love you”—provided the speaker paused long enough between words.

Talking to Robots, Naturally
While Kismet was busy saying “I love you,” a pair of PhDs, Janet and Jim Baker, were hard at work trying to solve the riddle of speech recognition. Janet, a biophysicist, and Jim, a mathematician, fell in love and gave birth to a dragon: the commercial speech-recognition system Dragon NaturallySpeaking. In their research, Janet focused on the sounds of human speech, while Jim analyzed the voice-signal patterns—the waveforms that, played back through a speaker, a person hears as language.

The Bakers’ approach to speech recognition was far from mainstream. Others believed, à la Chomsky, that to recognize speech, a machine had to, for example, recognize and understand the rules of English grammar. The Bakers, in contrast, were determined to calculate the likelihood that two- or three-word combinations would appear together. Using these word combinations, the Bakers created a phonetic-word dictionary. Then, they invented an algorithm to make sense of it all.
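The heart of that statistical idea—counting which words tend to follow which—can be sketched in a few lines. This is an illustrative bigram model over a made-up three-sentence corpus, not the Bakers' actual algorithm:

```python
from collections import Counter

def bigram_model(corpus):
    """Count adjacent word pairs and turn them into conditional
    probabilities: P(next word | current word)."""
    pairs = Counter()     # counts of (word, next_word)
    contexts = Counter()  # counts of each word appearing as a context
    for sentence in corpus:
        words = sentence.lower().split()
        contexts.update(words[:-1])          # every word that has a successor
        pairs.update(zip(words, words[1:]))  # every adjacent pair
    return {(w1, w2): n / contexts[w1] for (w1, w2), n in pairs.items()}

corpus = [
    "the robot spoke",
    "the robot listened",
    "the visitor spoke",
]
probs = bigram_model(corpus)
print(probs[("the", "robot")])  # 2 of the 3 occurrences of "the" precede "robot"
```

No grammar rules anywhere—just counts turned into probabilities, which is exactly why the approach looked heretical to the rule-based camp.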

The Bakers’ speech-recognition system had no knowledge of English grammar; nor did it have intelligence. Dragon NaturallySpeaking software used—and still uses—only numbers. It was a heretical idea in its day—an idea that became a proven formula: by 1997, the Bakers’ system could convert continuous speech into digitized English text in real time, drawing on a monumental vocabulary of 230,000 words.

With Dragon NaturallySpeaking software, you could finally talk to your computer like a long-lost friend. What happened next, in the annals of speech recognition?

Stay tuned for Part 2 in this series, in the November #SmartReads: “The Robot Spoke: Machines Get the Hang of It.”