This is the second article in the series “The Robot Spoke” by Dr. Patrice Caire, AI & Social Robotics Consulting Scientist.
A friendly warning: the modern history of AI is, well, controversial. So, buckle your seatbelt—and enjoy the bumpy ride.
In the previous article in this series, we saw how much of a struggle it was, despite the Golden Age of AI, for scientists to make robots move cubes around on a computer screen—let alone speak and understand natural language. By the late 1980s—the so-called Second Golden Age of AI—robots and humans still couldn’t converse casually. Something earth-shattering had to happen.
The Next Big Thing: Machine Learning
The field of Artificial Intelligence encompasses all possible approaches to simulating intelligence. Machine learning, one branch of AI, uses data and experience to train algorithms automatically. Deep learning, a sub-field of machine learning, in turn uses many-layered algorithms (called neural networks) to simulate the learning process.
At the turn of the twenty-first century, machine learning gained momentum as a way of teaching computers to talk. The art of learning by examples—the core of machine learning—was not new: Alan Turing had envisioned it. But many less-famous figures have made massive contributions to AI and its latest cousin, machine learning.
In the 1940s, thinkers of various kinds tried to get machines to learn by experience; they used the model of “neural networks,” among other approaches. Scientists had realized that human neurons—the cells found by the billions in our brains—behave rather like simple electrical switches, and that those switches could be imitated mathematically. So, the scientific brain trust took this simple neuroscientific model and recreated it as a series of mathematical models for computer science. These artificial neural networks, in the form of software, could learn to sort data into categories by filtering that data through artificial nodes. Brilliant!
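To get a feel for what one of those early artificial nodes boils down to, here is a minimal sketch in modern Python (the weights and threshold are illustrative, not historical values): a node simply adds up its weighted inputs and “fires” only if the total clears a threshold.

```python
# A minimal sketch of a 1940s-style artificial "neuron" (node): it sums its
# weighted inputs and fires (outputs 1) only if the sum reaches a threshold.
# The weights and threshold below are illustrative, not historical.

def artificial_neuron(inputs, weights, threshold):
    """Return 1 if the weighted sum of inputs reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# Example: a node tuned to act like a logical AND gate on two binary inputs.
print(artificial_neuron([1, 1], weights=[0.6, 0.6], threshold=1.0))  # -> 1
print(artificial_neuron([1, 0], weights=[0.6, 0.6], threshold=1.0))  # -> 0
```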
Machines Start to Learn How to Learn

Frank Rosenblatt with a Mark I perceptron computer in 1960.
Photo: Bloomberg.
In 1956, computer scientist John McCarthy had learning machines firmly in mind when he co-organized the Dartmouth workshop that gave Artificial Intelligence its name. Then, in 1958, psychologist and AI expert Frank Rosenblatt took these neuroscientific ideas further, creating an innovative learning program called the perceptron. Simulating human learning, Rosenblatt’s ingenious perceptron could “learn” to distinguish cubes from spheres on a screen—the way a toddler would. The perceptron breakthrough débuted in The New York Times under the creepy headline, “Embryo of Computer Designed to Read and Grow Wiser.” Creepily or not, the Times nailed it by acknowledging that the program was at an early stage, performing only very simple tasks. Further innovation—and far more computing power—would be required to train large neural networks, with many layers of nodes, if the perceptron were ever to “read and grow wiser.”
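For the curious, here is a minimal sketch of the kind of rule the perceptron used to “learn”: whenever it guesses wrong, it nudges its weights toward the right answer. The toy data below is invented for illustration; any two linearly separable groups of points would do in place of Rosenblatt’s shapes.

```python
# A minimal sketch of the perceptron learning rule: nudge the weights
# whenever the prediction is wrong. The toy 2-D points below stand in
# for Rosenblatt's shapes; they are made up for illustration.

def predict(weights, bias, x):
    """Fire (1) if the weighted sum plus bias is positive, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=0.1):
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Two made-up clusters of points, labeled 0 and 1.
samples = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels  = [0, 0, 0, 1, 1, 1]
w, b = train_perceptron(samples, labels)
print(predict(w, b, (1.5, 1.5)), predict(w, b, (5.5, 5.5)))  # -> 0 1
```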

Dr. Kunihiko Fukushima, 2003.
Photo: Tokyo University.
Almost twenty years later, in 1979, Japanese computer scientist Kunihiko Fukushima designed a computer program called the neocognitron, which could recognize handwritten Japanese characters. Then, in 1986, British-born computer scientist Geoff Hinton co-authored a landmark paper on “back-propagation,” which may sound like a way to grow petunias in your back garden, but is really a crucial machine-learning technique still in use today. Hinton and his cohort showed how multi-layered neural networks could be trained, laying the groundwork for what we now call “deep learning.” Flash-forward to 2020—and deep learning is still the catchphrase of today.
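To make “back-propagation” a little less mysterious, here is a bare-bones sketch in Python with NumPy, purely for illustration (the data and network sizes are invented, not the 1986 paper’s experiments): a tiny network makes a prediction, and the error is pushed backwards through the layers, via the chain rule of calculus, to tell each weight which way to move.

```python
import numpy as np

# A bare-bones illustration of back-propagation: a tiny one-hidden-layer
# network, trained by pushing the prediction error backwards through the
# layers to compute gradients. Toy data, invented for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # 200 made-up 2-D inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]  # a nonlinear target

W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)  # hidden layer
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)  # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(2000):
    # Forward pass: input -> hidden layer -> prediction
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    if step % 500 == 0:
        print(f"step {step}: loss {np.mean((p - y) ** 2):.3f}")  # should steadily fall
    # Backward pass: chain rule, layer by layer (squared-error loss)
    d_out = (p - y) * p * (1 - p)          # error signal at the output
    d_hid = (d_out @ W2.T) * (1 - h ** 2)  # error signal at the hidden layer
    W2 -= 0.5 * h.T @ d_out / len(X); b2 -= 0.5 * d_out.mean(axis=0)
    W1 -= 0.5 * X.T @ d_hid / len(X); b1 -= 0.5 * d_hid.mean(axis=0)
```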

Notice how the handwritten number 8 (upper left) is correctly recognized.

One key ingredient in the success of deep learning was the ability to create models for deeper, bigger, more interconnected neural networks. Other ingredients, which Hinton et al. also brought to the table, were new techniques for training these neural networks on many examples and tons of data. What else was needed to get machines and robots to eventually talk via machine-learning techniques? Two additional technical elements, which only engineering and data science could provide: computing power and data processing.
Then, like technical genies granting technical wishes, engineers in the mid-1990s delivered computing power in the form of a new type of computer processor, originally developed for high-quality gaming animations: the GPU. The Graphics Processing Unit—that overpriced thing we still buy in the name of having a “graphics card”—did all the computational heavy lifting needed to train neural networks.
Computers Win at Chess: #notimpressed

Photo: Scientific American
By 1996, computers were smart enough to play chess with people—and win. In fact, one of IBM’s projects, Deep Blue, pitted a computer against the world’s reigning chess champion, Garry Kasparov. On February 10, 1996, Deep Blue beat the pants off Kasparov in the opening game of their match, winning in just 37 moves.
As a grad student in computer science in the era of Deep Blue, I was thrilled when the machine won. My professor at the time, Thomas Anantharaman, used the occasion to teach us students the very AI search techniques—and the actual code—that he had used in creating Deep Thought, the foundation for Deep Blue.
In celebrating Deep Blue’s triumph, I was in good company. A computer winning at chess represented the start of a new process: de-mystifying human intelligence.
Despite Deep Blue’s symbolic significance, however, the chess-winning program lacked the one characteristic crucial to speech recognition: the ability to learn. After all, Deep Blue won by using core AI-search techniques, powered by brute computational force (it could evaluate some 200 million chess positions per second!); Deep Blue did not necessarily win by learning how to play a better game of chess. So, for computer scientists across the board, the next move had to be building a better algorithm. Better algorithms would enable learning—and better decisions.
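For a taste of the search techniques in question, here is a toy illustration of minimax with alpha-beta pruning, the classic family of game-tree search that chess programs like Deep Blue built on. The tree and its scores below are invented; this is not Deep Blue’s actual code.

```python
# A toy illustration of game-tree search: minimax with alpha-beta pruning
# over a hand-made tree of position scores. Real chess programs search real
# board positions; the values here are invented for the example.

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Return the best achievable score from `node`, pruning hopeless branches."""
    if isinstance(node, (int, float)):        # leaf: a scored position
        return node
    best = float("-inf") if maximizing else float("inf")
    for child in node:
        score = alphabeta(child, not maximizing, alpha, beta)
        if maximizing:
            best, alpha = max(best, score), max(alpha, score)
        else:
            best, beta = min(best, score), min(beta, score)
        if beta <= alpha:                     # the opponent would never allow this line
            break
    return best

# A tiny two-ply game tree: each inner list is a choice point, leaves are scores.
toy_tree = [[3, 5], [6, 9], [1, 2]]
print(alphabeta(toy_tree, maximizing=True))   # -> 6
```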
For most computer scientists of the 1990s, better algorithms were indeed the priority. An exception to this obsession was a lone thinker who would singlehandedly harness the power of data.
The Woman Who Made Machines See

Photo: Christie Hemm Klok
Before machines could talk, they learned to see. This would require integrating knowledge from electrical engineering, physics, and computer science. Voilà: Dr. Fei-Fei Li, currently of Stanford, is the woman who brought all these fields together. Drawing on degrees in all of the above, Li realized that even the “best” algorithm wouldn’t work if the data used did not reflect the real world. In 2006, Li went public with her idea of building a real-world dataset by creating a large online archive of images, carefully classified into thousands of categories. Li said at the time, “We decided we wanted to do something that was completely historically unprecedented, we’re going to map out the entire world of objects.”
In search of a conceptual model for this image archive, Li in 2007 met with Princeton Professor of Linguistics Christiane Fellbaum, co-founder of WordNet, a word archive. Soon, Li was envisioning the world’s most extensive image database, a ginormous dataset that matched images with words. Using WordNet’s database, developed by Fellbaum and the psychologist George Miller, Li designed and built ImageNet, which is now the world’s largest database of its kind.
Li’s ImageNet quickly became a benchmark—and a catalyst—for every computer scientist seeking to train their algorithm. As Li herself has said, “One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research.”
Indeed, by 2009, ImageNet was an extraordinary data source, consisting of 3.2 million images organized into 5,247 labels. And now, ImageNet boasts 15 million images organized into 22,000 labels—that’s more data than even Google’s benchmark dataset. So, lesson learned: Li taught us that the dataset, and not the algorithm alone, is “front and center” in the pursuit of the talking robot.
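Conceptually, such a dataset is simple, even if building one at ImageNet’s scale was anything but: each label (in ImageNet’s case, a noun category drawn from WordNet) maps to a pile of example images. Here is a toy sketch; the labels and file names are made-up placeholders, not ImageNet’s actual contents.

```python
# A sketch of how a labeled image collection is organized: each label maps
# to many example images. Labels and file paths below are hypothetical.

from collections import defaultdict

dataset = defaultdict(list)
dataset["Siberian husky"] += ["img_0001.jpg", "img_0002.jpg"]
dataset["teapot"]         += ["img_0003.jpg"]

def summarize(data):
    """Report the scale of the collection: number of labels and total images."""
    total = sum(len(files) for files in data.values())
    return f"{len(data)} labels, {total} images"

print(summarize(dataset))  # -> "2 labels, 3 images"
```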

Photo: TengQi Ye
By figuring out how to build ImageNet, Li cracked the code of how to collect and organize millions of images. You’ll undoubtedly recognize the data-collection approach Li pioneered for academia: crowdsourcing. At the time, the concept of crowdsourcing was known mainly from e-commerce. Li made the imaginative leap of taking an existing crowdsourcing tool, Amazon’s Mechanical Turk, and re-purposing it for finding and labeling images. Li recalls the discovery: “Suddenly we found a tool that could scale….” In fact, the task of building ImageNet was so monumental that even with the help of Mechanical Turk workers, the first basic dataset took two and a half years to complete.
But who knew that getting machines to “see” would end up being easier than getting them to “say”? Li knew, apparently. And notably, Li, along with Fukushima, Hinton, and their pals, was not striving to build artificial brains. Rather, all of these thinkers were searching for inspiration from neuroscience to create computer programs that could learn how to recognize patterns and sequences—in everything from speech, to handwriting, to images. So far, everyone from Turing to Li had been laying the foundation for building the talking robot.
Computers Get a Voice
You probably have a relationship with Siri, that hit-or-miss oracle we all love to hate. But do you think of Siri as a talking robot?
Technically, Siri ain’t no robot—Siri is an application. To qualify for “robot” status, Siri would need the ability to move independently and physically interact with the environment. Despite this, we still sort of love Siri. After all, Siri will forever be the first virtual assistant with a voice, a disembodied talking robot.
So, who gets the credit for inventing Siri? Apple? Well, while Siri was launched in Apple’s App Store (February 2010), Siri was not an Apple creation. You could say Siri had a bunch of ancestors. First, there was DARPA (the Defense Advanced Research Projects Agency), a branch of the US military. Then, there was SRI International (formerly the Stanford Research Institute). In 2003, DARPA awarded SRI $22 million for its “Personal Assistant that Learns” project. This was HUGE for the field of speech recognition: SRI now had the opportunity to build on DARPA’s forty years of funded research.
Four years later, in 2007, SRI was ready to develop a commercial application of its personal assistant. A holy trinity of experts took the lead: human/computer-interface aficionado Dr. Adam Cheyer; Norwegian entrepreneur Dag Kittlaus; and Tom Gruber, a computer scientist and psychologist. It was Kittlaus who had the honor of naming the invention: Siri is a Scandinavian name—Paul Auster’s second wife, the novelist Siri Hustvedt, bears it.
This trio of speech-recognition experts trained Siri to answer questions like, “What’s the weather like in New York today?” To enable Siri’s answer, the audio file of the question was sent off to remote servers (now known as the cloud), whereupon speech-recognition software transcribed the words into text—and Natural Language Processing software interpreted the words.
This was the pivotal juncture: interpretation. Siri used machine-learning algorithms, including deep learning, along with large datasets of actual human voices. Siri’s training involved recognizing the complexities of tone, accent, and intention in human language.
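Put schematically, the pipeline looks something like the sketch below. Every function here is a hypothetical stand-in, not Siri’s actual code; the real stages run learned models on cloud servers. But the flow from audio, to text, to meaning, to reply is the point.

```python
# A schematic sketch of the voice-assistant pipeline described above.
# Every function is a hypothetical placeholder, not Siri's actual API:
# real systems run these stages as learned models on remote servers.

def transcribe(audio_bytes: bytes) -> str:
    """Speech recognition: turn an audio recording into text (stubbed)."""
    return "what's the weather like in new york today"

def interpret(text: str) -> dict:
    """Natural-language understanding: extract the intent and its details (stubbed)."""
    return {"intent": "get_weather", "location": "New York", "date": "today"}

def respond(meaning: dict) -> str:
    """Pick an action and phrase a reply (stubbed weather lookup)."""
    if meaning["intent"] == "get_weather":
        return f"Here's the forecast for {meaning['location']} {meaning['date']}."
    return "Hmm, I'm not sure."

def assistant(audio_bytes: bytes) -> str:
    """End-to-end flow: audio -> text -> meaning -> spoken reply."""
    return respond(interpret(transcribe(audio_bytes)))

print(assistant(b"...recorded question..."))
```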
Two weeks after Siri’s launch via the App Store, Steve Jobs moved to acquire the talkative digital assistant; two months later, the deal was done. The following year, 2011, an integrated version of the app appeared on the iPhone 4S—and Siri as we know it was born.
Why Interpreting Human Language is So F*%!ing Hard
I’m sure you recognize this common phrase in Siri’s lexicon: “Hmmm, I’m not sure.” We all know what it means: Siri just doesn’t get it.
Why is human language so difficult for a robot—even a disembodied one—to understand? One reason is that language is inherently ambiguous; it is profoundly dependent on context, as well as on a huge amount of background knowledge. Unfortunately, this background info is second nature to the humans who engage in verbal or written communication—and woefully unfamiliar to robots. Often, the linguistic ambiguities can only be resolved by applying real-world knowledge. Ask Siri to “book a table for four,” and it has to know you mean a restaurant reservation, not a piece of furniture, without your ever saying so. Real-world knowledge—this innocuous-sounding term encompasses prior knowledge of underlying concepts in a multitude of contexts: physical, emotional, and social, among others.
So sure, as of 2011, Siri could tell you things about the weather and the menu of your favorite restaurant. But it would take more research—and more head-banging—for the field of computer science to go beyond our disembodied punching bag.
Stay tuned for Part 3 in this series, “The Robot Spoke—and Sounded Smarter than Ever.”