How do you measure artificial intelligence?
Since the idea first took hold in the 1950s, researchers have gauged the progress of AI by establishing benchmarks, such as the ability to recognize images, create sentences and play games like chess. These benchmarks have proved a useful way to track whether AI can do more things, and do them better, and to drive researchers toward creating AI tools that are even more useful.
In the past few years, AI systems have surpassed many of the tests researchers have proposed, beating humans at many tasks. For researchers, the mission now is to create benchmarks that capture the broader kinds of intelligence that would make AI truly useful: benchmarks, for instance, that reflect elusive skills such as reasoning, creativity and the ability to learn, not to mention areas like emotional intelligence that are hard enough to measure in humans.
An AI system, for instance, can perform well enough that humans can’t always tell whether, say, an image or a paragraph was created by a human or a machine. Or ask an AI system who won the Oscar for best actress last year and it would have no problem. But ask why the actress won, and the AI would be stumped. It would lack the reasoning, the contextualizing, the emotional understanding that is needed to adequately answer.
“We’ve done the easy part,” says Jack Clark, co-chair of the AI Index, a Stanford University report that tracks AI development. “The big question is, what do really ambitious benchmarks look like in the future, and what do they measure?”
After all, he says, “There is a big difference between telling me the right answer and telling me the right answer with a very good explanation.”
The Turing Test
A look back at benchmarks offers a vivid illustration of how far AI has come, and the challenges it still faces.
The first benchmark came from the English computer scientist Alan Turing. In 1950, Mr. Turing wrote, “I propose to consider the question, ‘Can machines think?’ ” To determine that, he described an experiment—later dubbed the Turing Test—in which a human judge considers a conversation between a person and a machine designed to generate humanlike responses. If the judge can’t correctly identify which conversationalist is the human, the machine passes what Mr. Turing called the Imitation Game.
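To make the setup concrete, here is a minimal sketch, in Python, of how such an evaluation could be scored. The judge, human and machine here are placeholder functions, not anything from Mr. Turing's paper, and in this sketch the machine "passes" only if judges can do no better than chance.

```python
import random

def imitation_game(judge, human_reply, machine_reply, prompts, trials=100):
    """Illustrative Imitation Game harness (a sketch, not Turing's protocol verbatim).

    human_reply(prompt) and machine_reply(prompt) return text;
    judge(prompt, [text_a, text_b]) returns 0 or 1, its guess for the human.
    """
    correct = 0
    for _ in range(trials):
        prompt = random.choice(prompts)
        replies = [("human", human_reply(prompt)), ("machine", machine_reply(prompt))]
        random.shuffle(replies)                 # hide which speaker is which
        guess = judge(prompt, [text for _, text in replies])
        if replies[guess][0] == "human":
            correct += 1
    # The machine passes if the judge does no better than a coin flip.
    return correct / trials <= 0.5
```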
Trying to pass the Turing Test was the goal for the earliest AI efforts (although according to Michael Wooldridge, head of the computer-science department at the University of Oxford, Mr. Turing himself considered it largely just a thought experiment).
It wasn’t until the 1990s that researchers started to shift the benchmarks from matching human intelligence to beating it on specific tasks, according to Prof. Wooldridge. “If you want a program to do something for you, there is no reason for it to be humanlike,” he says. “What you want it to do is make the best decision possible.”
Progress followed, including IBM’s Deep Blue program beating world chess champion Garry Kasparov in 1997, a milestone that was seen as a leap forward in AI development.
The AI spring
But advancements really picked up in the recent “AI spring,” which many would date as starting in 2012. That was the year the ImageNet Challenge—a test to see if an algorithm could correctly detect and identify what was shown in the photos contained in a database of 14 million pictures—saw a breakthrough. AlexNet, a type of AI algorithm called a neural network, got an error rate of 15.3%, a score 10.8 percentage points lower than the previous best attempt. After most competitors had error rates under 5% in 2017, the researchers behind the contest said they would work on a new, more challenging version.
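The ImageNet figures refer to what is known as top-5 error: a photo counts as correctly labeled if the true label appears among a model's five highest-ranked guesses. A minimal sketch of that metric, with toy data standing in for a real model's output:

```python
def top5_error(predictions, true_labels):
    """Fraction of images whose true label is missing from the top-5 guesses.

    predictions: list of ranked label lists (best guess first)
    true_labels: corresponding list of ground-truth labels
    """
    misses = sum(1 for ranked, truth in zip(predictions, true_labels)
                 if truth not in ranked[:5])
    return misses / len(true_labels)

# Toy usage: one hit and one miss give a 50% top-5 error rate.
print(top5_error([["cat", "dog", "fox", "wolf", "lynx"],
                  ["car", "bus", "van", "truck", "tram"]],
                 ["dog", "bicycle"]))
```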
In the past couple of years, systems that can understand natural language, as well as those that can accurately decipher digital images and video, have broken through a succession of benchmarks. For instance, in 2018, a benchmark called GLUE was released, requiring AI systems to pass tests such as recognizing whether two sentences paraphrase each other and determining whether a movie review is positive, negative or neutral. Many of these tasks were beaten so quickly that the researchers upgraded the benchmark to SuperGLUE at the end of 2019. By January of this year, researchers working to create systems to beat the SuperGLUE benchmark had already surpassed what most humans are able to do.
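Benchmarks such as GLUE score a system's accuracy on each task and compare it with a human baseline. The sketch below illustrates the idea with a hypothetical paraphrase-detection task; the word-overlap "model" and the baseline figure are invented for illustration, not part of GLUE itself.

```python
def accuracy(model, examples):
    """Share of examples the model labels correctly."""
    return sum(model(s1, s2) == label for s1, s2, label in examples) / len(examples)

def naive_model(sentence_a, sentence_b):
    # Placeholder "model": call two sentences paraphrases if most words overlap.
    a, b = set(sentence_a.lower().split()), set(sentence_b.lower().split())
    return len(a & b) / len(a | b) > 0.5

examples = [  # (sentence 1, sentence 2, are they paraphrases?)
    ("The movie was great.", "The film was excellent.", True),
    ("He left at noon.", "He arrived at midnight.", False),
]
HUMAN_BASELINE = 0.9  # illustrative figure, not the published GLUE baseline
score = accuracy(naive_model, examples)
print(f"model accuracy: {score:.2f}, beats human baseline: {score > HUMAN_BASELINE}")
```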
These ever-shorter timelines for beating benchmarks have researchers calling for benchmarks structured in a way that can keep up with the pace of innovation in AI.
“We now need to scale up the benchmarks,” says Prem Natarajan, the vice president of natural understanding for Amazon’s Alexa.
Amazon’s Alexa Prize, which challenges university students to create a chatbot that can naturally talk about various subjects for 20 minutes and earn a score of at least 4.0 out of 5.0 from a panel of human judges, is one of the company’s attempts to scale up a benchmark so that it lasts longer and tests more abilities. The Alexa Prize has been held three times (a fourth iteration is ongoing) without any team meeting both the time and score requirements needed to win.
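In code terms, the winning condition described above is a conjunction: a team must clear both the 20-minute mark and the 4.0 average judge score. A small sketch of that check, with invented scores:

```python
def wins_alexa_prize(conversation_minutes, judge_scores,
                     min_minutes=20.0, min_avg_score=4.0):
    """Both thresholds must be met; hitting only one is not enough."""
    avg = sum(judge_scores) / len(judge_scores)
    return conversation_minutes >= min_minutes and avg >= min_avg_score

# A long chat with middling ratings still falls short.
print(wins_alexa_prize(22.5, [4.2, 3.8, 3.9]))  # False (average below 4.0)
print(wins_alexa_prize(21.0, [4.3, 4.1, 4.0]))  # True
```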
Last month, the company announced a new test called the TaskBot Challenge that it says is the first conversational AI challenge to incorporate both voice and images. The test will see how well an AI can go back and forth between voice and images to get customers the information they want, starting with cooking and home-improvement tasks. After the conversation ends, the users will be asked to judge the helpfulness of the TaskBot on a scale of 1 to 5, as well as provide free-form feedback.
“A lot of the useful things about AI is the interaction, how it brings in all these components into an integrated experience,” Mr. Natarajan says. “The moment it becomes interactive, a new variable enters into the benchmark, which is the unpredictability of human responses.”
Creativity benchmark
Creativity is another area that researchers are struggling to develop benchmarks for. “We don’t have a good definition of intelligence, and we don’t have a good definition of creativity,” says Mark Riedl, an associate professor at the Georgia Tech School of Interactive Computing. “But intelligence and creativity are inextricably linked.”
He proposed the Lovelace 2.0 test in 2014 as a way to measure creativity. In the paper, he proposes having a human judge give the AI a creative task—say, create a poem or paint a picture—based on a specific request. If the human judge finds that the AI followed the request, as well as any other parameters the judge might add, such as that the creation be novel or surprising, then the AI has passed. The judge could add more difficult requirements, such as going from a five-line poem to something longer, in subsequent rounds.
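A rough sketch of that judge-in-the-loop structure, with one more constraint added each round; the generator and judge here are placeholder functions, and the loop is an illustration rather than code from Dr. Riedl's paper.

```python
def lovelace_rounds(generator, judge, base_task, extra_constraints):
    """Run escalating rounds of a Lovelace-2.0-style test (illustrative only).

    generator(task, constraints) returns an artifact (a poem, an image, ...);
    judge(artifact, task, constraints) returns True if every constraint is met.
    Returns how many rounds the system passed before its first failure.
    """
    passed = 0
    constraints = []
    for extra in [None] + list(extra_constraints):
        if extra is not None:
            constraints.append(extra)        # each round adds a harder requirement
        artifact = generator(base_task, constraints)
        if not judge(artifact, base_task, constraints):
            break
        passed += 1
    return passed

# Example call (hypothetical model and judge):
# lovelace_rounds(my_model, human_judge,
#                 "write a five-line poem about benchmarks",
#                 ["make it longer than ten lines", "make it surprising"])
```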
Like the Turing Test, Lovelace 2.0 was more of a thought experiment than an actual test. When it first came out, Dr. Riedl didn’t know of any AI that could pass even one round. But within a couple of years, papers documenting systems that could create an image based on a text prompt started to come out. Earlier this year, DALL-E, a program out of the AI research lab OpenAI, showed it could make whimsical illustrations from a few phrases picked by a user, like a baby daikon radish in a tutu walking a dog.
“DALL-E is a very good example of a system that is potentially passing a round or two of the Lovelace test, which is something in 2014 I didn’t envision,” Dr. Riedl says.
As far as he knows, his test hasn’t been used by any researchers.
Accuracy vs. bias
As today’s AI systems get smarter, their shortcomings have been laid bare, and this in turn has prompted a need to alter some benchmarks. For instance, facial-recognition systems are strikingly good overall and yet can’t always recognize women with darker skin, in part because the systems are trained on data sets that are weighted toward photos of white men.
“The focus of all AI development was all on accuracy, especially when it came to benchmarks,” says Youjin Kong, an associate visiting professor of philosophy at Oregon State University who works on ethics and social philosophy in AI. “But what’s the purpose of competition to increase accuracy if the data set itself is biased?”
As a result, researchers are trying to create more balanced data sets; earlier benchmarks rarely accounted for racism or sexism embedded in the data, or for the privacy of the people included in them.
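One way researchers surface such gaps, shown in the sketch below, is to report accuracy broken out by subgroup rather than as a single aggregate number; the groups and figures here are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Per-group accuracy from (group, predicted_label, true_label) records.

    A single aggregate accuracy can hide large gaps between subgroups,
    which is the concern the biased-data-set critique raises.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        hits[group] += int(predicted == actual)
    return {group: hits[group] / totals[group] for group in totals}

# Invented numbers: overall accuracy looks fine, but one group lags badly.
records = [("group_a", "match", "match")] * 95 + [("group_a", "match", "no_match")] * 5 \
        + [("group_b", "match", "match")] * 13 + [("group_b", "match", "no_match")] * 7
print(accuracy_by_group(records))  # {'group_a': 0.95, 'group_b': 0.65}
```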
Dr. Kong also points out that how we evaluate human intelligence is changing. The focus used to be all on IQ tests, but now we are learning to value social intelligence and emotional intelligence. This reframing could influence how we measure AI’s competence and intelligence.
Gary Marcus, a professor emeritus of psychology at New York University and chief executive of AI startup Robust AI, agrees that areas such as emotional intelligence need breakthroughs and will be hard to define with benchmarks. But that shouldn’t stop researchers from exploring those areas, says Dr. Marcus. “Benchmarks shouldn’t be the only way we do science,” he says.
Ms. Snow is a writer in Washington, D.C. She can be reached at reports@wsj.com.