It Still Doesn’t Really Matter What A.I. Can Score on IQ Tests

If we’re already bad at measuring human intelligence, is it possible to measure intelligence in something composed of algorithms?


Measuring human intelligence is already a controversial and complicated process, not least because there’s no stringent definition of what intelligence even is. So it goes without saying that trying to measure machine intelligence is another iteration of an already flawed process. But whereas placing human intelligence on a scale is perhaps unnecessarily reductionist, measuring machine intelligence is a fraught necessity. A.I. systems are designed with specific tasks and services in mind, so in order to say, “This iteration is more effective than another,” you need a framework that makes that comparison quantifiable.

To that end, researchers from China have just developed what is ostensibly a new kind of IQ test for A.I. systems and human beings alike. It’s not the first time scientists have attempted to peg an IQ number to A.I. (historically those programs barely test better than an average toddler). But the Chinese researchers, in a new preprint paper, say they’ve developed a unique standard for assessing IQ in different A.I. agents. They used it on a variety of A.I. assistant services last year and found that Google Assistant was among the most intelligent programs currently available, while Apple’s Siri ranked last.

That sounds like brutal news for Apple, but there are reasons to take the results with a grain of salt. The paper released by the researchers lacks a clear outline of what exactly its tests were assessing. And since we know “intelligence” is so nebulous, that’s a real problem.

According to Oren Etzioni, an A.I. researcher and CEO of the Allen Institute for Artificial Intelligence in Seattle, that’s fine as long as you understand that measuring machine IQ is comparable to measuring human IQ: it’s perhaps more practical as a means of comparison than it is as a final word on “intelligence.” The same way humans give one another standardized tests like the SAT to decide who’s “smarter” than whom, A.I. scientists give machines the same sort of tests to find machines’ strengths.

Though they don’t offer up their exact criteria, the trio of researchers say they designed their “standard intelligence model” to measure the following: the ability to obtain data from the external world; the ability to process that data in a way that’s understandable and analyzable; the ability to generate some sort of novel insight about the data; and the ability to feed those conclusions back into the external world, for others to obtain and respond to.

Based on those four criteria, the researchers came up with a means of testing intelligence that can be translated onto a 100-point scale. They purportedly administered their tests to actual human adults in 2014, and the average score for those 18 years or older was around 97 points. Six-year-olds averaged a score of 55.5.

In the 2016 A.I. testing rounds, no machine was able to crack the 50-point threshold, but Google Assistant came closest, racking up a 47.28. The Chinese personal assistant Duer, created by Baidu, scored 37.20. Bing came in at 31.98. Apple’s Siri rounded out the top 10 with a score of 23.94.

It’s not quite clear why Siri would rank so low (the study’s researchers did not respond to inquiries, but we’ll update if they do), but one might presume Google’s recent pivot to an A.I.-first approach is paying off. Although Etzioni finds the lack of details in the new paper pretty concerning, he notes that other trials in 2016 comparing Google’s and Siri’s abilities to answer questions gave Google a slight edge, too.

But that doesn’t mean Google is actually more intelligent than Siri. Etzioni goes on to say both systems, and others like them, fall apart upon follow-up inquiries, “anything beyond these little doohickey, factoid-style questions” that ask things like what the weather is in Mumbai or who the third person to walk on the moon was. “These systems are trivia experts—no smarter than a doorknob.”

The problem is that A.I. systems are limited by a feeble understanding of how to even interpret the questions. A question that asks whether a plant will grow better under a window isn’t difficult because the machine doesn’t know what photosynthesis is (although language is a whole other obstacle for A.I.); it’s difficult because the machine can’t use common-sense experience to infer that a window will allow more light exposure. This chasm is what Etzioni calls the difference between general artificial intelligence that can operate like a human and narrow A.I. savants that can do one or a few things really well and nothing else.

That’s partly what makes testing for IQ in A.I. so frustrating. Building an A.I. able to ace an SAT test doesn’t necessarily mean it’s equipped to do much else. Even if an A.I. agent can pull out information from the entire Encyclopedia Britannica on a penny-drop or solve complex integrals in calculus, it doesn’t mean it has the common-sense skills needed to go out and order a sandwich from the nearest deli. How would it know how to wait in line? Decide what’s a better combo deal? Charm the cook for an extra dollop of guacamole? The kinds of intelligence these skills require are emotional and social intelligence, which are much more difficult to test for. And it’s also why, even if A.I. masters this IQ test, we still won’t have to worry that the singularity is any closer.