Call it pomp and circuitry.
New research out of UCLA finds that ChatGPT “performs about as well as college undergraduates” when it comes to reasoning questions that often show up on standardized testing.
The large language model has proverbially graduated to higher education since late 2022, when Furman University assistant philosophy professor Darren Hick described its “clean” writing style to The Post as that of “a very smart 12th-grader.”
UCLA’s study tested GPT-3 against 40 undergrad Bruins in Southern California. A much more advanced model, GPT-4, is now available, and it outdid its predecessor in parts of the research.
GPT-3 scored around 80% correct on IQ questions based on Raven’s Progressive Matrices. That was “well within the range of the highest human scores,” as the average human participant answered only about 60% correctly.
Last January, GPT-3 was given a Wharton School MBA exam and scored in the B to B-minus range.
Weeks before, Hick said he felt “abject terror” about how a program like ChatGPT could interfere in academia as he caught students allegedly using the program to do assignments.
At the time, his greatest fear was that the artificial intelligence would keep learning from its mistakes to the point where professors could no longer tell the difference between bot and human work.
The newest UCLA findings only fan the flames of such a worry.
“Surprisingly, not only did GPT-3 do about as well as humans but it made similar mistakes as well,” said senior study author and UCLA psychology professor Hongjing Lu.
Co-author Keith Holyoak even said “GPT-3 might be kind of thinking like a human.”
GPT-3 especially succeeded with analogical reasoning, a problem-solving skill long believed to be exclusive to humans, in which a familiar example is used to work out an unfamiliar problem.
“Language learning models are just trying to do word prediction so we’re surprised they can do reasoning,” Lu added.
Still, these findings shouldn’t come as too much of a shock, as people are responsible for GPT’s reinforcement learning from human feedback, a process in which human ratings of the model’s answers are used to fine-tune how it responds.
The program was also tasked with answering SAT analogy questions that had never been published, a safeguard to rule out the possibility that the college admissions exam’s questions were part of the model’s training data.
“They compared GPT-3’s scores to published results of college applicants’ SAT scores and found that the AI performed better than the average score for the humans,” according to a release on the study.
Next, researchers want to better understand how AI language models learn and improve their so-called IQs, as much of that process remains a mystery to the untrained public.
“People did not learn by ingesting the entire internet, so the training method is completely different [than that of people],” Holyoak said. “We’d like to know if it’s really doing it the way people do, or if it’s something brand new — a real artificial intelligence — which would be amazing in its own right.”