While tech giants tout their AI models as virtually human-like in reasoning capabilities, Apple's latest study has punched some serious holes in those claims. The research team put top models from OpenAI, Google, Anthropic, and DeepSeek through their paces using puzzles like Tower of Hanoi and River Crossing instead of standard math problems. Spoiler alert: the results weren't pretty.
These so-called "intelligent" systems collapsed outright once puzzle complexity crossed a threshold. Not just struggled: they failed, dropping to zero percent accuracy. And no, it wasn't because they ran out of computational juice. The models had plenty of resources and were even handed the solution algorithms on a silver platter. They still bombed.
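For context, the Tower of Hanoi has a well-known recursive solution, the kind of explicit algorithm the researchers reportedly handed to the models (the exact code given in the study isn't reproduced here; this is just the standard textbook version as a sketch):

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk,
    # then stack the n-1 disks back on top of it.
    return (hanoi(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi(n - 1, spare, target, source))

print(len(hanoi(3)))  # optimal solution for 3 disks: 2**3 - 1 = 7 moves
```

The optimal solution for n disks takes 2^n − 1 moves, so difficulty explodes fast: 7 disks already require 127 moves. That exponential ramp is what makes the puzzle a clean dial for testing how models cope as complexity grows.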
What's bizarre is how inconsistently these AI darlings performed. Some could solve puzzles requiring over 100 moves yet crashed and burned on simpler ones needing just 11 steps. Make it make sense! The study revealed weird behavioral quirks too: the systems overthink simple problems, waste effort on wrong paths, and, get this, actually reduce their reasoning effort as tasks get harder. Exactly backward from how you'd want an "intelligent" system to behave.
The real kicker? These models aren't reasoning at all. They're pattern-matching machines: feed them unfamiliar problems, and they're lost. The paper specifically challenges the industry's loose use of terms like "reasoning" and "thinking" to describe these systems. Even the purpose-built large reasoning models (LRMs) hit the same complexity wall as their standard cousins, often giving up entirely when confronted with challenging, unfamiliar problems.
Apple's timing is no accident. The company dropped this bombshell just before WWDC 2025, raising eyebrows about its own AI plans. But the message is clear: the emperor's new clothes aren't nearly as impressive as advertised. Whatever the industry's projections of AGI by 2030, current AI remains firmly bound by the limits of its human-generated training data.
The study also exposes a fundamental flaw in how we measure AI capabilities: current benchmarks are misleading because models have often trained on data resembling the test problems. It's like memorizing answers to a specific exam rather than actually learning the subject.
Despite the marketing hype from Silicon Valley's biggest names, genuine machine reasoning remains frustratingly out of reach.

