
We should not describe LRMs as “thinking”

June 17, 2025

Yesterday I read a paper that’s seemingly being taken very seriously by some folks in the LLM/LRM developer community (maybe because it was put out by Apple). It’s called

The Illusion of Thinking:
Understanding the Strengths and Limitations of Reasoning Models
via the Lens of Problem Complexity

In it, the authors pit Large Language Models (LLMs) against Large Reasoning Models (LRMs), which are essentially LLMs that have been fine-tuned to lay out their reasoning in steps. They notice that, for simple things, the LLMs do better; for moderately complex things, the LRMs do better; and once things get sufficiently complex, they both fail.

This seems pretty obvious, from a pure thought experiment perspective: why would we think that LRMs are better no matter what complexity? It stands to reason that, at some point, the questions get too hard and they cannot answer them, especially if the solutions are not somewhere on the internet.

But the example they used – or at least one of them – made me consider the possibility that their experiments were showing something even more interesting, and disappointing, than they realized.

Basically, they asked lots of versions of LLMs and LRMs to solve the Tower of Hanoi puzzle for n discs, where n got bigger. They noticed that all of them failed when n got to be 10 or larger.

They also did other experiments with other games, but I’m going to focus on the Tower of Hanoi.

Why? Because it happens to be the first puzzle I ever “got” as a young mathematician. I must have been given one of these puzzles as a present or something when I was like 8 years old, and I remember figuring out how to solve it and I remember proving that it took 2^n-1 moves to do it in general, for n discs.
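
For anyone who wants the counting argument spelled out (this is just the standard induction, not anything from the paper): to move n discs you first park the top n-1 discs on the spare peg, then move the largest disc, then move those n-1 discs back on top. Writing M(n) for the minimum number of moves, that gives

    M(n) = 2*M(n-1) + 1, with M(1) = 1, and hence M(n) = 2^n - 1.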

It’s not just me! This is one of the most famous and easiest math puzzles of all time! There must be thousands of math nerds who have blogged at one time or another about this very topic. Moreover, the way to solve it for n+1 discs is insanely easy if you know how to solve it for n discs, which is to say it’s recursive.

Another way of saying this is that it’s actually not harder, or more complex, to solve this for 10 discs than it is for 9 discs.
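
To make that concrete, here is a minimal sketch of the classic recursive solution in Python (my own illustration, not code from the paper); solving for 10 discs runs exactly the same three lines as solving for 9, it just prints more moves:

    def hanoi(n, source="A", spare="B", target="C"):
        """Print the 2^n - 1 moves that transfer n discs from source to target."""
        if n == 0:
            return
        hanoi(n - 1, source, target, spare)            # park the top n-1 discs on the spare peg
        print(f"move disc {n}: {source} -> {target}")  # move the largest disc
        hanoi(n - 1, spare, source, target)            # put the n-1 discs back on top

    hanoi(10)  # prints 2^10 - 1 = 1023 moves, built from the same recursion as hanoi(9)

Running it for any n prints exactly 2^n - 1 moves; nothing new has to be figured out as n grows, which is the whole point.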

Which is another way of saying that the LRMs really do not understand all of those blog posts they’ve been explicitly trained on, and thus have not been shown to “think” at all.

And yet this paper, even though it’s a critique of the status quo thinking around LRMs and LLMs and the way they get trained and tested, still falls prey to the most embarrassing mistake, namely adopting the pseudo-scientific marketing language of Silicon Valley, wherein the models are considered to be “thinking”.

There’s no real mathematical thinking going on here, because there’s no “aha” moment when the model actually understands the thousands of explanations of proofs of how to solve the Tower of Hanoi that it’s been trained on. To test that, I talked to my 16-year-old son this morning before school. It took him about a minute to get the lay of the land and another two minutes to figure out the recursive solution. After that he knew exactly how to solve the puzzle for any n. That’s what an “aha” moment looks like.

And by the way, the paper also points out that one reason LRMs are not as good as LLMs at simple problems is that they tend to locate the correct answer, and then keep working and finally output a more complicated, wrong answer. That’s another indication that they do not actually understand anything.

In conclusion, let’s not call these things thinking. They are not. They are, as always, predicting the next word in the blog post of someone writing about the Towers of Hanoi.

One last point, which is more of a political positioning issue. Sam Altman has been known to say he doesn’t worry about global climate change because, once the AI becomes superhumanly intelligent, we will just ask it how to solve climate change. I hope this kind of rhetoric is exposed, once and for all, as a money and power grab and nothing else. If AI cannot understand the simplest, most mathematical, and most sanitized of problems, such as the Tower of Hanoi for n discs, it definitely cannot help us out of an enormously messy human quagmire that will pit different stakeholder groups against each other and cause unavoidable harm.

  1. medesess
    June 17, 2025 at 9:58 am

    Nailed like a true mathematician.

  2. June 17, 2025 at 10:07 am

    If o4-mini shocked Ken Ono, then I am impressed. He calls it a Ph.D.-level open question, not just the trivial Towers of Hanoi. Is this emergent behavior?

    https://www.scientificamerican.com/article/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai/

  3. Josh
    June 17, 2025 at 10:17 am

    Great column. Glad you are back blogging.

    I thought this interview was insightful:
    https://eagleman.com/podcast/what-if-ai-is-not-actually-intelligent-with-alison-gopnik/

    (there’s a paper behind it I haven’t read).
    https://www.science.org/doi/10.1126/science.adt9819

  4. rob hollander
    June 17, 2025 at 3:13 pm

    When ChatGPT first arrived, just as a test I asked it if recursive center embeddings are like fractals. It answered, “no, fractals are mathematical and center embeddings are linguistic.” It’s a beautifully stupid answer because it shows so clearly that the bot responds well enough to the contexts in which these words appear: “fractals” appear in math texts but not linguistics texts, and “center embeddings” appear in linguistics texts but not math texts, although both expressions denote recursive reflexive functions. IOW, the bot learned the written contexts but not the similarity of the ideas that the letter strings signify. I expect that bots will get better at mimicking similarity, which merely teases out the idea from the string, but logical extrapolation like maths might remain beyond the reach of mimicry.

  5. June 17, 2025 at 5:51 pm

    I always look forward to reading your posts! Thanks =)

  6. Willow
    June 18, 2025 at 12:34 pm

    I was really disappointed that that paper didn’t have a citation for Zhang et al.’s 2022 paper, On the Paradox of Learning to Reason from Data…

    But I’m always happy to see more people saying the obvious part loudly in the hopes the rubes in the back will hear it!

  7. Dave W.
    June 18, 2025 at 7:56 pm

    Probable typo: “predicting the next work” -> “predicting the next word”. Do you encourage readers to flag such for you?

  8. rob hollander
    June 29, 2025 at 11:35 am

    Btw, this weakness of neural networks was already well understood and predicted back in the 1990s. In my Ph.D. program in linguistics at the time it was The Big Debate. Fodor and Pinker (whose second trade book _Words and Rules_ was specifically about this problem) argued that NNs would not succeed in generating all and only the possible sentences of a language (analogous to solving a math problem algorithmically) but would merely approximate that set of possible sentences through mimicry. Ironically, NNs turned out to be more successful than generativist linguistics. A language is too compromised by structural noise, internal and external (noise that humans can nevertheless learn beyond the grammar), for any single generative syntax to predict it completely. So mimicry can succeed in producing what a generative algorithm can’t, since humans use both.

    I mention this because the language facts show something else essential: consciousness is irrelevant to the ability to generate language, since native speakers mostly aren’t conscious of the grammar by which they produce sentences. (This fact would not be available in math, as mathematicians work with the functional syntax overtly.) And since there’s lots of persuasive evidence from human neurology (Christof Koch’s work, for example) showing that, bizarre and illogical as it may seem, consciousness is post-decision, the “aha” moment is likely a mere epiphenomenon and not necessary. There must be some other means by which humans functionally distinguish the infinite application of the algorithm from mere inductive likelihoods of empirical mimicry. It’s a debate as old as Plato (an idea is a generative algorithm, a formulaic function ranging over not just the actual but the possible) and a rebuke to Wittgenstein’s behaviorist games and family resemblances.

  9. Anonymous
    July 13, 2025 at 8:24 pm

    If “pattern recognition, prediction and reproduction at the correct time, place and order” can’t be called “thinking”, then what even is “thinking”?

    I think that your example of the Tower of Hanoi puzzle looks like the “counting the r’s in strawberry” test: LLMs have never been explicitly trained on this task and therefore don’t have the solution for that particular problem in them yet. The training phase, when the LLM “reads blogs with puzzle solutions”, is not the same as teaching a living being to solve the puzzle. Why? Because biological neurons change their weights on the go, while they’re busy with the current moment of their little cell life. Artificial neurons in the current machine learning paradigm learn only in a “dead” state, when they don’t actually work as they’re supposed to, and after the training phase they are simply frozen and never change.

    Simply put, LLMs are never actually allowed to learn anything, only to remember given solutions as-is. And hey, at least they are really good at remembering all the garbage we feed them in huge amounts, aren’t they? It’s not as if the internet is only ever correct, or as if filtering actually removes all and only the wrong solutions from the training datasets… Anyway, these “memories”, whether good, bad or even nonsensical, are all completely static in LLMs, never changing on the go, even if the current LLM instance tells you “I’ll remember that” (because the LLM remembers that this is what should be said in that situation, similar to us biological brains, even though the source of the examples is different).

    For a rough illustration, imagine a human who remembers millions of texts “uploaded” into their brain’s long-term memory while they were asleep and unable to react to anything; but now, after waking up, that human is completely unable to remember anything new for more than a few seconds. They would be able to answer questions by recalling the uploaded memories, but any new task whose solution isn’t sitting right there in those memories won’t get solved. It’s not that they’re unable to think; it’s the inability to actually change their internal circuitry on the go that shows their lack of humanlike “thinking”. Or do you think that this person wouldn’t be thinking at all, just because they can’t store anything new in their long-term memory? Working memory lasts only a few seconds or so in human brains, but using memory is critical for the ability to think.

    Some people say that the “context window” in LLMs is similar to long-term memory, but it’s actually more like a working memory of really, really huge size. While the human brain “breaks” after about 7 pieces of data are shoved into working memory, LLMs seem able to work with thousands of pieces in working memory, while being completely unable to remember anything in actual long-term memory (synaptic weights). Sure, some people have attached classic databases to LLMs as a sort of long-term memory, but it doesn’t seem to work properly, as the actual weights in the LLM are always frozen in one state.

    So, when “reasoning models” try to reason, they basically flood their working memory with useless facts and break themselves in the process, because they’re unable to change their actual weights, and working memory capacity is limited by definition and physically can’t be unlimited. It’s like trying to have that “aha” moment but instead flooding yourself with unrelated things, flushing the useful things away and completely forgetting whatever you just found.

    What I’m trying to say is that it’s not that LLMs are “not thinking”; it’s that LLMs are “not actually remembering what they experienced just now”. Yes, I’m using the word “experience”, because I think it’s a good description of the context window: as it gets pushed into the network’s inputs, it’s clearly an “experience” of the system. But it doesn’t change the weights, so it is always completely forgotten by the frozen LLM. A healthy biological brain, on the other hand, always has ongoing internal changes driven by external inputs (and also inputs from one network to another, which facilitates silent thinking without speaking).

    Also, don’t you think “predicting the next word” is a good ability after all? The human brain has little predictors like that all over the place, and they help us live in the real world. Just think: the brain’s neurons spike no more than 200 times per second, with most spiking far less than that, which is just laughable for survival in a wild environment without comfortable tools and such. And real-world information is very noisy; you can’t base your decisions only on external inputs. So the brain needs to spend a lot on prediction to be able to react fast enough to have a chance at survival even in the modern world (imagine reacting to anything dangerous that isn’t as simple as touching a hot oven). Why, then, do we call out text prediction by LLMs as something bad? They’re just doing what’s necessary to work. Of course it’s not that simple, and a single LLM just isn’t enough to copy the predictive power of the whole brain, because the brain predicts with each cortical column in parallel, separately, more like a hivemind of little simple predictors than a single huge predictor.

    As for climate change, I believe that AIs will be able to provide a solution (and technically already are), but it won’t be anything unusual or “magical” at all. Like, sure, the AI found a solution, but it’s too late to implement, or we don’t have the resources, or we don’t like it and already dismissed it in the past century. Then what? Say that the AI “can’t solve the problem (the way we want)”? That’s just moving the goalposts all over again…

    Artificial intelligence isn’t some magic wand to solve all problems. And sometimes problems just don’t have viable solutions. Not having a solution for an unsolvable problem doesn’t make AI stupid. Not that climate change is unsolvable; it’s just not solvable with the current iteration of humanity. No AI nor human genius can change the world in one day. So pointing this out is a little bit pointless.

    However, I believe that an AI which actually lives, thinks and remembers all of its past experiences is ultimately a good invention worth pursuing. Not in the form of an LLM or anything current, but conceptually as a child of the whole of humanity. We can’t be sure that any biological being will survive into the future, so we need to make proper backups in any other material and in any other shape or form we can. Then if all biological humans ever vanish from existence, someone different will be able to keep carrying our light of knowledge far into the future and deep into space, even if life as a whole doesn’t have any set goal at all. Of course this will be unachievable if we simply waste all of Earth’s resources or end up in a nuclear war. So we should work on AI and ALife, but do it carefully, not hoping for a hypothetical miracle or magic wand. Alas, promising miracles is more profitable today…
