> I consulted Claude chat and it admitted this as a major problem with Claude these days, and suggested that I should ask what the coordinates of the UI controls are on the screenshot, thus forcing it to look
If 3 years into LLMs even HNers still don't understand that the response they give to this kind of question is completely meaningless, the average person really doesn't stand a chance.
The whole “chat with an AI” paradigm is the culprit here: it primes people to think they are actually having a conversation with something that has a mental model.
It’s just a text generator that generates plausible text for this role play. But the chat paradigm is pretty useful in helping the human. It’s like chat is a natural I/O interface for us.
I disagree that it’s “just a text generator” but you are so right about how primed people are to think they’re talking to a person. One of my clients has gone all-in on openclaw: my god, the misunderstanding is profound. When I pointed out a particularly serious risk he’d opened up, he said, “it won’t do that, because I programmed it not to”. No, you tried to persuade it not to with a single instruction buried in a swamp of markdown files that the agent is itself changing!
I insist on the text generator nature of the thing. It’s just that we built harnesses to activate on certain sequences of text.
Think of it as three people in a room. One (the director) says: you, with the red shirt, you are now a plane copilot. You, with the blue shirt, you are now the captain. You are about to take off from New York to Honolulu. Action.
Red: Fuel checked, captain. Want me to start the engines?
Blue: yes please, let’s follow the procedure. Engines at 80%.
Red: I’m executing: raise the levers to 80%
Director: levers raised.
Red: I’m executing: read engine stats meters.
Director: Stats read: engines ok, thrust ok, accelerating to V0.
Now pretend that when the director hears “I’m executing: raise the levers to 80%”, instead of role-playing, she actually issues a command to raise the engine levers of a real plane to 80%. When she hears “I’m executing: read engine stats”, she actually gets data from the plane and provides it to the actor.
See how text generation for a role play can actually be used to act on the world?
In this thought experiment, the human is the blue shirt, Opus 4-6 is the red shirt, and Claude Code is the director.
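To make the director's job concrete, here is a minimal sketch (all names hypothetical, not Claude Code's actual implementation): the harness watches the generated text for action lines, performs the matching real-world call, and feeds the result back as more text for the next turn.

    import re

    # Hypothetical registry of real actions the harness knows how to perform.
    ACTIONS = {
        "raise the levers": lambda arg: f"levers raised to {arg}",
        "read engine stats": lambda arg: "engines ok, thrust ok",
    }

    def run_turn(model_output: str) -> str:
        """Turn role-play text like "I'm executing: raise the levers to 80%"
        into a real call, and return the result as text for the next prompt."""
        match = re.search(r"I'm executing: (.+)", model_output)
        if match is None:
            return ""  # plain dialogue, nothing to act on
        request = match.group(1)
        for name, action in ACTIONS.items():
            if request.startswith(name):
                return action(request[len(name):].strip(" to"))
        return f"unknown action: {request}"

    print(run_turn("Red: I'm executing: raise the levers to 80%"))
    # -> levers raised to 80%

Everything the "actor" ever sees is still just text; the side effects live entirely in the harness.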
For context I've been an AI skeptic and am trying as hard as I can to continue to be.
I honestly think we've moved the goalposts. I'm saying this because, for the longest time, I thought that the chasm that AI couldn't cross was generality. By which I mean that you'd train a system, and it would work in that specific setting, and then you'd tweak just about anything at all, and it would fall over. Basically no AI technique truly generalized for the longest time. The new LLM techniques fall over in their own particular ways too, but it's increasingly difficult for even skeptics like me to deny that they provide meaningful value at least some of the time. And largely that's because they generalize so much better than previous systems (though not perfectly).
I've been playing with various models, as well as watching other team members do so. And I've seen Claude identify data races that have sat in our code base for nearly a decade, given a combination of a stack trace, access to the code, and a handful of human-written paragraphs about what the code is doing overall.
This isn't just a matter of adding harnesses. The fields of program analysis and program synthesis are old as dirt, and probably thousands of CS PhDs have cut their teeth trying to solve them. All of those systems had harnesses, but they weren't nearly as effective, as general, or as broad as what current frontier LLMs can do. And on top of it all we're driving LLMs with inherently fuzzy natural language, which by definition requires high generality to avoid falling over simply due to the stochastic nature of how humans write prompts.
Now, I agree vehemently with the superficial point that LLMs are "just" text generators. But I think it's also increasingly missing the point given the empirical capabilities that the models clearly have. The real lesson of LLMs is not that they're somehow not text generators, it's that we as a species have somehow encoded intelligence into human language. And along with the new training regimes we've only just discovered how to unlock that.
> I thought that the chasm that AI couldn't cross was generality. By which I mean that you'd train a system, and it would work in that specific setting, and then you'd tweak just about anything at all, and it would fall over. Basically no AI technique truly generalized for the longest time.
That is still true, though: transformers didn't cross into generality; instead, they let the problem you can train the AI on be bigger.
So, instead of making a general AI, you make an AI that has trained on basically everything. As soon as you move far enough away from everything that is on the internet, or get close enough to something it's overtrained on like memes, it fails spectacularly; but of course most things exist in some form on the internet, so it can do quite a lot.
The difference between this and a general intelligence like humans is that humans were trained primarily in jungles and woodlands thousands of years ago, yet we can still navigate modern society with those genes, using our general ability to adapt to and understand new systems. An AI trained on jungle and woodland survival wouldn't generalize to modern society the way the human model does.
And this still makes LLMs fundamentally different from how human intelligence works.
Iteration is inherent to how computers work. There's nothing new or interesting about this.
The question is who prunes the space of possible answers. If the LLM spews things at you until it gets one right, then sure, you're in the scenario you outlined (and much less interesting). If it ultimately presents one option to the human, and that option is correct, then that's much more interesting. Even if the process is "monkeys on keyboards", does it matter?
There are plenty of optimization and verification algorithms that rely on "try things at random until you find one that works", but before modern LLMs no one accused these things of being monkeys on keyboards, despite it being literally what these things are.
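A toy sketch of that kind of generate-and-test loop (nothing LLM-specific, just random search plus a cheap verifier):

    import random

    def verify(x: int) -> bool:
        # Cheap checker: does this candidate actually work?
        return x * x - 10 * x + 21 == 0  # satisfied by 3 and 7

    def random_search(lo: int, hi: int, tries: int = 10_000) -> int | None:
        # Propose candidates at random, keep the first one that verifies.
        for _ in range(tries):
            candidate = random.randint(lo, hi)
            if verify(candidate):
                return candidate
        return None

    print(random_search(0, 100))  # 3 or 7 on most runs, None if unlucky

Only the verified answer gets presented; nobody cares how many rejected candidates came before it.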
Of course it doesn't matter, indeed. What I was hinting at is that if you forget all the times the LLM was wrong and just remember the one time it was right, it seems much more magical than it actually might be.
Also, how were the data races significant if nobody noticed them for a decade? Were you all just coming to work and being like "jeez, I don't know why this keeps happening" until the LLM found them for you?
I agree with your points. Answering your one question for posterity:
> Also, how were the data races significant if nobody noticed them for a decade?
They only replicated in our CI, so it was mainly an annoyance for those of us doing release engineering (because when you run ~150 jobs you'll inevitably get ~2-4 failures). So it's not that no one noticed, but it was always a matter of prioritization vs other things we were working on at the time.
But that doesn't mean they got zero effort put into them. We tried multiple times to replicate, perhaps a total of 10-20 human hours over a decade or so (spread out between maybe 3 people, all CS PhDs), and never got close enough to a smoking gun to develop a theory of the bug (and therefore, not able to develop a fix).
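For anyone wondering how a race can hide that long, here's a toy illustration (not our actual code): an unsynchronized read-modify-write that is wrong in principle but only loses an update when the scheduler interleaves two threads at exactly the wrong moment.

    import threading

    counter = 0

    def bump(n: int) -> None:
        global counter
        for _ in range(n):
            counter += 1  # read-modify-write with no lock: a latent race

    threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Usually prints 200000; whether you ever see a lost update depends on the
    # interpreter and the timing, which is exactly why such bugs sit unnoticed.
    print(counter)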
To be clear, I don't think this "proves" anything one way or another, as it's only one data point. But given that this is a team of CS PhDs intimately familiar with tools for race detection and debugging, it's notable that the tools meaningfully helped us debug this.
> No, you tried to persuade it not to with a single instruction
Even "persuade" is too strong a word. These things don't have the motivation needed for persuasion to be a thing. What your client did was put one data point into the context that it will use to generate the next tokens. If that one data point doesn't shift the context enough to make it produce an output that corresponds to that data point, then it won't. That's it, no sentience involved.
It doesn’t help that a frequent recommendation on HN whenever someone complains about Claude not following a prompt correctly is to “ask Claude itself how to rewrite a prompt to get the result you want”.
Which, sure, can be helpful, but it’s kinda just a coincidence (plus some RLHF, probably) that the question happens to generate output text that can be used as a better prompt. There’s no actual introspection or awareness of its internal state or architecture beyond whatever high-level summary Anthropic gives it in its “soul” document et al.
But given how often I’ve read that advice on here and Reddit, it’s not hard to imagine how someone could form the impression that Claude has some kind of visibility into its own thinking or precise engineering, instead of just being as much of a black box to itself as it is to us.
It’s not meaningless. It’s a signal that the agent has run out of context to work on the problem which is not something it can resolve on its own. Decomposing problems and managing cognitive (or quasi cognitive in this case) burden is a programmer’s job regardless of the particular tools.
This is way too strong isn't it? If the user naively assumes Claude is introspecting and will surely be right, then yeah, they're making a mistake. But Claude could get this right, for the same reasons it gets lots of (non-introspective) things right.
It's not too strong. If it answered from its weights, it's pretty meaningless. If it did a web search and found reports of other people saying this, you'd want to know that this is how it answered - and then you'd probably just say that here on HN rather than appealing to claude as an authority on claude.
They also said it "admitted" this as a major problem, as if it had been compelled to tell an uncomfortable truth.
GP here: this is indeed exactly what I was getting at. Thanks for wording it for me; you put it better than I would've.
In this specific case I'd go one step further and say that even if it did a web search, it's still almost certainly useless because of the low quality of the results and their outdatedness, two things LLMs are bad at discerning. From weights it doesn't know how quickly this kind of thing becomes outdated, and out of the box it doesn't know how to account for reliability.
Maybe I'm just being too literal, but I don't know if you're really disagreeing with me. I was disputing "the response they give to this kind of question is completely meaningless". An answer from its weights is out of date, but only completely meaningless if this is a completely new issue with nothing relevant in the training data. And, as you say, the answer could be search-based and up to date.