samusiam · 2026-04-02T12:03:54 1775131434

These OSS model makers need to stop benchmarking against old models. Showing how a model performs against Opus 4.5 and GLM-5 when we have Opus 4.6 and GLM-5.1 just tells me that it's not comparable to SOTA.

samusiam · 2026-04-01T23:48:01 1775087281

I think there's a lot of methodological expertise that goes into collecting good eval data. For example, in many cases you need human labelers with the right expertise, well-designed tasks, well-defined constructs, and you need to hit inter-rater agreement targets and troubleshoot when you don't. Good label data is a prerequisite to the stuff that can probably be automated by the AI agent (improving the system to optimize a metric measured against ground truth labels). Data scientists and research scientists are more likely to have this skillset. And it takes time to pick up and learn the nuances.
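
To make the inter-rater agreement point concrete, here is a minimal sketch of Cohen's kappa for two labelers, which is one common agreement statistic. The function name, the example labels, and the 0.6 target are illustrative assumptions, not something from a specific project.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each labeler's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both labelers used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on four items.
rater_1 = ["pos", "pos", "neg", "neg"]
rater_2 = ["pos", "neg", "neg", "neg"]

kappa = cohens_kappa(rater_1, rater_2)
TARGET = 0.6  # assumed threshold; the right target depends on the task
print(f"kappa={kappa:.2f}, meets target: {kappa >= TARGET}")
```

When kappa comes in under the target, that is where the troubleshooting starts: unclear label definitions, ambiguous items, or labelers who need more training.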

samusiam · 2026-04-01T12:34:10 1775046850

I just checked competitors' codebases:

- Opencode (anomalyco/opencode) is about 670k LOC

- Codex (openai/codex) is about 720k LOC

- Gemini (google-gemini/gemini-cli) is about 570k LOC

Claude Code's 500k LOC doesn't seem out of the ordinary.

lelanthran · 2026-04-01T13:45:22 1775051122

> Claude Code's 500k LOC doesn't seem out of the ordinary.

Aren't all the other products also vibe-coded? "All vibe-coded products look like this" doesn't really seem to answer the question "Why is it so damn large?"

It's a REPL that calls out to a black box/endpoint for data, and does basic parsing and matching of state with specific actions.

I feel the bulk of those lines should be actions that are performed. Either this is correct or this is not:

1. If the bulk of those lines implement specific and simple actions, why is it so large compared to other software that implements single actions (coreutils, etc.)?

2. If the actions constitute only a small part of the codebase, wtf is the rest of it doing?

samusiam · 2026-04-01T13:55:34 1775051734

You're complaining about vibe coding while also complaining about how you "feel" about the code. Do you see the irony in that?

lelanthran · 2026-04-01T14:00:59 1775052059

>> I feel the bulk of those lines should be actions that are performed. Either this is correct or this is not:

> You're complaining about vibe coding while also complaining about how you "feel" about the code. Do you see the irony in that?

Where did I complain about how I feel about the actual code? I have feelings, negative ones, about the size of the code given the simple functionality it has, but I have no feelings on the code because I did not look at the code.

arandomhuman · 2026-04-01T14:17:50 1775053070

Are you ESL by any chance? You’re missing the forest for the trees.

johnisgood · 2026-04-01T12:55:14 1775048114

All of them are really, REALLY bad.

surajrmal · 2026-04-01T13:32:59 1775050379

Bad by whose definition? They work really well in my experience. They aren't perfect but the amount of hand holding has gone down dramatically and you can fix any glaring problems with a code review at the end. I work on a multimillion line code base which does not use any popular frameworks and it does a great job. I may be benefiting from the fact that the codebase is open source and all models have obviously been trained on it.

bdhtu · 2026-04-01T14:30:23 1775053823

It takes 10 seconds for Gemini CLI to load. 10 seconds to show an input field. This is for a CLI program.

For comparison, it takes me less time to load Chrome and go to gemini.google.com.

causal · 2026-04-01T14:40:42 1775054442

> They work really well in my experience.

Yeahhh strong disagree there, I find Codex and CC to be buggy as hell. Desktop CC is very bad and web version is nigh unusable.

oblio · 2026-04-01T13:45:08 1775051108

At least Gemini and Claude constantly break down with scrolling in various Linux terminals, something which was solved by countless TUIs decades ago.

I think a lot of the people praising Claude & co are on Macs.

johnisgood · 2026-04-01T13:46:13 1775051173

Most of their issues have been solved a long time ago, with 1000x less code. It is depressing at this point. I really had no clue IT was in the shitters this much. I knew it was theatrical but I had no idea that it was by this much.

geodel · 2026-04-01T14:53:32 1775055212

All these AI tool teams have the most valid excuse: "We are just a bunch of people who only know JavaScript/TypeScript/Node.js. Please bear with us while we resolve 10,000 open issues."

samusiam · 2026-04-01T13:53:32 1775051612

I haven't seen the scrolling glitch in months, where previously it was happening multiple times a day. Also haven't seen anyone complain about it in quite some time. Pretty sure they have resolved that.

msully4321 · 2026-04-01T15:27:37 1775057257

They have not! If I am scrolled up while more output is produced, the scrollback jumps to the top pretty consistently.

oblio · 2026-04-01T16:55:30 1775062530

I'll try again but lately I've been using strictly the VS Code terminal. Gnome Terminal and Termux in Ubuntu 24.04 were unusable even with 1000 hacks.

causal · 2026-04-01T14:41:40 1775054500

I'm on a mac! And I still find bugs on a regular basis...

samusiam · 2026-04-01T12:27:53 1775046473

AI witch-hunters are even more annoying.

WarmWash · 2026-04-01T13:34:41 1775050481

Seriously, people are becoming deranged.

Drop an em dash or a bullet point and they go into spasms.

samusiam · 2026-04-01T11:42:56 1775043776

> I think the only reason it’s seen as good anywhere is there are a lot of tasteless and talentless people who can pretend they created whatever was curled out. This goes for code as well.

This is an oversimplification.

If you have taste and talent, then the LLM output you get is going to reflect that.

So on the one hand, yes: tasteless and talentless people won't know good output from bad output. On the other hand, people with taste and talent can actually get good output.

dgxyz · 2026-04-01T14:11:38 1775052698

No it’s not. That’s total rubbish.

You can’t coerce quality creative writing out of it however you attempt to gaslight it into doing so.

samusiam · 2026-04-01T16:51:57 1775062317

Well you're free to disagree but my experience has been counter to your position. I write both code and research / technical documentation. The quality of what the LLM produces is limited by the quality of ideas I give it initially (mind you, this is just a starting point), and the quality of my review of its output.

samusiam · 2026-03-30T22:50:55 1774911055

"by design, the recommendations will be average"

This couldn't be more wrong. The simplest refutation is just to point out that there are temperature and top-k settings, which, by design, let the model sample tokens (and by extension, ideas) that are less probable given the inputs.
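
Here is a rough sketch of what those two knobs do, using only the standard library. The logits, token names, and function names are made up for illustration, and real inference stacks implement this differently. Temperature rescales the logits before the softmax, and top-k keeps only the k highest-scoring tokens before sampling, so the less probable picks come from sampling within that set instead of always taking the single most likely token.

```python
import math
import random


def token_probs(logits, temperature=1.0, top_k=None):
    """Return sampling probabilities after temperature scaling and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank tokens from most to least likely, then keep the top k if requested.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    # Softmax over the scaled logits (subtract the max for numerical stability).
    scaled = [value / temperature for _, value in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {token: w / total for (token, _), w in zip(ranked, weights)}


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token according to the adjusted distribution."""
    probs = token_probs(logits, temperature, top_k)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token logits.
logits = {"the": 3.0, "a": 2.0, "zebra": 0.0}

cool = token_probs(logits, temperature=1.0)
warm = token_probs(logits, temperature=2.0)
print(f"P(zebra) at T=1.0: {cool['zebra']:.3f}")
print(f"P(zebra) at T=2.0: {warm['zebra']:.3f}")
print("sampled:", sample_token(logits, temperature=2.0, top_k=2))
```

Raising the temperature moves probability mass from the most likely token toward the unlikely ones, which is the opposite of always returning the average answer.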

samusiam · 2026-03-26T12:12:14 1774527134

"they don't output anything unless prompted"

Unprompted they're not unlike a human sleeping or in a coma. Those states don't preclude consciousness in other states.

olalonde · 2026-03-26T13:24:11 1774531451

That's beside the point though.

samusiam · 2026-03-17T12:02:25 1773748945

Vegan for 15 years. I cook 95% of my own meals, including black bean burgers, tofu, etc. Sometimes I want something that tastes like meat and I reach for a Beyond or Impossible burger. I don't need it. But I can't recreate its texture and flavor profile on my own. It's not "better" than other things I can cook. It's just different.

samusiam · 2026-03-04T11:03:02 1772622182

I can recall reading human-authored text like this for more than a decade.

samusiam · 2026-03-03T11:09:34 1772536174

But literally any decent agent can recommend existing services and help you set them up. And even help you help them set the services up for you. I do this with Claude all the time.