
For the specific issue parent is talking about, you really need to give various tools a try yourself, and if you're getting really shit results, assume it's the implementation that is wrong, and either find an existing bug tracker issue or create a new one.

Same thing happened when GPT-OSS launched: a bunch of projects had "day-1" support, but in reality that just meant you could load the model; many of them had broken tool calling, some chat prompt templates were broken, and so on. Even llama.cpp, which usually has the most recent support (in my experience), had this issue, and it wasn't until a week or two after the release that GPT-OSS could be fairly evaluated with it. Then Ollama/LM Studio update their bundled llama.cpp some days after that.

So it's a process thing, not "this software is better than that", and it heavily depends on the model.


After spending the past few weeks playing with different backends and models, I just can’t believe how buggy most models are.

It seems to me that most model providers are not running/testing via the most used backends, e.g. llama.cpp, Ollama, etc., because if they were, they would see how broken their release is.

Tool calling is the Achilles' heel: most models will fail unless you either modify the system prompts or run through a proxy so you can inject/munge the request/reply (rough sketch at the end of this comment).

Like seriously… how many billions and billions (actually we saw one >800 billion valuation last week, so almost a whole trillion) go into AI development, and yet 99.999% of all models from the big names do not work straight out of the box with the most common backends. Blows my mind!
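
To make the proxy point concrete, here's roughly what I mean: a tiny shim that sits between the client and an OpenAI-compatible backend and patches the system prompt before forwarding. Just a sketch; the backend URL/port and the injected wording are made up for illustration, and it only munges the request side, not the reply:

    # Minimal "munging proxy" sketch: patch the system prompt, then forward the
    # request to an OpenAI-compatible backend (the URL below is hypothetical).
    import json
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BACKEND = "http://localhost:8080/v1/chat/completions"  # e.g. a local llama.cpp server
    TOOL_HINT = "When a tool is needed, reply only with a JSON tool call."  # assumed wording

    class MungingProxy(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))

            # Inject or extend the system message so the model emits usable tool calls.
            messages = payload.get("messages", [])
            if messages and messages[0].get("role") == "system":
                messages[0]["content"] += "\n" + TOOL_HINT
            else:
                messages.insert(0, {"role": "system", "content": TOOL_HINT})
            payload["messages"] = messages

            # Forward the patched request and relay the backend's reply untouched.
            req = urllib.request.Request(
                BACKEND,
                data=json.dumps(payload).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                body = resp.read()

            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("127.0.0.1", 9000), MungingProxy).serve_forever()

Point your client at 127.0.0.1:9000 instead of the backend and you can patch whatever a given model needs without touching the backend itself.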


> It seems to me that most model providers are not running/testing via the most used backends, e.g. llama.cpp, Ollama, etc., because if they were, they would see how broken their release is.

The models usually run fine on the server targeted backends they’re released for.

Those projects you cited are more niche. They each implement their own ways of doing things.

It’s not the responsibility of model providers to implement and debug every different backend out there before they release their model. They release the model and usually a reference way of running it.

The individual projects that do things differently are responsible for making their projects work properly.

Don’t blame the open weight model teams when unrelated projects have bugs!


Just since I'm curious, what exact models and quantization are you using? In my own experience, anything smaller than ~32B is basically useless, and any quantization below Q8 absolutely trashes the models.

Sure, you could make use of a ~20B model if you fine-tune it and have a very narrow use-case, but at that point there are usually better solutions than LLMs in the first place. For something general, 32B+ at Q8 is probably the bare minimum for local models, even the "SOTA" ones available today.


I haven’t tried any Qwen yet, but so far I’m sticking with gpt-oss-20B.

In terms of what I’m using, I’ve looked at anything that will fit on a MacBook Pro with 32GB RAM (so with shared memory) - LFM2, Llama, Mistral, Ministral, Devstral, Phi, and Nemotron.

As for quantisation, I aim for the biggest that will fit while also not being too slow - so it all depends on the model. But I’ll skip a model if I can’t at least use a Q4_K_M.
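
To put numbers on "the biggest that will fit", a rough back-of-the-envelope sketch (the bits-per-weight figures are approximate community numbers for GGUF quants, and the 32B size is just an example):

    # Rough size estimate for GGUF-quantised weights; bits-per-weight values are approximate.
    BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.85}

    def model_gib(params_billion, quant):
        """Approximate in-memory size of the weights alone, in GiB."""
        return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 2**30

    for quant in BITS_PER_WEIGHT:
        print(f"32B @ {quant}: ~{model_gib(32, quant):.1f} GiB")
    # 32B @ Q8_0   : ~31.7 GiB -> leaves no room for anything else in 32GB
    # 32B @ Q4_K_M : ~18.1 GiB -> workable, with space left for context and the OS

Which is basically why Q4_K_M ends up being the floor on a 32GB machine.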

Also, I bump my context to at least 32K, because tool calling sucks when the tool definitions themselves come close to filling a 4096-token context!
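
The context bump isn't free either; here's a rough idea of what the KV cache alone costs at 32K (the layer/head numbers are purely illustrative, not any particular model's, and assume an unquantised fp16 cache):

    # Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes/element.
    # The architecture numbers below are illustrative only, not a specific model's.
    def kv_cache_gib(layers, kv_heads, head_dim, ctx, bytes_per_elem=2):  # fp16 cache
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 2**30

    for ctx in (4096, 32768):
        print(f"ctx={ctx}: ~{kv_cache_gib(40, 8, 128, ctx):.2f} GiB")
    # ctx=4096  : ~0.63 GiB
    # ctx=32768 : ~5.00 GiB

So on a 32GB machine the quant, the context, and everything else are all fighting over the same pool.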

I can’t wait for RAM prices to come down!



