More

armcat · 2026-04-10T19:49:19 1775850559

As someone who's been working in legaltech space where MS Word add-in chatbot was a killer feature, this is brutal. And in their demo they are hammering on the legal case (redline chat).

_zara_arsson_ · 2026-04-10T19:54:40 1775850880

Doesn't this cut right into Legora's business?

armcat · 2026-04-04T15:42:42 1775317362

I still find it incredible at the power that was unleashed by surrounding an LLM with a simple state machine, and giving it access to bash

Yokohiii · 2026-04-04T16:50:42 1775321442

That is why I am currently looking into building my own simple, heavily isolated coding agent. The bloat is already scary, but the bad decisions should make everyone shiver. Ten years ago people would rant endlessly about things with more then one edge, that requires a glimpse of responsibility to use. Now everyone seems to be either in panic or hype mode, ignoring all good advice just to stay somehow relevant in a chaotic timeline.

HarHarVeryFunny · 2026-04-04T16:36:55 1775320615

At it's heart it's prompt/context engineering. The model has a lot of knowledge baked into it, but how do you get it out (and make it actionable for a semi-autonomous agent)? ... you craft the context to guide generation and maintain state (still interacting with a stateless LLM), and provide (as part of context) skills/tools to "narrow" model output into tool calls to inspect and modify the code base.

I suspect that more could be done in terms of translating semi-naive user requests into the steps that a senior developer would take to enact them, maybe including the tools needed to do so.

It's interesting that the author believes that the best open source models may already be good enough to complete with the best closed source ones with an optimized agent and maybe a bit of fine tuning. I guess the bar isn't really being able to match the SOTA model, but being close to competent human level - it's a fixed bar, not a moving one. Adding more developer expertise by having the agent translate/augment the users request/intent into execution steps would certainly seem to have potential to lower the bar of what the model needs to be capable of one-shotting from the raw prompt.

stanleykm · 2026-04-04T16:24:02 1775319842

unfortunately all the agent cli makers have decided that simply giving it access to bash is not enough. instead we need to jam every possible functionality we can imagine into a javascript “TUI”.

HarHarVeryFunny · 2026-04-04T17:12:38 1775322758

If all you want is a program that calls the model in a loop and offers a bash tool, then ask Claude Code to build that. You won't like it though!

For a preview of what it'd be like, just tell your AI chat app that you'll run bash commands for it, and please change the app in your "current directory" to "sort the output before printing it", or some such request.

senko · 2026-04-04T17:45:56 1775324756

Claude Code with Opus 4.6 regularly uses sed for multi-line edits, in my experience. On top of it, Pi is famously only exposing 4 tools, which is not just Bash, but far more constrained than CCs 57 or so tools.

So, yes, it can work.

HarHarVeryFunny · 2026-04-04T18:35:10 1775327710

I think the problem/limitation would be as much due to context management as tools. Obviously bash plus a few utilities is sufficient to explore/edit the code base, but I can't imagine this working reliably without the models being specifically trained to use specific tools, and recognize/adapt to different versions of them etc.

Context management, both within and across sessions, seems the bigger issue. Without the agent supporting this, you are at the mercy of the model compacting/purging the context as needed, in some generic fashion, as well as being smart enough to decide to create notes for itself tracking what it is doing, etc.

Apparently CC is 512K LOC, which seems massively bloated, but I do think that things like tools, skills, context management and subagents are all needed to effectively manage context and avoid the issues that might be anticipated by just telling the model it's got a bash tool, and go figure.

stanleykm · 2026-04-04T20:11:36 1775333496

You don’t really need most of that stuff. Have sensible steering files. Have the agent keep state itself. Dont bother compacting. Its fine.

HarHarVeryFunny · 2026-04-04T18:40:23 1775328023

I thought CC only supports it's find/replace edit tool (implemented by CC itself, using Node.js for file access), and is platform agnostic. Are you saying that on linux CC offers "sed" as a tool too? I can't imagine it offers "bash" since that's way too dangerous.

senko · 2026-04-04T20:04:37 1775333077

Yes, Claude Code has a Bash tool, and Claude in some cases uses the CLI sed utility (via the Bash tool) for file changes (although it has built-in file update), at least on my Linux machine.

HarHarVeryFunny · 2026-04-04T20:12:17 1775333537

Interesting - thanks.

I just asked Claude, and apparently CC makes it's bash tool available on all platforms it runs on (Linux, macOS, Windows WSL, Git for Windows), and doesn't do platform-specifc filtering of bash commands, which would seem to make for some interesting incompatibilities - GNU utils (sed, grep, find) on Linux and Windows, but BSD variants on macOS.

girvo · 2026-04-04T21:50:21 1775339421

Claude code will semi-regularly try to use GNU utils on my Mac

Yokohiii · 2026-04-04T17:41:35 1775324495

I think you get him wrong? He is already concerned about "bash on steroids" and current tools add concerning amounts of steroids to everything.

slopinthebag · 2026-04-04T18:26:54 1775327214

Claude Code gets smoked on benchmarks by an agent that has a single tool: tmux. So I think they might actually like that quite a bit.

HarHarVeryFunny · 2026-04-04T19:19:53 1775330393

What benchmarks are you referring to?

girvo · 2026-04-04T21:49:07 1775339347

> If all you want is a program that calls the model in a loop and offers a bash tool, then ask Claude Code to build that. You won't like it though!

Okay sure it’s technically more than just bash, but my own for-fun coding agent and pi-coding-agent work this way. The latter is quite useful. You can get surprisingly far with it.

stanleykm · 2026-04-04T17:44:41 1775324681

i did.. and thats what i use. obviously its a little more than just a tool that calls bash but it is considerably less than whatever they are doing in coding agents now.

emp17344 · 2026-04-04T19:25:13 1775330713

If you saw the Claude Code leak, you’d know the harness is anything but simple. It’s a sprawling, labyrinthine mess, but it’s required to make LLMs somewhat deterministic and useful as tools.

girvo · 2026-04-04T21:47:49 1775339269

That’s also because of how Claude Code was written. It doesn’t have to be that way per se.

xstas1 · 2026-04-04T19:59:27 1775332767

Hypothesis: it's a sprawling, labyrinthine mess because it was grown at high speed using Claude Code.

emp17344 · 2026-04-04T20:05:14 1775333114

There’s a lot of redundancy, because there has to be to make the system useful. It’s a hacked together mess.

efromvt · 2026-04-04T21:15:51 1775337351

It's pretty easy to get determinism with a simple harness for a well-defined set of tasks with the recent models that are post-trained for tool use. CC probably gets some bloat because it tries to do a LOT more; and some bloat because it's grown organically.

emp17344 · 2026-04-04T21:25:40 1775337940

>It's pretty easy to get determinism with a simple harness for a well-defined set of tasks with the recent models that are post-trained for tool use.

Do you have a source? Claude Code is the only genetic system that seems to really work well enough to be useful, and it’s equipped with an absolutely absurd amount of testing and redundancy to make it useful.

efromvt · 2026-04-06T03:26:18 1775445978

Should I read that as 'generic system'? Most hard data is with company internal evals, but for the well defined tasks externally it's been pretty easy to spin up a basic tool loop and validate. Did you have something in mind? [I don't necessarily count 'coding' as well-defined in the generic sense, so I suspect we're coming at this from different scopes re: the definition of 'LLMs somewhat deterministic and useful as tools']

alfiedotwtf · 2026-04-05T03:12:37 1775358757

I found replacing bash with python to be more useful… that way, it can craft whatever it desires without having to pipe a billion pieces of gum together

esafak · 2026-04-04T15:59:10 1775318350

Tools gave humans the edge over other animals.

Yokohiii · 2026-04-04T17:43:36 1775324616

And those tools regularly burnt cities to ashes. Took a long time to get it under control.

y0eswddl · 2026-04-04T18:34:05 1775327645

*burn - I'm not sure we've gotten that under control quite yet

armcat · 2026-03-28T11:14:46 1774696486

Not on the same extreme level, but I know that some coffee machines use a tiny CNN based model locally/embedded. There is a small super cheap camera integrated in the coffee machine, and the model does three things: (1) classifies the container type in order to select type of coffee, (2) image segmentation - to determine where the cup/hole is placed, (3) regression - to determine the volume and regulate how much coffee to pour.

killingtime74 · 2026-03-29T01:28:54 1774747734

Very cool, expensive machines?

armcat · 2026-03-25T16:53:08 1774457588

This is beautifully written and visualised, well done! The KL divergence comparisons between original and different quantisation levels is on-point. I'm not sure people realize how powerful quantisation methods are and what they've done for democratising local AI. And there are some great players out there like Unsloth and Pruna.

samwho · 2026-03-25T17:20:51 1774459251

Thank you! I was really surprised how robust models are to losing information. It seems wrong that they can be compressed so much and still function at all, never mind function quite closely to the original size.

Think we're only going to keep seeing more progress in this area on the research side, too.

buildbot · 2026-03-25T18:06:52 1774462012

You can even train in 4 & 8 bits with newer microscaled formats! From https://arxiv.org/pdf/2310.10537 to gpt-oss being trained (partially) natively in MXFP4 - https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-...

To Nemotron 3 Super, which had 25T of nvfp4 native pretraining! https://docs.nvidia.com/nemotron/0.1.0/nemotron/super3/pretr...

naasking · 2026-03-26T14:32:59 1774535579

Newer quantization approaches are even better, 4-bits gets you no meaningful loss relative to FP16: https://github.com/z-lab/paroquant

Hopefully Microsoft keeps pushing BitNet too, so only "1.58" bits are needed.

I think fractional representations are only relevant for training at this point, and bf16 is sufficient, no need for fp4 and such.

buildbot · 2026-03-26T15:28:34 1774538914

Learned rotations for INT4 are cool! Seems similar to SpinQuant? https://arxiv.org/abs/2405.16406

In my personal opinion I don’t think the 1.58 bit work is going to make it into the mainstream.

Not sure why you think fractional representations are only useful for training? Being able to natively compute in lower precisions can be a huge performance boost at inference time.

naasking · 2026-03-26T16:47:50 1774543670

> Learned rotations for INT4 are cool! Seems similar to SpinQuant? https://arxiv.org/abs/2405.16406

Indeed, but much better! More accurate, less time and space overhead, beats AWQ on almost every bench. I hope it becomes the standard.

> In my personal opinion I don’t think the 1.58 bit work is going to make it into the mainstream.

I hope you're wrong! I'm more optimistic. Definitely a bit more work to be done, but still very promising.

> Being able to natively compute in lower precisions can be a huge performance boost at inference time.

ParoQuant is barely worse than FP16. Any less precise fractional representation is going to be worse than just using that IMO.

armcat · 2026-03-19T18:42:04 1773945724

This is awesome, well done. Been doing lot of work with voice assistants, if you can replicate voice cloning Qwen3-TTS into this small factor, you will be absolute legends!

rohan_joshi · 2026-03-19T18:51:21 1773946281

thanks a lot, our voice cloning model will be out by May. we're experimenting w some very cool ways of doing voice cloning at 15M but will have a range of models going upto 500M

armcat · 2026-03-19T20:01:48 1773950508

That's sick, looking forward to it! You have my email in the profile, please let me know when you do!

armcat · 2026-03-12T14:29:18 1773325758

Great work! Kind of reminds me of ell (https://github.com/MadcowD/ell), which had this concept of treating prompts as small individual programs and you can pipe them together. Not sure if that particular tool is being maintained anymore, but your Axe tool caters to that audience of small short-lived composable AI agents.

jrswab · 2026-03-12T15:06:58 1773328018

Thanks for checking it out! And yes the tool is indeed catering to that crowed. It's a need I have and thought others could use it as well.

armcat · 2026-03-05T09:23:28 1772702608

I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my blackwell device - I will try this on Mac now as well.

There are a few caveats here, for those of you venturing in this, since I've spent considerable time looking at these voice agents. First is that a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).

Another aspect, after talking to peeps on PersonaPlex, is that this full duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite diffiult to train. On the other hand ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API based endpoints.

nowittyusername · 2026-03-05T10:05:29 1772705129

I've been working on building my own voice agent as well for a while and would love to talk to you and swap notes if you have the time. I have many things id like to discuss, but mainly right now im trying to figure out how a full duplex pipeline like this could fit in to an agentic framework. Ive had no issues with the traditional route of stt > llm > tts pipeline as that naturally lends itself with any agentic behavior like tool use, advanced context managemnt systems, rag , etc... I separate the human facing agent from the subagent to reduce latency and context bloat and it works well. While I am happy with the current pipeline I do always keep an eye out for full duplex solutions as they look interesting and feel more dynamic naturally because of the architecture, but every time i visit them i cant wrap my head how you would even begin to implement that as part of a voice agent. I mean sure you have text input and output channels in some of these things but even then with its own context limitations feels like they could never bee anything then a fancy mouthpiece. But this feels like im possibly looking at this from ignorance. anyways would love to talk on discord with a like minded fella. cheers.

ilaksh · 2026-03-05T11:43:57 1772711037

For my framework, since I am using it for outgoing calls, what I am thinking maybe is I will add a tool command call_full_duplex(number, persona_name) that will get personaplex warmed up and connected and then pause the streams, then connect the SIP and attach the IO audio streams to the call and return to the agent. Then send the deepgram and personaplex text in as messages during the conversation and tell it to call a hangup() command when personaplex says goodbye or gets off track, otherwise just wait(). It could also use speak() commands to take over with TTS if necessary maybe with a shutup() command first. Need a very fast and smart model for the agent monitoring the call.

pettyjohn · 2026-03-05T16:48:33 1772729313

+1

what's your use case and what specific LLMs are you using?

I'm using stt > post-trained models > tts for the education tool I'm building, but full STS would be the end-game. e-mail and discord username are in my profile if you want to connect!

nowittyusername · 2026-03-05T20:43:24 1772743404

sent!

armcat · 2026-03-05T19:41:01 1772739661

Sure, feel free to reach out, just check my profile!

scotty79 · 2026-03-05T15:13:07 1772723587

I got PersonaPlex to run on my laptop (a beefy one) just by following the step by step instruction on their github repo.

The uncanny thing is that it reacts to speech faster than a person would. It doesn't say useful stuff and there's no clear path to plugging it into smarter models, but it's worth experiencing.

_magiic_kards · 2026-03-05T18:28:42 1772735322

+1 on this pipeline! You can use a super small model to perform an immediate response and a structured output that pipes into a tool call (which may be a call to a "more intelligent" model) or initiates skill execution. Having this async function with a fast response (TTS) to the user + tool call simultaneously is awesome.

andreadev · 2026-03-05T19:33:24 1772739204

The framing in this thread is full-duplex vs composable pipeline, but I think the real architecture is both running simultaneously — and this library is already halfway there.

The fact that qwen3-asr-swift bundles ASR, TTS, and PersonaPlex in one Swift package means you already have all the pieces. PersonaPlex handles the "mouth" — low-latency backchanneling, natural turn-taking, filler responses at RTF 0.87. Meanwhile a separate LLM with tool calling operates as the "brain", and when it returns a result you can fall back to the ASR+LLM+TTS path for the factual answer. taf2's fork (running a parallel LLM to infer when to call tools) already demonstrates this pattern. It's basically how humans work — we say "hmm, let me think about that" while our brain is actually retrieving the answer. We don't go silent for 2 seconds.

The hard unsolved part is the orchestration between the two. When does the brain override the mouth? How do you prevent PersonaPlex from confidently answering something the reasoning model hasn't verified? How do you handle the moment a tool result contradicts what the fast model already started saying?

AlexeyBelov · 2026-03-06T06:36:36 1772778996

LLM slop.

cpill · 2026-03-06T20:09:59 1772827799

Don't be so hard on yourself :P

andreadev · 2026-03-06T08:08:08 1772784488

Which part specifically?

AlexeyBelov · 2026-03-07T05:01:55 1772859715

The part where it's in all your comments.

andreadev · 2026-03-07T05:31:35 1772861495

You are wrong but I am not going to keep going back and forth.

robotswantdata · 2026-03-05T17:03:04 1772730184

+ 1 , agree still prefer composable pipeline architecture for voice agents. The flexibility on switching LLM for cost optimisation or quality is great for scaled use cases.

biomcgary · 2026-03-05T17:20:52 1772731252

Do you know if any of these multi-stage approaches can run on an 8gb M1 Air?

armcat · 2026-03-05T19:40:11 1772739611

They should! If you take Parakeet (ASR), add Qwen 3.5 0.8B (LLM) and Kokoro 82M (TTS), that's about 1.2G + 1.6G + 164M so ~3.5GB (with overhead) on FP16. If you use INT8 or 4-bit versions then are getting down to 1.5-2GB RAM.

And you can always for example swap out the LLM for GPT-5 or Claude.

armcat · 2026-03-02T22:56:29 1772492189

This is an outstanding write up, thank you! Regarding LLM latency, OpenAI introduced web sockets in their Responses client recently so it should be a bit faster. An alternative is to have a super small LLM running locally on your device. I built my own pipeline fully local and it was sub second RTT, with no streaming nor optimisations https://github.com/acatovic/ova

nicktikhonov · 2026-03-02T22:58:13 1772492293

Very cool! starred and on my reading list. Would love to chat and share notes, if you'd like

alfalfasprout · 2026-03-03T00:39:59 1772498399

Also consider using Cerebras' inference APIs. They released a voice demo a while back and the latency of their model inference is insane.

ilaksh · 2026-03-03T07:06:38 1772521598

I tried to use Cerebras and it was unbeatable at first, but the client didn't want to pay $1300 a month and the $50/month or pay as you go was just not reliable. It would give service unavailable errors or falsely claim we were over our rate limit.

Also Groq is very fast, but the latency wasn't always consistent and I saw some very strange responses on a few calls that I had to attribute to quantization.

riquito · 2026-03-03T06:05:40 1772517940

You may be interested in gemini-2.5-flash-preview-tts

Text in, audio out, so you can merge in a single step LLM+TTS (streamable)

https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flas...

armcat · 2026-03-02T09:43:46 1772444626

I've used itch.io before, it's great! I even made a game using some free assets and AI: https://github.com/acatovic/gothicvania-codex-demo

armcat · 2026-02-24T23:26:01 1771975561

This is awesome, well done guys, I’m gonna try it as my ASR component on the local voice assistant I’ve been building https://github.com/acatovic/ova. The tiny streaming latencies you show look insane