I tried Emacs, but realised I need NixOS to get the packages I depend on, like git, to download my config; I can't use stock Emacs. There's a trick to get Emacs and Termux to share packages, but not for nix-on-droid :/
I wish they had a revenue goal to release openly; that way, spending money on them would contribute to better open models in the long run.
This is how I think the public can fund and eventually get free stuff, just like with properly organized private highways: the state/society ends up owning a new highway once the private entity that built it has made the profit it required to make the project possible.
> Have you ever run GNU Parallel on a powerful machine just to find one core pegged at 100% while the rest sit mostly idle?
Not exactly, but maybe I haven't used large enough NUMA machines to run tiny jobs?
I think usually parallel saturates my CPU and I'd guess most CPU schedulers are NUMA-aware at this point.
If you care about short tasks, maybe parallel is the wrong tool; but if picking the next task to run is the slow part AND you prefer throughput over latency, maybe you need batching instead of a faster job-scheduling tool.
I'm pretty sure parallel has some flags that allow batching up to K elements, so maybe your process can take several inputs at once. Alternatively, you can bundle inputs as you generate them, but that might require a larger change to both the process that runs tasks and the one that generates the inputs for them.
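For reference, GNU parallel's batching flags do exist: `-N n` hands each job n arguments at once, and `-m`/`-X` pack as many arguments as fit on one command line. The same batching idea can be sketched with plain `xargs` (the `got:` label is just for illustration):

```shell
# Batch inputs 3 at a time so each fork processes several lines,
# analogous to GNU parallel's -N3:
printf '%s\n' 1 2 3 4 5 6 | xargs -n 3 echo got:
# got: 1 2 3
# got: 4 5 6
```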
parallel works fine so long as the time per job is on the order of seconds or longer.
Let me give you an example of a worst-case scenario for parallel. Start by making a file on a tmpfs with 10 million newlines:
yes $'\n' | head -n 10000000 > /tmp/f1
So now let's see how long it takes parallel to push all these lines through a no-op. This measures the pure overhead of distributing 10 million lines in batches. I'll set it to use all my CPU cores (`-j $(nproc)`) and to use multiple lines per batch (`-m`).
time { parallel -j $(nproc) -m : </tmp/f1; }
real 2m51.062s
user 2m52.191s
sys 0m6.800s
Average CPU utilization here (on my 14c/28t i9-7940X) is CPU time / real time.
Note that there is one process pegged at 100% usage the entire time that isn't doing any "work" in terms of processing lines - it's just distributing lines to workers. If we assume that thread averaged about 0.98 cores utilized, then throughout the run parallel managed to keep only around 0.066 out of 28 CPUs saturated with actual work.
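Spelling out the arithmetic (a quick sanity check using the `time` numbers above):

```shell
# Average cores busy = (user + sys) / real for the parallel run:
awk 'BEGIN { printf "%.3f\n", (172.191 + 6.800) / 171.062 }'
# -> 1.046 cores busy on average; subtract the ~0.98-core distributor
#    thread and only ~0.066 cores of actual work remain, out of 28.
```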
Now let's try with frun:
. ./frun.bash
time { frun : </tmp/f1; }
real 0m0.559s
user 0m10.409s
sys 0m0.201s
Interestingly, if we look at the ratio of CPU utilization (spent on real work):
18.9803220036 / 0.066 = 287x more CPU usage doing actual work
which gives a pretty straightforward story - forkrun is ~300x faster here because it utilizes ~300x more CPU actually doing work.
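The same sanity check for the frun run (numbers from its `time` output above):

```shell
# frun's average cores busy = (user + sys) / real:
awk 'BEGIN { printf "%.2f\n", (10.409 + 0.201) / 0.559 }'
# -> 18.98 cores spent on actual work
awk 'BEGIN { printf "%.1f\n", ((10.409 + 0.201) / 0.559) / 0.066 }'
# -> 287.6, i.e. ~287x more CPU spent on real work than parallel
```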
This regime of "high frequency, low latency tasks" - millions or billions of tasks that take milliseconds or microseconds each - is the regime where forkrun excels and tools like parallel fall apart.
Side note: if I bump it to 100 million newlines:
time { frun : </tmp/f1; }
real 0m4.212s
user 1m52.397s
sys 0m1.019s
The "keep improving" part hasn't done so well for 10 years already, has it? Maybe this year the new force-fed AI answers got a bit useful, but much of the time the risk of hallucination means you still have to go and read a more credible source.