I also have a background in C/C++, etc., and I've only ever found myself missing value semantics when I use languages with implicit reference semantics. I guess I always figured the solution was "value semantics with better education / tooling". Education: people should understand value semantics. Tooling: imagine an IDE that highlights allocation points automatically (or perhaps the problem is implicit allocations rather than value semantics?).
> perhaps the problem is implicit allocations rather than value semantics?
I think that’s true. Expensive copies should never have been implicit. There was a story some time ago about a single keypress in the address bar of Chrome causing thousands of memory allocations. The culprit: lots of std::string arguments up and down the call stack.
Rust gets this right, with the hindsight of C++’s example: “a = b” is a move operation by default and clone() is always explicit, except for plain data types where copying is literally memcpy — and those are clearly marked as such by the type system.
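A minimal Rust sketch of that distinction (the values here are just illustrative):

```rust
fn main() {
    // String owns heap memory: `=` moves, so there is no hidden copy.
    let a = String::from("hello");
    let b = a; // move: `a` is no longer usable after this line
    // println!("{a}"); // compile error: value used after move
    let c = b.clone(); // the deep copy is spelled out explicitly
    assert_eq!(b, c);

    // Plain-data types implement Copy: `=` is a cheap bitwise copy.
    let x: u64 = 42;
    let y = x; // copy: `x` is still usable
    assert_eq!(x, y);
}
```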
IMHO, implicit allocations are a bit of a red herring. Yes, in C/C++ heap allocations are proportionately pretty expensive, but I've seen Java programs make just ridiculous amounts of implicit allocations without it really being much of a problem.
But allocations aren't the same as copies, and the argument for reference semantics has always been that implicit copies are problematic. In your std::string example, having that many String copies in a Java program would be similarly terrible (and this sometimes happens by accident because of abstraction layers that hide all the copying going on under the covers).
I do think Rust gets a lot of stuff right, but Rust's cognitive load is broadly recognized. I tend to see it as C++ with a lot fewer foot guns. ;-)
> Yes, in C/C++ heap allocations are proportionately pretty expensive, but I've seen Java programs have just ridiculous amounts of implicit allocations but there really isn't much of a problem.
Java programs make "ridiculous amounts of implicit allocations" because allocations are cheap in Java. And they need to be cheap because Java doesn't have value semantics so it leans hard on escape analysis + cheap allocations.
I agree with the rest of your comment, although I think most of Rust's "cognitive load" amounts to borrow-checker-vs-garbage-collection. You could envision a Rust with explicit allocations and a GC, and that language would have a "cognitive load" approaching that of Go while also being a fair bit more performant insofar as people can much more easily reason about allocations and thus performance.
> Java programs make "ridiculous amounts of implicit allocations" because allocations are cheap in Java. And they need to be cheap because Java doesn't have value semantics so it leans hard on escape analysis + cheap allocations.
Yes, but that's kind of the point, right? Implicit allocation isn't really a problem because a runtime that optimizes the allocations magically for you is a lot easier to build than a runtime that optimizes whether you really need to be copying objects as much as you do.
> Implicit allocation isn't really a problem because a runtime that optimizes the allocations magically for you is a lot easier to build
As far as I know, Java's (default) runtime gives cheap allocations at the cost of long GC pause times.
> than a runtime that optimizes whether you really need to be copying objects as much as you do
It's not "copying", it's "allocating", and avoiding allocations isn't that much work (and frankly I'm surprised it's such a minor problem that no one has bothered to build an IDE plugin that highlights these allocation points automatically--or at least I haven't heard of such a thing). Anyway, "a runtime that minimizes allocations" is just an escape analyzer and Java has one of these too, and IIRC it's a lot more sophisticated than Go's (but it's also a lot harder to reason about as a consequence).
> As far as I know, Java's (default) runtime gives cheap allocations at the cost of long GC pause times.
"long GC pause times" is kind of vague, so I guess you could be correct, but in practice there's a LOT of different ways the memory management can be handled, many of which are deemed "pauseless GC" (though the term is somewhat misleading).
My statement already took that reality into account, though. While not true for some use cases, in the vast majority of cases the runtime optimizes the allocations more than sufficiently.
> It's not "copying", it's "allocating"
Allocators can do a pretty good job of minimizing the overhead of allocation, to the point that the amortized cost isn't much more than a single machine instruction. Allocating gigabytes of memory quickly is possible. Copying the data can be a lot more work, and often objects have copy semantics that add even more work on top.
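A toy bump allocator in Rust illustrates why the amortized fast path can be so cheap (this is a sketch, not a real allocator: it hand-waves alignment and only frees everything at once when the arena is dropped):

```rust
// Toy bump allocator: allocating is a bounds check plus a pointer
// increment; all objects are freed together when the arena is dropped.
struct Bump {
    buf: Vec<u8>,
    next: usize,
}

impl Bump {
    fn new(capacity: usize) -> Self {
        Bump { buf: vec![0; capacity], next: 0 }
    }

    // Returns the offset of `size` fresh bytes, or None if the arena is full.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        if self.next + size > self.buf.len() {
            return None;
        }
        let offset = self.next;
        self.next += size; // the "single instruction" fast path
        Some(offset)
    }
}

fn main() {
    let mut arena = Bump::new(1024);
    assert_eq!(arena.alloc(16), Some(0));
    assert_eq!(arena.alloc(16), Some(16));
    assert_eq!(arena.alloc(2048), None); // doesn't fit
}
```

Moving GCs get the same effect for the general heap: live objects are compacted, so allocation is a bump of the nursery pointer.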
> Anyway, "a runtime that minimizes allocations" is just an escape analyzer and Java has one of these too, and IIRC it's a lot more sophisticated than Go's (but it's also a lot harder to reason about as a consequence).
I think you're implicitly saying "a runtime that minimizes heap allocations" there, in which case I'd agree.
> in practice there's a LOT of different ways the memory management can be handled, many of which are deemed "pauseless GC" (though the term is somewhat misleading).
Yes, but I'm pretty sure those "pauseless GC" schemes impose other tradeoffs.
> My statement was considering that reality though. While not true for some use cases, in the vast majority of cases, the runtime optimizes the allocations more than sufficiently.
I'm not sure I follow. The same could be said for Go--in the vast majority of cases, Go's tradeoffs (slow allocations, low latency / non-moving GC) are also suitable.
> Allocators can do a pretty good job of minimizing the overhead of allocation, to the point the amortized cost isn't much more than a single machine instruction.
As far as I know, speeding up allocations to this degree requires a moving GC which imposes a bunch of other constraints (including copying a bunch of memory around).
> Allocating gigabytes of memory quickly is possible. Copying the data can be a lot more work, and often objects have copy semantics that add a lot more additional work.
Yes, but the bottleneck here wasn't the copying, it was the allocations. And if you optimized away allocation cost entirely such that only the copy cost remained, that cost would be so small that the OP would never have bothered to profile because copying small objects like this is so cheap compared to everything else (even if it is expensive compared to bump allocating).
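A rough Rust sketch of that comparison (the numbers depend heavily on the allocator and machine, so treat it as illustrative only; `black_box` just keeps the optimizer from deleting the loops):

```rust
use std::time::Instant;

#[derive(Clone, Copy)]
struct Small { a: u64, b: u64, c: u64 } // 24 bytes: a trivial bitwise copy

fn main() {
    const N: u64 = 1_000_000;
    let src = Small { a: 1, b: 2, c: 3 };

    // Copying a small struct: a few register/stack moves per iteration.
    let t = Instant::now();
    let mut sum = 0u64;
    for _ in 0..N {
        let s = std::hint::black_box(src); // bitwise copy only
        sum = sum.wrapping_add(s.a);
    }
    let copy_time = t.elapsed();

    // Heap-allocating the same struct each iteration pays the allocator
    // cost on top of the identical copy.
    let t = Instant::now();
    let mut sum2 = 0u64;
    for _ in 0..N {
        let s = Box::new(std::hint::black_box(src)); // allocation + copy
        sum2 = sum2.wrapping_add(s.a);
    }
    let alloc_time = t.elapsed();

    assert_eq!(sum, sum2);
    println!("copy: {copy_time:?}, alloc+copy: {alloc_time:?}");
}
```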
> I think you're implicitly saying "a runtime that minimizes heap allocations" there, in which case I'd agree.
Yes, the allocator and GC are concerned with heap allocations and not stack allocations. I'm using "allocations" as a shorthand for "heap allocations".
In hindsight, I think I chose how to present this poorly, because yes, in this case, the allocation is what is killing the performance. I look at it, and I just see unnecessary implied behaviour creating a performance problem. Usually it isn't the allocations themselves that kill you, but it certainly is the case here.
I agree with you (I think?) that the implicit allocations are a pain point. I think in the Go case, it is the allocations that kill you most of the time (at least that's the case for me), but in C++ it's more likely to be expensive copy constructors or destructors or so on.
Allocations are only as damaging as your free function is slow.
Java has a tremendously good GC, so can cope with lots of allocations. Go has an OK one, so needs some help (but mollifying it often pays dividends elsewhere in locality and memory usage too). C++ has your default system heap, good luck.
Historically Java has traded long pause times for fast allocations, although I'm of the impression that it has recently found a way to have its cake and eat it.
Java has been tunable for a long time. Periodically, the recommended tuning changes, or new GC algorithms become available, etc. But it has long been possible to get short pause times with various combinations of choosing the right algorithm and writing your program the right way.
I think what really throws people off here is that getting good performance out of a Java application involves some skills which are alien to C++ programmers, and vice versa. You take an experienced C++ programmer and drop them into a Java codebase, they may have a very poor sense of what is expensive and what is cheap. Vice versa… experienced Java programmers don’t do well in C++ either.
The result is that you have precious few people with any significant, real-world experience fixing performance issues in both languages.
Agreed, but usually tuning for short pause times involves trading off throughput or allocation performance. But at the end of the day, if you aren't allocating a bunch of garbage in the first place, then you don't need to be as concerned about the costs of allocating or cleaning up the garbage. I wish Go did more to make allocations explicit so they could be more easily recognized and avoided; I dislike Java's approach of making allocations even more implicit/idiomatic while trying to fix the cost problem in the runtime (although I admire the ambition).
Parallel garbage collection in Java has been a thing for a long time. With the right tuning you might have very infrequent STW GCs, even in a game as allocation-heavy as Minecraft.
I’ve had no issues with Java 17+ under heavy allocation/garbage collection (a data encryption pipeline I haven’t tuned to reuse buffers yet), and its pause times are on the order of a handful of milliseconds, without meaningful tuning. I think it’s doing something like a GB/s of garbage collection.
And the jvm in question is doing a LOT more than just this, so it’s coping with millions of allocated objects at the same time.
I consider tens of milliseconds to be a long pause time (P99 should be <10ms), which is what I understand to be the ballpark for G1 (I haven't tested 17+). No doubt JVM is a fine piece of engineering, but I prefer Go's approach of just not allocating as much to begin with (Java's cheap allocations require a moving GC, which introduces challenges with respect to rewriting pointers and so on, which in turn introduces constraints for FFI and makes the whole system more complex). I wish it went further and made allocations more explicit.
Yeah, I’m nominally familiar with these, but I can’t understand why one of these low latency collectors wouldn’t be the default GC unless they impose other significant tradeoffs.
They don’t change it from the default because G1 isn’t terrible and Java has a long history of ensuring backwards compatibility - not just in APIs, but also behavior, warts and all.
If they changed the default GC, a lot of folks would freak out, even if it was objectively better in nearly every situation. Because someone, somewhere is relying on some weird bug in G1 or whatever, and now their software behaves differently, and it’s a problem.
Give it a couple more major revs, and it might still happen. G1 became the default in what, Java 9?
Naturally there are tradeoffs; the amount of infrastructure these collectors need is only justified when you're working at that kind of scale. For example, ZGC requires large pages and makes use of NUMA if available.
I’m not sure I get what you mean. You wouldn’t have that many String copies in Java by passing an unchanged String down the call stack. My point is that it’s too easy to make this mistake in C++.
In Java, the mistake happens only when there's an abstraction that hides the copying from you, so it isn't implicit in the same way, but it's still implicit.
> Rust gets this right, with the hindsight of C++’s example: “a = b” is a move operation by default and clone() is always explicit
Note that a move can still do a copy; in fact, Rust is kinda notorious for generating more on-stack memory copy operations than C++. It’s slowly improving, but it can still be surprisingly bad in some cases.
> except for plain data types where copying is literally memcpy
What do you mean by this? If I say `let x = 5; let y = x;` in Rust, that's a "plain data type copy" of a stack value, but memcpy is usually used to copy heap memory. What connection between copying of simple primitive stack values and memcpy are you suggesting here?
The compiler can optimize a memcpy with a known size into a small number of move instructions, so it's identical to copying stack values.
Try playing with memcpy on Godbolt and you'll find that the compiler will compile the memcpy to a single mov instruction when the size is small, and some movdqu/movups when the size is slightly large, and only a function call when the size is huge.
> memcpy is usually used to copy heap memory
memcpy is often used in low-level serialization / deserialization code since you can't just cast a buffer pointer to a uint32_t pointer and dereference that; the solution is memcpy between variables that are often both on the stack.
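The Rust analogue of that pattern is `u32::from_le_bytes`, which the compiler lowers to the same small fixed-size copy (buffer contents here are made up for illustration):

```rust
fn main() {
    // A byte buffer as it might arrive off the wire.
    let buf: [u8; 8] = [0x78, 0x56, 0x34, 0x12, 0, 0, 0, 0];

    // Instead of casting &buf[0] to a u32 pointer and dereferencing it
    // (undefined behavior in C, unsound in Rust without an alignment
    // guarantee), copy the four bytes into a u32.
    let value = u32::from_le_bytes(buf[0..4].try_into().unwrap());
    assert_eq!(value, 0x1234_5678);
}
```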
> What connection between copying of primitive simple stack values and memcpy are you suggesting here?
They're just using 'memcpy' as a shorthand for saying the bit pattern is blitted. Semantically, that's like a memcpy. The point is, there are no expensive allocations, nor do any embedded pointer fields need to be adjusted, etc.
Why do you think memcpy is normally used to copy heap memory? It's just a general bitwise copy from one location to the other.
I think the confusion here is that there isn't always a literal call to memcpy for copying small types like ints in the emitted code, but it's always doing something with the same effect and maybe sometimes using an actual memcpy (probably when copying arrays?).
Another interesting thing is that memcpy is sometimes used in C for copying data between stack variables when you need to convert one type to another without using a cast.
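Rust exposes that same idiom directly: `f32::to_bits` is semantically a memcpy of the bit pattern into an integer, no cast involved.

```rust
fn main() {
    let f: f32 = 1.0;
    // Reinterpret the bit pattern; equivalent to memcpy(&u, &f, 4) in C.
    let bits: u32 = f.to_bits();
    assert_eq!(bits, 0x3f80_0000); // IEEE 754 encoding of 1.0
    assert_eq!(f32::from_bits(bits), 1.0);
}
```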
I disagree--the bottleneck here is entirely the allocation. The copying is just a memcpy and it's very fast for small structs like this; like I said, it's not the same as a clone() in Rust, which is a deep copy. If you optimized the allocation away entirely (leaving only the copy cost), there wouldn't have been a significant performance problem and this blog would never have been written.
Actually, you'll find that in Rust, Box::new(stuff) will too often put stuff on the stack before copying it into the newly allocated memory. For large enough stuff, that can be slower than the allocation itself.
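For example (whether the intermediate stack copy is actually elided depends on the compiler version and optimization level, so this is a sketch of the hazard rather than a guaranteed behavior):

```rust
fn main() {
    // The array argument may be built on the stack first and then copied
    // into the heap allocation, rather than constructed in place --
    // especially in debug builds.
    let big: Box<[u8; 1 << 16]> = Box::new([0u8; 1 << 16]);
    assert_eq!(big.len(), 1 << 16);

    // Workarounds like vec![...].into_boxed_slice() write directly into
    // heap memory, avoiding the stack round trip.
    let big2: Box<[u8]> = vec![0u8; 1 << 16].into_boxed_slice();
    assert_eq!(big2.len(), 1 << 16);
}
```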
> I've only ever found myself missing value semantics when I use languages with implicit reference semantics.
Oh, I miss it every time. ;-)
I will say though that some newer languages seem to have a confused idea about how to offer mixed semantics. A bunch of them tie semantics to types. The ideal interface can vary by usage context. It's hard enough getting the semantics right as the callee (as opposed to caller), let alone when you're defining a type that will be used by who knows how many interfaces.
> I guess I always figured the solution was "value semantics with better education / tooling".
I've always thought much the same, but I have slowly come to appreciate that it's more than just education & tooling. Even with good education & tooling, there's a cognitive load that comes with getting interfaces right that for the general case is just not worth it.
I think this is half right. For anything 64 bits or smaller, value semantics are pretty much always going to be better. That said, being able to choose, per object, between value and reference semantics for larger objects is a pretty useful feature.
> For anything 64 bits or smaller, value semantics are pretty much always going to be better.
That's assuming a 64-bit CPU (which admittedly seems like a reasonable assumption). The nice thing about the abstraction, though, is that there's nothing preventing the runtime from applying value semantics for those trivial small-object cases where they're obviously more efficient.
Even for a 32-bit CPU a 64-bit type is only two words to copy - and in many cases those "copies" are just register loads. In contrast, reference types means to even access it you have to read the reference and then indirectly load the memory it points to. You have to really make something contrived where a two-word type ends up being more efficient as a reference than as a value.
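A hypothetical Rust illustration of the two calling styles (the struct and functions are made up; the compiler may well inline both away, so this shows the shape of the cost rather than measures it):

```rust
#[derive(Clone, Copy)]
struct Pair { x: u32, y: u32 } // 8 bytes: one register on a 64-bit CPU

// By value: the whole struct travels in registers; no pointer chase.
fn sum_by_value(p: Pair) -> u32 {
    p.x + p.y
}

// By reference: the callee receives a pointer and must dereference it
// before it can touch either field.
fn sum_by_ref(p: &Pair) -> u32 {
    p.x + p.y
}

fn main() {
    let p = Pair { x: 3, y: 4 };
    assert_eq!(sum_by_value(p), 7);
    assert_eq!(sum_by_ref(&p), 7);
}
```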
Yeah, I never cared for that. Specifically, I'd prefer that everything was just "struct", but structs could implement interfaces, which is essentially the Go/Rust model.
The counter argument to the "explicit is better than implicit" is that abstraction & encapsulation are such significant force multipliers. If done properly, implicit is good. It's just that in case of copying, doing it "properly" is well nigh impossible.