This is a very good article, written well for an audience that might not understand the details of memory layouts.
Also, in the 0x0x11223344556677889900aabbccddeeff sample value they're using, what is this 0x0x format? Is this some low-level asm thing, or is it just a formatting typo?
One thing you can run into with these types in C/C++ is that the compiler assumes these types are aligned (unless you use some specific compiler attributes) and generates accesses that require alignment (e.g. MOVDQA) instead of ones that don't (MOVDQU). This is problematic if you have some custom allocator that (incorrectly) only provides 8-byte alignment, and cast allocated pointers to pointers to this 16-byte type (or a struct containing it). Not a Rust problem at all, just something to be careful of.
I found the LLVM/Clang bugs mentioned in the article kind of fascinating. As far as I know Clang has supported these types (partially) for quite a long time, so it's interesting that these issues weren't fixed until quite recently (if at all?).
Yes: OSX on ARM64 has alignof(max_align_t)==8 and alignof(__int128)==16. And malloc only provides 8-byte alignment, so if you're using types that require 16-byte alignment, you have to call aligned_alloc. Also, the stack is only 8-byte aligned.
Despite the alignof, int128 doesn't actually require 16-byte alignment on ARM64, so nothing goes wrong when you have an int128 within an 8-byte aligned struct.
Some x86-64 SIMD types do require 16-byte alignment (depending on what instructions you load them with). The Eigen math library, for instance, is slightly faster when you tell it to assume everything is 16-byte aligned, but you have to do some work to guarantee that's true. As well as calling aligned_alloc, you have to avoid locals since the stack is only 8-byte aligned.
> Yes: OSX on ARM64 has alignof(max_align_t)==8 and alignof(__int128)==16. And malloc only provides 8-byte alignment, so if you're using types that require 16-byte alignment, you have to call aligned_alloc. Also, the stack is only 8-byte aligned.
Huh? Stack alignment is 16B ("The stack pointer on Apple platforms follows the ARM64 standard ABI and requires 16-byte alignment." https://developer.apple.com/documentation/xcode/writing-arm6...) and memory returned by the system malloc is always 16B aligned (it was 16B aligned even on 32b x86 and ARM).
I think that the two-column compatibility table would be a lot more readable as a 2D compatibility matrix. I'm having difficulty getting a feel for what's compatible with what using the current layout.
In case anyone wants to try the same, my assumption is that they specified `#[repr(packed(8))]` on some structs to use alignments and paddings of at most 8 bytes. Using `#[repr(align(8))]` on the field should also work, if you want more fine-grained control.
> my assumption is that they specified `#[repr(packed(8))]` on some structs to use alignments and paddings of at most 8 bytes
That would mean taking a reference of any field of that struct is undefined behavior, unless the compiler does some special magic for this specific case.
> Using `#[repr(align(8))]` on the field
I don't think you can do that directly. You would need to use a newtype that specified the alignment, and use that type for the field.
That would make sense, but I would like to know in which situations that happens. And can I manually reduce the alignment in a struct (say to an alignment of 8 with a struct that has other fields with alignment of 8) to reduce memory usage? Or ensure that a location uses an alignment of 16 in places I want the higher performance.
What do you do with a 128bit integer? I already kind of consider 64 bit to be infinite.
Maybe some performance optimization hack where you can use a big integer instead of a float for faster math? Skip right past 64bit unix time and give yourself a lot of breathing room?
When I worked on S3, I was briefly responsible for reporting waste in the system. The basic equation was [Total Capacity of Hard Drives] - [# of bytes customers are paying for] * [Replication factor] = Waste
One week as I was preparing the report, it was clear something had gone haywire. Waste was roughly equal to total capacity. So either we'd lost all of our customers overnight, or there was a bug. Turns out the legacy billing system was using a long to count # of paid bytes and this had overflowed. So it does happen.
This is a classic example of why overflow should be an error rather than being defined as wrapping (as some people want, since wrapping is what the CPU naturally does for basic arithmetic). And if you didn't handle the error (in many cases you hadn't even realised overflow could happen, so why would you?), it should be fatal.
If the report runs and says something like "Fatal: Overflow while multiplying customer_space * repl_factor. Consider floating point numbers or a larger integer type" - you'd go "Oops" fix the bug and run it again.
I needed 128-bit and 256-bit integers on an embedded project recently.
In short, it was for fixed-point digital signal processing. The raw input and output samples were int64_t. We needed to add, subtract, multiply, and accumulate these to do filtering and linear regression with no loss of precision.
Conventional bigintegers weren't an option because the target application doesn't allow heap allocation. So we rolled our own [1] stack-allocated, fixed-width big integer class.
In this case, the target platform has 256 kiB of RAM, no hardware floating point, and no operating system. I could be wrong, but Boost did not look like a good fit.
It's a nice scalar representation of a UUID, for example. Also it makes a nice internal state for a fast 64-bit PRNG (e.g., any 128/64-bit LCG/MCG, such as pcg64).
I have used 128 bit (and even 256 bit) numbers during Advent of Code when I was too lazy to optimize my algorithm and found that the puzzle input didn't exceed the 64 bit space by enough to look into arbitrary sized integers or better algorithmic solutions.
There are also some data formats that are 128 bit, like others mentioned.
Depending on the application, going for floats can be a huge mistake: you only get exact integer math up to 2^53 (about 9×10^15). That's hardly infinite, and insufficient for many applications.
If you need to do multiplication of 64-bit values of unknown ranges you have no choice but to store the result in 128 bits, even if it's only an intermediate result.
I don't understand why we align 128-bit values to 16 bytes and waste precious memory if CPUs read and process data in 8-byte chunks anyway. Or is alignment necessary for SSE instructions? But SSE doesn't work with 128-bit integers.
Also, if a program uses a lot of memory (due to alignment), that can cause swapping, and the performance will be much worse than with unaligned storage.
The CPU works on 64-bits. But the memory works on 512-bit / 64 byte cache lengths or DDR4 bursts.
An unaligned access could span 2 cache lines or 2 DDR bursts. Or cross a page boundary (4096 bytes, I think?). Or some other higher-level unit of organization, requiring multiple accesses under the hood.
X86 does the easy thing and takes two or more clock ticks for the read. But aligned accesses remain faster, likely because of these low-level groupings of bytes.
Some architectures straight up do not allow unaligned reads and force the programmer to read two registers and then extract the unaligned data.
-------
In particular, reading or writing across two cache lines can cause a disastrous loss of efficiency. The memory system needs to lock / MESI-track both cache blocks. If one (or the other) memory location is still locked by another thread / CPU core, you'll get false sharing.
As such, a common practice is to 64-byte / 512-bit align your memory accesses in heavily multithreaded code.
Note: unaligned doesn't always mean cache-line splitting. Unaligned 64-bit loads & stores within a cache line incur no performance penalty on modern Intel architectures IIRC.
Also, reading a cache-line in shared MESI state should not cause false sharing degradation, only writing.
Unaligned doesn't always mean cache-line splitting, but aligned means never splitting cache lines.
Also, you're correct that aside from caching effects, unaligned loads and stores generally carry no penalty on x86. This is one of the smarter (IMO) things that Intel/AMD have done to keep x86's market share.
It can still penalize another writer by forcing its exclusive line into shared. It is not as expensive as concurrent writes, but it is a significant, measurable overhead compared to what would otherwise be a write to a private line.
And sometimes 128-byte, because some part of the cache system reads two cache lines at a time in contemporary x86-64, if I understand correctly.
It was a really big problem for early x86-64 CPUs, I believe. My hearsay recollection is that the cache prefetcher would pretty much always load pairs of adjacent cache lines.
> Some architectures straight up do not allow unaligned reads and force the programmer to read two registers and then extract the unaligned data.
That's largely obsolete for architectures that still get used now that individual transistors are cheap, since it turns out that it's much more efficient to do it in silicon, even if such accesses are pretty rare.
But the probability that a 16-byte value will be split across two 64-byte cache lines is not very high (I think it is 1/8). Is it worth optimizing for an unlikely case?
Doing 128bit atomics with CMPXCHG16B requires 16 byte alignment, and AFAIK that is one of the more common uses of 128bit types in practice since it's used in certain concurrency primitives to avoid the ABA problem.
Wouldn't it be better to expose 128-bit vector registers for this purpose? Like how the aarch64 module exposes int16x8_t. That seems much better than relying on a generic 128 bit type, because the use-case is clearly specified.
Specifically for CMPXCHG16B, the normal case is for this to actually be two separate numbers (usually an 8-byte pointer and an ABA counter) stored in memory as a pair.
For a lot of 128-bit arithmetic, it's also better to use the integer registers to take advantage of the ADC/ADCX/ADOX and MULX instructions for basic arithmetic operations. It actually depends a lot on the operation chain.
SSE instructions are not all that uncommon with bignums (including 128-bit types), but generally avoiding your structs crossing cache lines is very useful for performance. There are also instructions like CMPXCHG16B that need 16 bytes and are a lot worse if they cross cache lines.
IMO it's actually a pretty big performance bug for 128-bit ints to not be 16-byte aligned.
"Worse" is a slight understatement. When not disabled by the OS, a cacheline-crossing atomic RMW is 4-5 orders of magnitude slower than normal and affects all CPUs in the system.
The "A" in MOVDQA stands for "aligned," and requires 16-byte alignment. The corresponding MOVDQU does not require alignment but is marginally slower, at least on older CPUs.
On modern CPUs MOVDQU is still slower, but only if the data is unaligned and straddles two cachelines; if the data is properly aligned, then MOVDQU and MOVDQA perform identically nowadays. It's still important to align data, but it doesn't matter so much whether you use the aligned instructions. I suppose using the aligned instructions still gives you a free assertion that data you expect to be aligned actually is, rather than silently running slower when it's not.
IIRC at least one compiler (maybe ICC?) no longer bothers to emit aligned instructions at all, even if it knows the data should be aligned, because they found it to be more trouble than it's worth on modern hardware.
IIRC MOV*A and MOV*U differ pretty significantly in how the secret stuff in the cache prefetching system treats them. The assertion is kind of nice, but there is a difference.
In general, Rust inserts padding so that each field sits at a multiple of its alignment. Rustc does not offer guarantees on the ordering of the fields. This holds for all types, not just 128-bit values.
This allows rust programs to actually use less memory than C programs.
That's a special case and sits at the boundaries / bridge of languages.
> There is no indirection for these types; all data is stored within the struct, as you would expect in C. However with the exception of arrays (which are densely packed and in-order), the layout of data is not specified by default.
struct A {
a: i32,
b: u64,
}
struct B {
a: i32,
b: u64,
}
> Rust does guarantee that two instances of A have their data laid out in exactly the same way. However Rust does not currently guarantee that an instance of A has the same field ordering or padding as an instance of B.
It’s not the default, but it’s just a language feature. It’s also not just for interfacing with C. For example, you also use repr(C) for pointer tricks even if you never leave Rust. Lots of places in the standard library use it for that reason.
> By default, composite structures have an alignment equal to the maximum of their fields' alignments. Rust will consequently insert padding where necessary to ensure that all fields are properly aligned and that the overall type's size is a multiple of its alignment. [...]
That's right, you can pack things with a pragma directive in C and C++ for some compilers.
Rust can take it a bit further. In Rust, each field is placed at a multiple of its alignment, and a struct/tuple takes the alignment of its most-aligned field. This makes it easy to compute memory offsets, and it lets the compiler reuse one type's invalid bit patterns to encode another's state in a way that reduces memory usage (these are called niches)[1].
This is why dynamic linking is a bit more complex in Rust. Alternatively, you can offer a C bridge via `extern "C"`. C then acts as the bridge between Rust and other languages, and at that stage you need to take care of ordering and padding yourself.
As the article explains, this is purely about compatible calling conventions between Rust and C. I guess your question is also directed at C compiler implementations, which even override LLVM defaults to achieve this alignment.
It's a C type, it's not in the standard, but there are a lot of types that aren't in the standard. And of course, it would have been in the standard long ago if they hadn't made a terrible mistake in C99 with intmax_t[0].
Elaborating on the problems - _BitInt(128) has an alignment of 8, meaning that C's _BitInt(128) and __int128 are incompatible, similar to the Rust-C incompatibility that was just fixed :(.
It's actually worse than that; from what I can piece together of the history:
1. the people at Intel who originally implemented `_BitInt()` made `_BitInt(128)` eight-byte aligned
2. the x86_64 psABI document was updated to agree with that
3. it was implemented in clang in such a way that it also made `_BitInt(128)` eight-byte aligned on arm64 ...
4. ... but the AAPCS says that it's sixteen-byte aligned
5. ... and maybe the x86_64 psABI is going to change to say that it's sixteen byte aligned after-all.
So--as of a month ago when I was studying this because we're in the middle of enabling `[U]Int128` in Swift--_BitInt(128) didn't agree with __int128[_t] on either platform, and that's definitely a bug on arm64, and, although it behaves as documented on x86_64, that also might change.
LLVM had multiple ABI conformance bugs in its implementation of int128, but clang had workarounds for these bugs that rustc lacked, causing interop problems. I wonder why they put workarounds in clang instead of fixing LLVM?!
This is a great example of one of the criticisms of LLVM, that language front ends must still implement processor-specific details themselves – iirc another example is struct layout.
Yes, LLVM has many bugs that Clang either doesn’t hit or has actively worked around, sometimes for years. There was never a huge push to fix that situation, which made using LLVM for other languages a disappointing experience. Rust has uncovered many of those bugs over time, and it seems like Rust is an important enough customer that those do get fixed now. Slowly.
For example, I remember that for many years, Rust would have to throw away part of its aliasing analysis because its annotations were too precise and would confuse LLVM’s optimizer. And so on. But it’s getting better.
> I wonder why they put workarounds in clang instead of fixing LLVM?!
Among other things, because it would probably break rust (and other projects that had come to rely on the mistaken representation). Interoperability bugs like this often end up fundamentally unfixable in practice and the community ends up having to embrace bifurcated standards.
And this is the kind of convoluted situation that happens when you have software that can never, ever break backwards compatibility.
It is probably entirely necessary to do it in this particular situation, but when people argue for no software ever breaking backward compatibility, they're arguing for more and more convoluted solutions like this. There's always a cost.
> And this is the kind of convoluted situation that happens when you have software that can never, ever break backwards compatibility.
Toolchains are poster children for that genre, yeah. The ecosystems are too large and too interconnected to evolve in parallel. I mean, any one of LLVM, Rust and Clang are too large as a single project to implement a breaking change like this as a single event, trying to get them to do it in tandem just isn't going to happen. And even then there are uncounted other LLVM downstreams to worry about, all of whom need a story for how to migrate.
I just want to get the people who scream on HN about how software should never ever break backcompat under any circumstance, in a room with the people on this issue who are stunned by how convoluted this situation is, and get them to fight it out. Bonus points for when those people are the same person on different days.
That’s one of the links in the original article, which postdates the workaround in clang. It doesn’t explain why the incorrect ABI was previously worked around in clang instead of fixed in LLVM.