This is a very good article, written well for an audience that might not understand the details of memory layouts.
Also, in the 0x0x11223344556677889900aabbccddeeff sample value they're using, what is this 0x0x format? Is this some low-level asm thing, or is it just a formatting typo?
One thing you can run into with these types in C/C++ is that the compiler assumes these types are aligned (unless you use some specific compiler attributes) and generates accesses that require alignment (e.g. MOVDQA) instead of ones that don't (MOVDQU). This is problematic if you have some custom allocator that (incorrectly) only provides 8-byte alignment, and cast allocated pointers to pointers to this 16-byte type (or a struct containing it). Not a Rust problem at all, just something to be careful of.
I found the LLVM/Clang bugs mentioned in the article kind of fascinating. As far as I know Clang has supported these types (partially) for quite a long time, so it's interesting that these issues weren't fixed until quite recently (if at all?).
Yes: OSX on ARM64 has alignof(max_align_t)==8 and alignof(__int128)==16. And malloc only provides 8-byte alignment, so if you're using types that require 16-byte alignment, you have to call aligned_alloc. Also, the stack is only 8-byte aligned.
Despite the alignof, int128 doesn't actually require 16-byte alignment on ARM64, so nothing goes wrong when you have an int128 within an 8-byte aligned struct.
Some x86-64 SIMD types do require 16-byte alignment (depending on what instructions you load them with). The Eigen math library, for instance, is slightly faster when you tell it to assume everything is 16-byte aligned, but you have to do some work to guarantee that's true. As well as calling aligned_alloc, you have to avoid locals since the stack is only 8-byte aligned.
> Yes: OSX on ARM64 has alignof(max_align_t)==8 and alignof(__int128)==16. And malloc only provides 8-byte alignment, so if you're using types that require 16-byte alignment, you have to call aligned_alloc. Also, the stack is only 8-byte aligned.
Huh? Stack alignment is 16B ("The stack pointer on Apple platforms follows the ARM64 standard ABI and requires 16-byte alignment." https://developer.apple.com/documentation/xcode/writing-arm6...) and memory returned by the system malloc is always 16B aligned (it was 16B aligned even on 32b x86 and ARM).
I think that the two-column compatibility table would be a lot more readable as a 2D compatibility matrix. I'm having difficulty getting a feel for what's compatible with what using the current layout.
In case anyone wants to try the same, my assumption is that they specified `#[repr(packed(8))]` on some structs to use alignments and paddings of at most 8 bytes. Using `#[repr(align(8))]` on the field should also work, if you want more fine-grained control.
> my assumption is that they specified `#[repr(packed(8))]` on some structs to use alignments and paddings of at most 8 bytes
That would mean taking a reference of any field of that struct is undefined behavior, unless the compiler does some special magic for this specific case.
> Using `#[repr(align(8))]` on the field
I don't think you can do that directly. You would need to use a newtype that specified the alignment, and use that type for the field.
That would make sense, but I would like to know in which situations that happens. And can I manually reduce the alignment in a struct (say to an alignment of 8 with a struct that has other fields with alignment of 8) to reduce memory usage? Or ensure that a location uses an alignment of 16 in places I want the higher performance.
What do you do with a 128bit integer? I already kind of consider 64 bit to be infinite.
Maybe some performance optimization hack where you can use a big integer instead of a float for faster math? Skip right past 64bit unix time and give yourself a lot of breathing room?
When I worked on S3, I was briefly responsible for reporting waste in the system. The basic equation was [Total Capacity of Hard Drives] - [# of bytes customers are paying for] * [Replication factor] = Waste
One week as I was preparing the report, it was clear something had gone haywire. Waste was roughly equal to total capacity. So either we'd lost all of our customers overnight, or there was a bug. Turns out the legacy billing system was using a long to count # of paid bytes and this had overflowed. So it does happen.
This is a classic example of why overflow should be an error rather than being defined as wrapping (as some people want, since wrapping is what the CPU naturally does for basic arithmetic). And if you didn't handle the error (in many cases you hadn't even realised overflow could happen, so why would you?), it should be fatal.
If the report runs and says something like "Fatal: Overflow while multiplying customer_space * repl_factor. Consider floating point numbers or a larger integer type" - you'd go "Oops" fix the bug and run it again.
I needed 128-bit and 256-bit integers on an embedded project recently.
In short, it was for fixed-point digital signal processing. The raw input and output samples were int64_t. We needed to add, subtract, multiply, and accumulate these to do filtering and linear regression with no loss of precision.
Conventional bigintegers weren't an option because the target application doesn't allow heap allocation. So we rolled our own [1] stack-allocated, fixed-width big integer class.
In this case, the target platform has 256 kiB of RAM, no hardware floating point, and no operating system. I could be wrong, but Boost did not look like a good fit.
It's a nice scalar representation of a UUID, for example. Also it makes a nice internal state for a fast 64-bit PRNG (e.g., any 128/64-bit LCG/MCG, such as pcg64).
I have used 128 bit (and even 256 bit) numbers during Advent of Code when I was too lazy to optimize my algorithm and found that the puzzle input didn't exceed the 64 bit space by enough to look into arbitrary sized integers or better algorithmic solutions.
There are also some data formats that are 128 bit, like others mentioned.
Depending on the application, going for floats can be a huge mistake: you only get exact integer math up to 2^53 (about 9×10^15). That's hardly infinite, and insufficient for many applications.
If you need to do multiplication of 64-bit values of unknown ranges you have no choice but to store the result in 128 bits, even if it's only an intermediate result.
I don't understand why we align 128-bit values to 16 bytes and waste precious memory if CPUs read and process data in 8-byte chunks anyway. Or is alignment necessary for SSE instructions? But SSE doesn't work with 128-bit integers.
Also, if a program uses a lot of memory (due to alignment), that can cause swapping, and the performance will be much worse than with unaligned storage.
The CPU works on 64-bits. But the memory works on 512-bit / 64 byte cache lengths or DDR4 bursts.
An unaligned access could span 2 cache lines or 2 DDR bursts. Or cross a page boundary (4096 bytes, I think?). Or some other higher-level unit of organization, requiring multiple accesses under the hood.
X86 does the easy thing and takes two or more clock ticks for the read. But aligned accesses remain faster, likely because of these low-level groupings of bytes.
Some architectures straight up do not allow unaligned reads and force the programmer to read two registers and then extract the unaligned data.
-------
In particular, reading or writing across two cache lines can cause a disastrous loss of efficiency. The memory system needs to lock / MESI-track both cache blocks. If one (or the other) memory location is still locked by another thread / CPU core, you'll get false sharing.
As such, a common practice is to 64-byte / 512-bit align your memory accesses in heavily multithreaded code.
Note: unaligned doesn't always mean cache-line splitting. Unaligned 64-bit loads & stores within a cache line incur no performance penalty on modern Intel architectures IIRC.
Also, reading a cache-line in shared MESI state should not cause false sharing degradation, only writing.
Unaligned doesn't always mean cache-line splitting, but aligned means never splitting cache lines.
Also, you're correct that aside from caching effects, unaligned loads and stores generally carry no penalty on x86. This is one of the smarter (IMO) things that Intel/AMD have done to keep x86's market share.
It can still penalize another writer by forcing its exclusive line into shared. It is not as expensive as concurrent writes, but it is a significant, measurable overhead compared to what would otherwise be a write to a private line.
And sometimes 128-byte, because some part of the cache system reads two cache lines at a time in contemporary x86-64, if I understand correctly.
It was a really big problem for early x86-64 CPUs, I believe. My hearsay recollection is that the cache prefetcher would pretty much always load pairs of adjacent cache lines.
> Some architectures straight up do not allow unaligned reads and force the programmer to read two registers and then extract the unaligned data.
That's largely obsolete for architectures that still get used now that individual transistors are cheap, since it turns out that it's much more efficient to do it in silicon, even if such accesses are pretty rare.
But the probability that a 16-byte value will be split across two 64-byte cache lines is not very high (I think it is 1/8). Is it worth optimizing for an unlikely case?
Doing 128bit atomics with CMPXCHG16B requires 16 byte alignment, and AFAIK that is one of the more common uses of 128bit types in practice since it's used in certain concurrency primitives to avoid the ABA problem.
Wouldn't it be better to expose 128-bit vector registers for this purpose? Like how the aarch64 module exposes int16x8_t. That seems much better than relying on a generic 128 bit type, because the use-case is clearly specified.
Specifically for CMPXCHG16B, the normal case is for this to actually be two separate numbers (usually an 8-byte pointer and an ABA counter) stored in memory as a pair.
For a lot of 128-bit arithmetic, it's also better to use the integer registers to take advantage of the ADC/ADCX/ADOX and MULX instructions for basic arithmetic operations. It actually depends a lot on the operation chain.
SSE instructions are not all that uncommon with bignums (including 128-bit types), but generally avoiding your structs crossing cache lines is very useful for performance. There are also instructions like CMPXCHG16B that need 16 bytes and are a lot worse if they cross cache lines.
IMO it's actually a pretty big performance bug for 128-bit ints to not be 16-byte aligned.
"Worse" is a slight understatement. When not disabled by the OS, a cacheline-crossing atomic RMW is 4-5 orders of magnitude slower than normal and affects all CPUs in the system.
The "A" in MOVDQA stands for "aligned," and requires 16-byte alignment. The corresponding MOVDQU does not require alignment but is marginally slower, at least on older CPUs.
On modern CPUs MOVDQU is still slower, but only if the data is unaligned and straddles two cachelines; if the data is properly aligned, then MOVDQU and MOVDQA perform identically nowadays. It's still important to align data, but it doesn't matter so much whether you use the aligned instructions. I suppose using the aligned instructions still gives you a free assertion that data you expect to be aligned actually is, rather than silently running slower when it's not.
IIRC at least one compiler (maybe ICC?) no longer bothers to emit aligned instructions at all, even if it knows the data should be aligned, because they found it to be more trouble than it's worth on modern hardware.
IIRC MOV*A and MOV*U differ pretty significantly in how the secret stuff in the cache prefetching system treats them. The assertion is kind of nice, but there is a difference.
In general, Rust inserts padding so that each field sits at a multiple of its alignment. Rustc does not offer guarantees on the ordering of the fields. This holds for all types, not just 128-bit values.
This allows rust programs to actually use less memory than C programs.
That's a special case and sits at the boundaries / bridge of languages.
> There is no indirection for these types; all data is stored within the struct, as you would expect in C. However with the exception of arrays (which are densely packed and in-order), the layout of data is not specified by default.
struct A {
a: i32,
b: u64,
}
struct B {
a: i32,
b: u64,
}
> Rust does guarantee that two instances of A have their data laid out in exactly the same way. However Rust does not currently guarantee that an instance of A has the same field ordering or padding as an instance of B.
It’s not the default, but it’s just a language feature. It’s also not just for interfacing with C. For example, you also use repr(C) for pointer tricks even if you never leave Rust. Lots of places in the standard library use it for that reason.
> By default, composite structures have an alignment equal to the maximum of their fields' alignments. Rust will consequently insert padding where necessary to ensure that all fields are properly aligned and that the overall type's size is a multiple of its alignment. [...]
That's right, you can pack things with a pragma directive in C and C++ for some compilers.
Rust can take it a bit further. In Rust, each field is placed at a multiple of its alignment, and a struct/tuple takes the alignment of its most-aligned field. This makes it easy to compute memory offsets, and it lets the compiler reuse one type's invalid bit patterns to encode another's state in a way that reduces memory usage (these are called niches)[1].
This is why dynamic linking is a bit more complex in Rust. Alternatively, you can offer a C bridge via `extern "C"`. C then acts as the bridge between Rust and other languages, and at that stage you need to take care of ordering and padding yourself.
As the article explains, this is purely about compatible calling conventions between Rust and C. I guess your question is also directed at C compiler implementations, which even override LLVM defaults to achieve this alignment.
It's a C type, it's not in the standard, but there are a lot of types that aren't in the standard. And of course, it would have been in the standard long ago if they hadn't made a terrible mistake in C99 with intmax_t[0].
Elaborating on the problems - _BitInt(128) has an alignment of 8, meaning that C's _BitInt(128) and __int128 are incompatible, similar to the Rust-C incompatibility that was just fixed :(.
It's actually worse than that; from what I can piece together of the history:
1. the people at Intel who originally implemented `_BitInt()` made `_BitInt(128)` eight-byte aligned
2. the x86_64 psABI document was updated to agree with that
3. it was implemented in clang in such a way that it also made `_BitInt(128)` eight-byte aligned on arm64 ...
4. ... but the AAPCS says that it's sixteen-byte aligned
5. ... and maybe the x86_64 psABI is going to change to say that it's sixteen byte aligned after-all.
So--as of a month ago when I was studying this because we're in the middle of enabling `[U]Int128` in Swift--_BitInt(128) didn't agree with __int128[_t] on either platform, and that's definitely a bug on arm64, and, although it behaves as documented on x86_64, that also might change.
LLVM had multiple ABI conformance bugs in its implementation of int128, but clang had workarounds for these bugs that rustc lacked, causing interop problems. I wonder why they put workarounds in clang instead of fixing LLVM?!
This is a great example of one of the criticisms of LLVM, that language front ends must still implement processor-specific details themselves – iirc another example is struct layout.
Yes, LLVM has many bugs that Clang either doesn’t hit or has actively worked around, sometimes for years. There was never a huge push to fix that situation, which made using LLVM for other languages a disappointing experience. Rust has uncovered many of those bugs over time, and it seems like Rust is an important enough customer that those do get fixed now. Slowly.
For example, I remember that for many years, Rust would have to throw away part of its aliasing analysis because its annotations were too precise and would confuse LLVM’s optimizer. And so on. But it’s getting better.
> I wonder why they put workarounds in clang instead of fixing LLVM?!
Among other things, because it would probably break rust (and other projects that had come to rely on the mistaken representation). Interoperability bugs like this often end up fundamentally unfixable in practice and the community ends up having to embrace bifurcated standards.
And this is the kind of convoluted situation that happens when you have software that can never, ever break backwards compatibility.
It is probably entirely necessary to do it in this particular situation, but when people argue for no software ever breaking backward compatibility, they're arguing for more and more convoluted solutions like this. There's always a cost.
> And this is the kind of convoluted situation that happens when you have software that can never, ever break backwards compatibility.
Toolchains are poster children for that genre, yeah. The ecosystems are too large and too interconnected to evolve in parallel. I mean, any one of LLVM, Rust and Clang are too large as a single project to implement a breaking change like this as a single event, trying to get them to do it in tandem just isn't going to happen. And even then there are uncounted other LLVM downstreams to worry about, all of whom need a story for how to migrate.
I just want to get the people who scream on HN about how software should never ever break backcompat under any circumstance, in a room with the people on this issue who are stunned by how convoluted this situation is, and get them to fight it out. Bonus points for when those people are the same person on different days.
That’s one of the links in the original article, which postdates the workaround in clang. It doesn’t explain why the incorrect ABI was previously worked around in clang instead of fixed in LLVM.