The intention here is to use wasm to let you safely run user code _within_ the kernel. Their primary targets are nginx and FUSE. Conceivably, avoiding the context switch into and out of the kernel could have significant performance benefits, but there aren't any numbers out yet for nginx specifically.
That's certainly a fascinating idea. My initial thought was "Wait, doesn't the kernel already provide a sandboxed execution environment -- called userspace?" This would still have scheduling overhead, but I assume the idea is to avoid many of the other context switching steps, such as switching page tables, relying instead on Wasm/JIT checks to ensure ahead of time that memory violations can't happen.
Once upon a time syscalls were slow, but architectures now provide features like syscall/sysenter for switching privilege levels, with costs comparable to userspace function calls.
Once upon a time switching page tables was slow, but now we have features like PCID that allow preserving TLB entries across switches.
Soon, if not already, the principal cost of context switching will be the necessity to flush prediction and data buffers. In-kernel solutions like Wasmjit must incur the same costs. Quite possibly they may turn out to be slower overall: 1) they won't be able to take advantage of the same hardware-optimized privilege management facilities (existing and future ones--imagine tagged prediction buffers much like PCID), and 2) they still incur the extra runtime overhead of running in a VM which, JIT-optimized or not, eats into limited resources like those prediction and data buffers that have become so critical to maximizing performance.
Granted, if it's going to work well at all then Nginx seems like a good bet, especially because of I/O. But there are many other solutions to that problem. Obsession with DPDK may be waning, but zero-copy AIO is still a thing and there are more ergonomic userspace alternatives (existing and in the pipeline) that let you leverage the in-kernel network stack without having to incur copying costs. And then there are solutions like QUIC that redefine the problem and which should work extremely well with existing zero-copy interfaces.
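To make the "leverage the in-kernel stack without copying" point concrete, here's a minimal sketch (my own illustration, not from the parent comment) using sendfile(2), which on Linux moves file pages to a socket kernel-side -- the data never passes through a userspace buffer. The socketpair and byte count are just demo scaffolding.

```python
# Sketch: in-kernel zero-copy with sendfile(2). File contents flow from
# the file's page cache to the socket buffer entirely inside the kernel.
import os
import socket
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * 8192)
    f.flush()
    a, b = socket.socketpair()  # stands in for a real client connection
    # out_fd, in_fd, offset, count -- no read()/write() round-trip needed
    sent = os.sendfile(a.fileno(), f.fileno(), 0, 8192)
    data = b.recv(sent)
    print("moved", sent, "bytes kernel-side")
```

A real server would loop until the whole file is sent and handle short writes; the point is only that the copy into and out of userspace disappears.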
CPUs are incredibly complex precisely because so much of the security heavy-lifting once performed in the OS is being accomplished in the CPU or dedicated controllers. And these newer optimizations were designed to be integrated within the context of the traditional userspace/kernel split.
Wasmjit looks like an extremely cool project and I don't doubt its utility. There's plenty of room for alternative approaches, I just don't think the value-add is all that obvious.[1] Probably less to do with performance and more to do with providing a clear, stable, well-supported environment for solving (and subsequently maintaining!) difficult integration problems.
[1] I just want to reiterate that by saying the value-add isn't obvious I'm not implying anything about the potential magnitude of that value-add. I've been around long enough to understand that most pain points are invisible and just because I can't see them or people can't articulate them doesn't mean they don't exist or that the potential for serious disruption isn't there.
Just an honest question: could you elaborate on the methods behind "there are more ergonomic userspace alternatives (existing and in the pipeline) that let you leverage the in-kernel network stack without having to incur copying costs"? I've been curious about DPDK, FStack, Seastar, IncludeOS/MirageOS, etc. but wondering if there are easier ways to get the zero-copy IO.
Netmap - DPDK-like packet munging performance but with interfaces and semantics that behave more like traditional APIs. Signaling occurs through a pollable descriptor, meaning you can handle synchronization and work queueing problems much more like you would normally.
vmsplice - IIRC it recently became possible to reliably detect when a page loan can be reclaimed, which is (or hopefully was) the biggest impediment to convenient use of vmsplice.
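For readers unfamiliar with vmsplice, here's a minimal Linux-only sketch of the basic mechanism (my illustration; the stdlib doesn't wrap vmsplice, so it goes through ctypes): user memory is spliced into a pipe and read back out. The page-reclamation issue mentioned above only bites when pages are gifted to the kernel with SPLICE_F_GIFT; this sketch passes flags=0, where the kernel is free to copy.

```python
# Minimal vmsplice(2) sketch: map a user buffer into a pipe.
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p),
                ("iov_len", ctypes.c_size_t)]

r, w = os.pipe()
payload = b"zero-copy hello"
buf = ctypes.create_string_buffer(payload, len(payload))
iov = iovec(ctypes.cast(buf, ctypes.c_void_p), len(payload))

# ssize_t vmsplice(int fd, const struct iovec *iov,
#                  unsigned long nr_segs, unsigned int flags)
n = libc.vmsplice(w, ctypes.byref(iov), 1, 0)
assert n == len(payload), os.strerror(ctypes.get_errno())
echoed = os.read(r, n)
```

With SPLICE_F_GIFT you'd avoid even the potential copy, at the price of not knowing when it's safe to reuse the buffer -- exactly the impediment discussed above.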
peeking - Until recently Linux poll/epoll didn't obey SO_RCVLOWAT, which made it problematic to peek at data before using splice() to shuttle data or dequeueing a connection request. I have a strong suspicion that before this fix many apps like SSL sniffers simply burnt CPU cycles without anybody realizing. Though in the Cloud age we seem much more tolerant of spurious, unreproducible latency and connectivity "glitches".
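A sketch of the peeking pattern being described (my own illustration): raise SO_RCVLOWAT so readiness ideally isn't reported until a full record is buffered, then MSG_PEEK at it without dequeueing. Whether poll/select actually honors the low-water mark depends on kernel version and socket type, per the fix mentioned above; setting and peeking work regardless.

```python
# SO_RCVLOWAT + MSG_PEEK: inspect a complete record without consuming it.
import select
import socket

LOWAT = 8  # hypothetical fixed record/header size

a, b = socket.socketpair()
a.setsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT, LOWAT)

b.sendall(b"8-byte-h")  # exactly LOWAT bytes
readable, _, _ = select.select([a], [], [], 1.0)
if a in readable:
    # Peek leaves the bytes queued, e.g. for a later splice() to shuttle
    # the connection's data elsewhere once we've classified it.
    header = a.recv(LOWAT, socket.MSG_PEEK)
```

On kernels that ignore SO_RCVLOWAT in poll, the select above wakes up on partial data and you end up in exactly the spin-until-complete loop the comment suspects was burning CPU.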
AIO - There's always activity around Linux's AIO interfaces. I don't keep track but there may have been a ring-buffer patch merged which allows dequeueing newly arrived events or data without having to poll for readiness first.
Device Passthru - CPU VM monitor extensions make it easier to work with devices directly. Not quite the same thing as traditional userspace/kernel interfaces, but it seems like people are increasingly running what otherwise look like (and are implemented like) regular userspace apps within VM monitor frameworks. Like with Netmap, all you really need is a singular notification primitive (possibly synthesized yourself) that allows you to apply whatever model of concurrency you want--asynchronous, synchronous, or some combination--and in a way that is composable and friendly to regular userspace frameworks. VM monitor APIs and device passthrough permit arranging the burdens between userspace/VM and the kernel more optimally.
The entry and exit cost of a syscall is ~150 cycles. (Source: Many Google hits--blogs, papers--show people reciting 150 cycles exactly so I assume there's a singular, primary source for this. Maybe will track down the paper later.)
I'd say that's comparable. Many syscalls take much longer, but that's just because syscalls tend to be very abstract interfaces where each call performs costly operations or bookkeeping, especially on shared data structures requiring costly memory barriers. That doesn't mean the syscall interface itself is expensive. Microkernel skeptics stopped arguing syscall overhead a long time ago, and proponents are no longer defensive about it.
A direct call without args is nearly 10 cycles on newish hardware; a vDSO call is probably +5-10 on top of that. A real syscall on the same CPU that returns something comparable will probably be 4-10x that.
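The three tiers being compared can be sketched with a rough microbenchmark (my illustration; absolute numbers are dominated by interpreter overhead, so only the relative ordering is meaningful): a plain function call, a vDSO-backed call (clock_gettime via time.monotonic never enters the kernel on Linux), and a full syscall (getpid).

```python
# Rough relative-cost microbenchmark: plain call vs vDSO call vs syscall.
import os
import time
import timeit

def plain():
    return 42

N = 200_000
t_plain = timeit.timeit(plain, number=N)          # no kernel involvement
t_vdso = timeit.timeit(time.monotonic, number=N)  # vDSO: kernel code, no entry
t_syscall = timeit.timeit(os.getpid, number=N)    # real user->kernel transition

for name, t in [("plain call", t_plain),
                ("vDSO clock_gettime", t_vdso),
                ("getpid syscall", t_syscall)]:
    print(f"{name:20s} {t / N * 1e9:8.1f} ns/call")
```

On a typical Linux box the syscall row comes out a small constant factor above the others, consistent with the "cheap but not free" picture upthread.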
Sure, but the context is relative interface and abstraction costs. Nginx running in Wasmjit in the kernel is unlikely to be making direct calls to internal kernel functions. Even if the JIT and framework were capable of that, I would think that Nginx would still be calling through an abstraction framework that provides proper read/recv semantics. It would be the sum of those intermediate calls until reaching the same point in the kernel that you'd want to compare.
This talk is relevant https://www.destroyallsoftware.com/talks/the-birth-and-death.... Tl;dw: in-kernel JIT has the potential to be 4% faster than direct execution. I am still dubious, however, as a JIT requires far more resources than running a binary directly.
Anything receiving potentially malicious input should be untrusted and sandboxed if possible. That includes the network stack itself in high-assurance security products. We also prefer simple, rigorously-analyzed software with high predictability. Other stuff often has vulnerabilities. Nginx is nearly 200,000 lines of code per an interview with its CEO I just skimmed. Lwan, made for security and maintainability, is about 10,000 lines of code in comparison:
Lwan's actually small enough that mathematical verification for correctness against a spec is feasible, even though costly. Unlike Lwan, I could never have any hope of proving the correctness of Nginx. Even its safety would be difficult just because of all the potential code interactions on malicious input. Leak-free for the secrets it contains? Forget about it. Best bet is to shove that thing either into a partition on a separation kernel/VMM or onto a dedicated machine. The automated tooling for large programs does get better every year, though. One can use whatever is compatible with Nginx. And still shove that humongous server into a deprivileged partition just in case. ;)
Last time I measured this, the time it took the Linux scheduler to decide what task to schedule was far more than the time it took the entry code and CPU to switch from user to kernel or vice versa. Meltdown changes this, but Meltdown-proof AMD CPUs are widely available and Meltdown-proof Intel CPUs should show up eventually.
Everything said in the talk went true. Which means we are very close to a Nuclear War. ( And it certainly looks like a possibility at the way things are going )