Linux is unusual among OS kernels in that <i>direct</i> system calls from arbitrary userspace code are supported and ABI-stable. This model has always been a terrible idea. It robs the system of an ability to intercept system calls in userspace before doing an expensive privilege-mode transition.<p>If, instead, as on OpenBSD, the kernel enforced the rule that all system calls had to go through libc (or perhaps a big ntdll.dll-like VDSO), then the whole problem the linked article tries in vain to solve would disappear. If you wanted to hook a system call, you'd just change the libc/VDSO dispatch. No need to rewrite any instructions.<p>If I were Linus, I'd make a new rule: starting <i>today</i>, <i>all</i> new system calls <i>must</i> go through VDSO. No exceptions. SYSCALL from anywhere else? SIGKILL.<p>This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.
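For anyone following along, here is a minimal sketch of what "direct" means in practice, assuming Linux with glibc. The helper names (direct_getpid, libc_getpid, check_paths_agree) are made up for illustration. Note that syscall(2) is itself a libc function, so it only stands in for a raw SYSCALL instruction here; the point is that the kernel accepts the request by number no matter which code issues it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: request the PID by syscall number, bypassing the
 * getpid() wrapper. syscall(2) is only a thin trampoline around the trap. */
static pid_t direct_getpid(void) {
    return (pid_t)syscall(SYS_getpid);
}

/* Hypothetical helper: the same request through the normal libc wrapper,
 * which is the only path the "everything through libc" model would allow. */
static pid_t libc_getpid(void) {
    return getpid();
}

/* Both paths end up in the same kernel entry point, so they must agree. */
static void check_paths_agree(void) {
    assert(direct_getpid() == libc_getpid());
}
```

Calling check_paths_agree() from main() should pass silently on Linux. Under the proposed rule, something like direct_getpid() issued from outside the blessed library would presumably be the thing that gets killed.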
Direct system calls are an amazing idea. The ntdll and BSD models are worse. The whole libc becomes a security boundary without the protection of kernel space. So much Windows malware and process tampering happens because you now have a library (ntdll) living entirely in userspace that is given special privileges, which makes it a huge attack surface. Then you have to deal with breakage between the built-in libc versions and the kernel.<p>The syscall overhead isn't as large as you suppose; for workloads where it actually makes a difference, there are robust low-syscall paths for I/O- and latency-sensitive operations, with DPDK, io_uring, and futex being a few examples.<p>And there are robust, performant methods on Linux for syscall interception/tracing: see seccomp unotify, BPF tracepoints, and ftrace.
> This model has always been a terrible idea. It robs the system of an ability to intercept system calls in userspace before doing an expensive privilege-mode transition.<p>This model has always been a trade-off. It has downsides, but it also has upsides, including an immense boost in flexibility; decoupling from any particular userspace is useful.<p>> This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.<p><i>Can</i> you LD_PRELOAD in front of the vDSO? I was under the (possibly mistaken) impression that the kernel injects it directly.
> <i>Can</i> you LD_PRELOAD in front of the vDSO? I was under the (possibly mistaken) impression that the kernel injects it directly.<p>The kernel puts the vDSO in memory and tells ld.so where it is, but where (if anywhere) ld.so places it in its search order is ld.so’s own concern. (TBH I don’t know whether ld.so will actually allow LD_PRELOAD to override the vDSO, but there’s no reason for it not to, except I guess for the syscalls that are needed to perform the dynamic linking itself.)
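To make the "tells ld.so where it is" part concrete: on Linux the kernel passes the vDSO's base address in the auxiliary vector under AT_SYSINFO_EHDR, and any process can read it back with getauxval(3). A rough sketch, assuming glibc; the helper names (vdso_base, looks_like_elf, check_vdso) are mine and purely illustrative. It says nothing about where ld.so places the vDSO in its search order.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>

/* Hypothetical helper: base address of the vDSO image as advertised by the
 * kernel in the auxiliary vector, or NULL if no vDSO was mapped. */
static const void *vdso_base(void) {
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Hypothetical helper: the vDSO is an ordinary ELF shared object in memory,
 * so its first four bytes should be the ELF magic "\177ELF". */
static int looks_like_elf(const void *base) {
    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}

/* If the kernel advertised a vDSO at all, it should start with an ELF header. */
static void check_vdso(void) {
    const void *base = vdso_base();
    assert(base == NULL || looks_like_elf(base));
}
```

From there the dynamic linker parses that in-memory ELF image like any other shared object, which is why symbol lookup order, and therefore whether a preloaded library can shadow it, is a userspace decision.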
> all system calls had to go through libc (or perhaps a big ntdll.dll-like<p>Which makes containers crap on Windows and *BSD, as they have to run the correct libc or equivalent. Thus you need to build a different container per OS version, which sucks compared to Linux.
That's why OpenBSD is inconvenient for development: it binds you to libc bloatware.
> If I were Linus, I'd make a new rule<p>Or, you know, just propose your idea to him
Based on <a href="https://www.phoronix.com/news/Linus-Torvalds-No-Random-vDSO" rel="nofollow">https://www.phoronix.com/news/Linus-Torvalds-No-Random-vDSO</a> , I had been under the impression that he wasn't fond of adding more use of vDSO. On rereading, I can't tell if that's a vDSO thing or a preference against fast randomness being provided by the kernel.