Projects like this and Docker make me seriously wonder where software engineering is going. Don't get me wrong, I don't mean to criticize Docker or Toro in particular. It's the increasing dependency on such approaches that bothers me.<p>Docker was conceived to solve the problem of things "working on my machine", and not anywhere else. This was generally caused by the differences in the configuration and versions of dependencies. Its approach was simple: bundle both of these together with the application in unified images, and deploy these images as atomic units.<p>Somewhere along the line, however, the problem has mutated into "works on my container host". How is that possible? Turns out that with larger modular applications, the configuration and dependencies naturally demand separation. This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.<p>Now hardware virtualization. I like how AArch64 generalizes this: there are 4 levels of privilege baked into the architecture. Each has control over the lower ones and can call up the one immediately above to request a service. Simple. Let's narrow our focus to the lowest three: EL0 (classically the user space), EL1 (the kernel), EL2 (the hypervisor). EL0, in most operating systems, isn't capable of doing much on its own; its sole purpose is to do raw computation and request I/O from EL1. EL1, on the other hand, has the power to talk directly to the hardware.<p>Everyone is happy, until the complexity of EL1 grows out of control and becomes a huge attack surface, difficult to secure and easy to exploit from EL0. Not good. The naive solution? Go a level above, and create a layer that will constrain EL1, or actually, run multiple, per-application EL1s, and punch some holes through for them to still be able to do the job: create a hypervisor.
But then, as those vaguely defined "holes", also called system calls and hypercalls, grow, won't the attack surface grow too?<p>Or in other words, with the user space shifting to EL1, will our hypervisor become the operating system, just like docker-compose became a dynamic linker?
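For anyone who wants the privilege picture above in concrete form, here is a toy Python model of it. It only encodes the simplified view from the comment: each level calls the one directly above it, using the instruction typically associated with that transition (SVC, HVC, SMC). The names `EL`, `call_up` and `controls` are made up for this sketch, and real hardware has more cases than this (traps, SMC issued from EL1, and so on).

```python
from enum import IntEnum


class EL(IntEnum):
    """AArch64 exception levels, lowest privilege first."""
    EL0 = 0  # user space
    EL1 = 1  # kernel
    EL2 = 2  # hypervisor
    EL3 = 3  # secure monitor


# Instruction typically used by each level to request a service
# from the level directly above it (simplified view).
CALL_INSTRUCTION = {EL.EL0: "SVC", EL.EL1: "HVC", EL.EL2: "SMC"}


def call_up(current: EL) -> tuple[str, EL]:
    """Return (instruction, target level) for a call to the next level up."""
    if current not in CALL_INSTRUCTION:
        raise ValueError(f"{current.name} is the highest level; nothing to call")
    return CALL_INSTRUCTION[current], EL(current + 1)


def controls(higher: EL, lower: EL) -> bool:
    """A level has control over every level strictly below it."""
    return higher > lower
```

In this model, `call_up(EL.EL0)` gives `("SVC", EL.EL1)`, which is the system call path, and `call_up(EL.EL1)` gives `("HVC", EL.EL2)`, the hypercall path the question above is about.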
I see a number of assumptions in your post that don't match my view of the picture.<p>Containers arose as a way to solve the dependency problems created by traditional Unix. They grew out of tools like chroot, BSD jails, and Solaris Zones. Containers allow deploying dependencies that cannot be simultaneously installed on a traditional Unix host system. It's not a UNIX architecture limitation but rather a result of POSIX + tradition; e.g. Nix also solves this, but differently.<p>Containers (like chroot and jail before them) also help ensure that a running service does not depend on the parts of the filesystem it wasn't given access to. Additionally, containers can limit network access, and process tree access.<p>These limitations are not a proper security boundary, but definitely a dependency boundary, helping avoid spaghetti-style dependencies, and surprises like "we never realized that our ${X} depends on ${Y}".<p>Then, there's the Fundamental Theorem of Software Engineering [1], which states: "We can solve any problem by introducing an extra level of indirection." So yes, expect the number of levels of indirection to grow everywhere in the stack. A wise engineer can expect to merge or remove some levels here and there, when the need for them is gone, but they would never expect that new levels of indirection should stop emerging.<p>[1]: <a href="https://en.wikipedia.org/wiki/Fundamental_theorem_of_software_engineering" rel="nofollow">https://en.wikipedia.org/wiki/Fundamental_theorem_of_softwar...</a>
To be honest, I've read your response 3 times and I still don't see where we disagree, assuming that we do.<p>I've mostly focused on the worst Docker horrors I've seen in production, extrapolating that to the future of containers, as pulling in new "containerized" dependencies will inevitably become just as effortless as it currently is with regular dependencies in the new-style high-level programming languages. You've primarily described a relatively fresh, or a well-managed Docker deployment, while admitting that spaghetti-style dependencies have become a norm and new layers will pile up (and by extension, make things hard to manage).<p>I think our points of view don't actually collide.
We do not disagree about the essence, but rather in emphasis. Some might say that sloppy engineers were happy to pack their Ruby-Goldbergesque deployments into containers. I say that even the most excellent and diligent engineers sometimes faced situations when two pieces of software required incompatible versions of a shared library, which depended on a tree of other libraries with incompatible versions, etc, and there's a practical limit to what you can and should do with bash scripts and abuse of LD_PRELOAD.<p>Many of the "new" languages, like Go (16 years), Rust (13 years), or Zig (9 years) can just build static binaries, not even depending on libc. This has both upsides and downsides, especially with security fixes. Rebuilding a container to include an updated .so dependency is often easier and faster than rebuilding a Rust project.<p>Docker (or preferably Podman) is not a replacement for linkers. It's an augmentation to the package system, and a replacement for the common file system layout, which is inadequate for modern multi-purpose use of a Unix (well, Linux) box.
I see, you're providing a complementary perspective. I appreciate that, and indeed, Docker isn't always evil. My intention was to bring attention to the abuse of it and compare it to virtualization of unikernels, which to me appears to be on a similar trajectory.<p>As for the linker analogy, I compared docker-compose (not Docker proper) to a dynamic linker because it's often used to bring up larger multi-container applications, similar to how large monolithic applications with plenty of shared library dependencies are put together by ld.so, and those multi-container applications can be similarly brittle if developed under the assumption that Docker will take care of all dependency problems, defeating most of its advantages and reducing it to a pile of excess layers of indirection.
Containers got popular at a time when an increasing number of people were finding it hard to install software on their systems locally - especially if you were, for instance, having to juggle multiple versions of Ruby or multiple versions of Python and those linked to various major versions of C libraries.<p>Unfortunately containers have always had an absolutely horrendous security story and they degrade performance by quite a lot.<p>The hypervisor is not going away anytime soon - it is what the entire public cloud is built on.<p>While you are correct that containers do add more layers - unikernels go the opposite direction and actively remove those layers. Also, imo the "attack surface" is by far the smallest security benefit - other architectural concepts such as the complete lack of an interactive userland is far more beneficial when you consider what an attacker actually wants to do after landing on your box. (eg: run their software)<p>When you deploy to AWS you have two layers of Linux - one that AWS runs and one that you run - but you don't really need that second layer and you can have much faster/safer software without it.
I can understand the public cloud argument; if the cloud provider insists on you delivering an entire operating system to run your workloads, a unikernel indeed slashes the number of layers you have to care about.<p>Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker. In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls? Both ultimately lead to the host OS servicing the request. There isn't any notable difference between the kernel/hypervisor and the user/kernel boundary in modern processors either; most of the time, privilege escalations come from errors in the software running in the privileged modes of the processor.<p>Technically, in the former case, besides exploiting the application, a hypothetical attacker will also have to exploit a flaw in QEMU to start processes or gain further privileges on the host, but that's just due to a layer of indirection. You can accomplish this without resorting to hardware virtualization. Once in QEMU, the entire assortment of your host's system calls and services is exposed, just as if you ran your code as a regular user space process.<p>This is the level at which you want to block exec() and other functionality your application doesn't need, so that neither QEMU nor your code run directly can perform anything out of their scope. Adding a layer of indirection while still leaving user/kernel, or unikernel/hypervisor junction points unsupervised will only stop unmotivated attackers looking for low-hanging fruit.
> In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls?<p>Hypervisors expose a much smaller API surface area to their tenants than an operating system does to its processes which makes them much easier to secure.
That is an artifact of implementation. Monolithic operating systems with tons of shared services expose lots to their tenants. Austere hypervisors, the ones with small API surface areas, basically implement a microkernel interface yet both expose significantly more surface area and offer a significantly worse guest experience than microkernels. That is why high security systems designed for multi-level security for shared tenants that need to protect against state actors use microkernels instead of hypervisors.
I can't speak for all the various projects but imo these aren't made for bare metal - if you want true bare metal (metal you can physically touch) use linux.<p>One of the things that might not be so apparent is that when you deploy these to something like AWS all the users/process mgmt/etc. gets shifted up and out of the instance you control and put into the cloud layer - I feel that would be hard to do with physical boxen cause it becomes a slippery slope of having certain operations (such as updates) needing auth for instance.
> other architectural concepts such as the complete lack of an interactive userland is far more beneficial when you consider what an attacker actually wants to do after landing on your box<p>What does that have to do with unikernel vs more traditional VMs? You can build a rootfs that doesn't have any interactive userland. Lots of container images do that already.<p>I am not a security researcher, but I wouldn't think it would be too hard to load your own shell into memory once you get access to it. At least, compared to pulling off an exploit in the first place.<p>I would think that merging kernel and user address spaces in a unikernel would, if anything, make it more vulnerable than a design using similar kernel options that did not attempt to merge everything into the kernel. Since now every application exploit is a kernel exploit.
A shell by design is explicitly made to run other programs. You type in 'ls', 'cd', 'cat', etc. but those are all different programs. A "webshell" can work to a degree as you could potentially upload files, cat files, write to files, etc. but you aren't running other programs under these conditions - that'd be code you're executing - scripting languages make this vastly easier than compiled ones. It's a lot more than just slapping a heavy-handed seccomp profile on your app.<p>Also merging the address space is not a necessity. In fact - 64-bit (which is essentially all modern cloud software) mandates virtual memory to begin with and many unikernel projects support elf loading.
> Unfortunately containers have always had an absolutely horrendous security story and they degrade performance by quite a lot.<p>This is demonstrably untrue.
Linux containers, you mean.<p>The story is quite different in HP-UX, AIX, Solaris, BSD, Windows, IBM i, z/OS,...
Windows has containers?
With a standard Windows Server license you are only allowed to have two Hyper-V virtual machines but unlimited "Windows containers". The design is similar to Linux with namespaces bolted onto the main kernel so they don't provide any better security guarantees than Linux namespaces.<p>Very useful if you are packaging trusted software and don't want to upgrade your Windows Server license.
Yes.<p>There are AppContainers. Those have existed for a while and are mostly targeted at developers intending to secure their legacy applications.<p><a href="https://learn.microsoft.com/en-us/windows/win32/secauthz/appcontainer-isolation" rel="nofollow">https://learn.microsoft.com/en-us/windows/win32/secauthz/app...</a><p>There's also Docker for Windows, with native Windows container support. This one is new-ish:<p><a href="https://learn.microsoft.com/en-us/virtualization/windowscontainers/about/" rel="nofollow">https://learn.microsoft.com/en-us/virtualization/windowscont...</a>
>what an attacker actually wants to do after landing on your box.<p>Aren't there ways of overwriting the existing kernel memory/extending it to contain a new application if an attacker is able to attack the running unikernel?<p>What protections are provided by the unikernel to prevent this?
To be clear, there are still numerous attacks one might lob at you. For instance, if you are running a node app and the attacker uploads a new js file that they can have the interpreter execute, that's still an issue. However, you won't be able to start running random programs like curling down some cryptominer or something - it'd all need to be contained within that code.<p>What becomes harder is if you have a binary that forces you to rewrite the program in memory as you suggest. That's where classic page protections come into play such as not exec'ing rodata, not writing to text, not exec'ing heap/stack, etc. Just to note that not all unikernel projects have this and even if they do it might be trivial to turn them off. The kernel I'm involved with (Nanos) has other features such as 'exec protection' which prevents that app from exec-mapping anything not already explicitly mapped exec.<p>Running arbitrary programs, which is what a lot of exploit payloads try to achieve, is pretty different than having to stuff whatever they want to run inside the payload itself. For example if you look at most malware it's not just one program that gets run - it's like 30. Droppers exist solely to load third party programs on compromised systems.
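To make the page-protection rules above a bit more tangible, here is a small Python sketch of the policy as a permission check. It is purely a toy model: the segment names, the `Prot` flags and the `allowed` function are invented for illustration, and this is not how Nanos or any other kernel actually implements it (real kernels enforce this through page table permission bits).

```python
from enum import Flag, auto


class Prot(Flag):
    """Page permissions, in the spirit of PROT_READ/PROT_WRITE/PROT_EXEC."""
    READ = auto()
    WRITE = auto()
    EXEC = auto()


# Maximum permissions per segment: text is never writable,
# rodata/heap/stack are never executable.
SEGMENT_POLICY = {
    "text": Prot.READ | Prot.EXEC,
    "rodata": Prot.READ,
    "heap": Prot.READ | Prot.WRITE,
    "stack": Prot.READ | Prot.WRITE,
}


def allowed(segment: str, requested: Prot) -> bool:
    """True if every requested permission is within the segment's policy."""
    policy = SEGMENT_POLICY[segment]
    # Adding the requested bits to the policy must not change it.
    return (requested | policy) == policy
```

Under this policy, `allowed("heap", Prot.READ | Prot.WRITE)` is accepted, while `allowed("heap", Prot.EXEC)` and `allowed("text", Prot.WRITE)` are refused, which is the property that makes it hard to inject code and then run it.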
If the stack and heap are non-executable and page tables can't be modified then it's hard to inject code. Whether unikernels actually apply this hardening is another matter.
I always thought of Docker as a "fuck it" solution. It's the epitome of giving up. Instead of some department at a company releasing a libinference.so.3 and a libinference-3.0.0.x86_64.deb they ship some docker image that does inference and call it a microservice. They write that they launched, get a positive performance review, get promoted, and the Docker containers continue to multiply.<p>Python package management is a disaster. There should be ways of having multiple versions of a package coexist in /usr/lib/python, nicely organized by package name and version number, and import the exact version your script wants, without containerizing everything.<p>Electron applications are the other type of "fuck it" solution. There should be ways of writing good-looking native apps in JavaScript without actually embedding a full browser. JavaScript is actually a nice language to write front-ends in.
> Python package management is a disaster. There should be ways of having multiple versions of a package coexist in /usr/lib/python, nicely organized by package name and version number, and import the exact version your script wants, without containerizing everything.<p>Have you tried uv?
Well sure, every language has some band-aid. The real solution should have been Python itself supporting:<p><pre><code> import torch==2.9.1
</code></pre>
Instead of a bunch of other useless crap additions to the language, this should have been a priority, along with the ability for multiple versions to coexist in PYTHONPATH.
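For what it's worth, the closest thing available today without new syntax is probably a runtime check. Below is a minimal sketch that uses the standard library's `importlib.metadata.version` to verify an exact pin before importing. The helper names `parse_pin` and `require` are made up for this example, and note the limitation: it only checks which version is installed, it does not let several versions coexist.

```python
from importlib import metadata


def parse_pin(spec: str) -> tuple[str, str]:
    """Split an exact pin like 'torch==2.9.1' into (name, version)."""
    name, sep, version = spec.partition("==")
    name, version = name.strip(), version.strip()
    if not sep or not name or not version:
        raise ValueError(f"expected 'name==version', got {spec!r}")
    return name, version


def require(spec: str, lookup=metadata.version) -> None:
    """Raise ImportError unless the installed version matches the pin exactly.

    `lookup` maps a distribution name to its installed version string; it
    defaults to importlib.metadata.version and can be swapped out for testing.
    """
    name, wanted = parse_pin(spec)
    installed = lookup(name)
    if installed != wanted:
        raise ImportError(
            f"{name}=={wanted} required, but {installed} is installed"
        )
```

A script would call `require("torch==2.9.1")` and then `import torch` as usual. If the package is not installed at all, `importlib.metadata.version` raises `PackageNotFoundError`, which this sketch lets propagate.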
There is a vast amount of complexity involved in rolling things from scratch today in this fractured ecosystem and providing the same experience for everyone.<p>Sometimes, the reduction of development friction is the only reason a product ends up in your hands.<p>I say this as someone whose professional toolkit includes Docker, Python and Electron; not necessarily tools of choice, but I'm one guy trying to build a lot of things and life is short. This is not a free lunch and the optimizer within me screams out whenever performance is left on the table, but everything is a tradeoff. And I'm always looking for better tools, and keep my eyes on projects such as Tauri.
I think there's merit to your criticisms of the way docker is used, but it also seems like it provides substantial benefits for application developers. They don't need to beg OS maintainers to update the package, and they don't need to maintain builds for different (OS, version) targets any more.<p>They can just say "here's the source code, here's a container where it works, the rest is the OS maintainer's job, and if Debian users running 10 year old software bug me I'm just gonna tell them to use the container"
Agree on all fronts. The advent of Dockerfiles as a poor man's packaging system and the per-language package managers has set the industry back several years in some areas IMHO.
> JavaScript is actually a nice language to write front-ends in.<p>I've written my fair share of GUIs, and React (and thus Javascript) is great compared to, I don't know, PHP, but CSS is the absolute devil.
> This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.<p>The difference is that you can move that whole bunch of interlinked containers to another machine and it will work. You don't get that when running on bare hardware. The technology of "containers" is ultimately about having the kernel expose a cleaned up "namespaced" interface to userspace running inside the container, that abstracts away the details of the original machine. This is very much <i>not</i> intended as "sandboxing" in a security sense, but for most other system administration purposes it gets pretty darn close.
I’ve had similar concerns.<p>At some point, few people even understand the whole system and whether all these layers are actually accomplishing anything.<p>It’s especially bad when the code running at rarified levels is developed by junior engineers and “sold” as an opaque closed source thing. It starts to actually weaken security in some ways but nobody is willing to talk about that.<p>“It has electrolytes…”
> This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.<p>Yea, with unneeded bloat like rule based access controls, ACS and secret management. Some comments on this site.
<i>This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.</i><p>I think you're over-egging the pudding. In reality, you're unlikely to use more than 2 types of container host (local dev and normal deployment maybe), so I think we've moved way beyond square 1. Config is normally very similar, just expressed differently, and being able to encapsulate dependencies removes a ton of headaches.
Nix is where we're going. Maybe not with the configuration language that annoys python devs, but declarative reproducible system closures are a joy to work with at scale.
> <i>Nix ... declarative reproducible system closures are a joy to work with ...</i><p>From what I read, I gather nixpkgs are more hermetic (as in Bazel [0]) & not reproducible? <a href="https://discourse.nixos.org/t/nixos-is-not-reproducible/42688/4" rel="nofollow">https://discourse.nixos.org/t/nixos-is-not-reproducible/4268...</a> / <a href="https://archive.vn/mXeih" rel="nofollow">https://archive.vn/mXeih</a><p>[0] <a href="https://bazel.build/basics/hermeticity" rel="nofollow">https://bazel.build/basics/hermeticity</a>
I've been running either Qubes OS or KVM/QEMU based VMs as my desktop daily driver for 10 years. Nothing runs on bare metal except for the host kernel/hypervisor and virt stack.<p>I've achieved near-native performance for intensive activities like gaming, music and visual production. Hardware acceleration is kind of a mess but using tricks like GPU passthrough for multiple cards, dedicated audio cards and block device passthrough, I can achieve great latency and performance.<p>One benefit of this is that my desktop acts as a mainframe, and streaming machines to thin clients is easy.<p>My model for a long time has been not to trust anything I run, and this allows me to keep both my own and my client's work reasonably safe from a drive-by NPM install or something of that caliber.<p>Now that I also use an Apple Silicon MacBook as a daily driver, I very much miss the comfort of a fully virtualized system. I do stream in virtual machines from my mainframe. But the way Tahoe is shaping up, I might soon put Asahi on this machine and go back to a fully virtualized system.<p>I think this is the ideal way to do things, however, it will need to operate mostly transparently to an end user or they will quickly get security fatigue; the sacrifices involved today are not for those who lack patience.<p>Also, relevant XKCDs:<p><a href="https://www.explainxkcd.com/wiki/index.php/2044:_Sandboxing_Cycle" rel="nofollow">https://www.explainxkcd.com/wiki/index.php/2044:_Sandboxing_...</a><p><a href="https://www.explainxkcd.com/wiki/index.php/2166:_Stack" rel="nofollow">https://www.explainxkcd.com/wiki/index.php/2166:_Stack</a>
I think it's fine if you do it for yourself. It's a bit of a poor man's Linux-turned-microkernel solution. In fact, I work like this too, and this extends to my Apple Silicon Mac. The separation does have big security advantages, especially when different pieces of hardware are exclusively passed to the different, closed-off "partitions" of the system and the layer orchestrating everything is as minimal as it gets, or at least as guarded against the guests as it gets.<p>What worries me is when this model escalates from being cobbled together by a system administrator with limited resources, to becoming baked into the design of software; the appropriation of the hypervisor layer by software developers who are reluctant to untangle the mess they've created at the user/kernel boundary of their program and instead start building on top of hardware virtualization for "security", to ultimately go on and pollute the hypervisor as the level of host OS access proves insufficient. This is beautifully portrayed by the first XKCD you've linked. I don't want to lose the ability to securely run VMs as the interface between the host and the guest OSes grows just as unmanageable as that of Linux and BSD system calls and new software starts demanding that I let it use the entirety of it, just like some already insist that I let it run as root because privilege dropping was never implemented.<p>If you develop software, you should know what kind of operating system access it needs to function and sandbox it appropriately, using the operating system's sandboxing facilities, not the tools reserved for system administrators.
> One benefit of this is that my desktop acts as a mainframe,<p>Are you for real? Tell us you've never worked on a mainframe without telling us you've never worked on a mainframe.
If you're going to bring up ARM and EL levels, but not mention rings/CPL on x86, the discussion seems incomplete.