Don't forget about entropy! You've just created two identical copies of all of your random number generators, which could be very bad for security.<p>The Firecracker team wrote a very good paper about addressing this when they added snapshot support.
Good callout. We seed entropy before snapshot to unblock getrandom(), but forks still share CSPRNG state. The proper fix per Firecracker’s docs is RNDADDENTROPY + RNDRESEEDCRNG after each fork, plus reseeding userspace PRNGs like numpy separately. On the roadmap. <a href="https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/random-for-clones.md" rel="nofollow">https://github.com/firecracker-microvm/firecracker/blob/main...</a>
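For the curious, the kernel-side half of that fix is just two ioctls on /dev/random. A rough sketch (not the project's actual code; constants derived from &lt;linux/random.h&gt;, and the call needs CAP_SYS_ADMIN):

```python
import fcntl
import os
import struct

RNDADDENTROPY = 0x40085203  # _IOW('R', 0x03, int[2]): credit fresh entropy
RNDRESEEDCRNG = 0x00005207  # _IO('R', 0x07): force a CRNG reseed (Linux >= 4.17)

def reseed_kernel_crng(entropy: bytes) -> None:
    # struct rand_pool_info { int entropy_count; int buf_size; __u32 buf[]; }
    # entropy_count is in bits, buf_size in bytes.
    req = struct.pack("ii", len(entropy) * 8, len(entropy)) + entropy
    fd = os.open("/dev/random", os.O_WRONLY)  # requires CAP_SYS_ADMIN
    try:
        fcntl.ioctl(fd, RNDADDENTROPY, req)
        fcntl.ioctl(fd, RNDRESEEDCRNG)
    finally:
        os.close(fd)
```

Run once per fork, inside the guest, with entropy sourced from outside the snapshot (e.g. from the host).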
It looks like firecracker already supports ACPI vmgenid, which will trigger Linux random to reseed? <a href="https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md#reusing-snapshotted-states-securely" rel="nofollow">https://github.com/firecracker-microvm/firecracker/blob/main...</a><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/virt/vmgenid.c" rel="nofollow">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...</a><p>So that just (!) leaves userspace PRNGs.
I suppose it'd be easy enough to re-seed RNGs, but re-relocating ASLR sounds like a pain. (Although I suppose for Python that doesn't matter)
Re-seeding is easy. The hard parts are (a) finding everything which needs to be reseeded -- not just explicit RNGs but also things like keys used to pick outgoing port numbers in a pseudorandom order -- and (b) making sure that all the relevant code becomes aware that it was just forked -- not necessarily trivial given that there's no standard "you just got restarted from a snapshot" signal in UNIX.
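Worth noting that for plain process forks (as opposed to snapshot restores) there is at least a hook: Python's os.register_at_fork fires in every child, so userspace PRNGs can reseed themselves. Nothing equivalent fires on restore-from-snapshot, which is exactly the gap. A small illustration (hypothetical helper names, Unix only):

```python
import os
import random

def reseed():
    # Re-seed the userspace PRNG from the kernel in every forked child.
    random.seed(os.urandom(16))

os.register_at_fork(after_in_child=reseed)

def child_sample() -> float:
    # Fork, draw one random number in the child, ship it back over a pipe.
    r, w = os.pipe()
    if os.fork() == 0:
        os.close(r)
        os.write(w, str(random.random()).encode())
        os._exit(0)
    os.close(w)
    val = os.read(r, 64)
    os.wait()
    return float(val)

a, b = child_sample(), child_sample()
# Without the after_in_child hook, both children inherit identical PRNG
# state from the parent and emit the same value.
```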
Off the cuff, the first step for ASLR is to not publish your images and to rotate your snapshots regularly.<p>The old FastCGI trick is to buffer the forking by idling half a dozen or ten copies of the process and initializing new instances in the background while the existing pool services new requests. By my count we are reinventing FastCGI for at least the fourth time.<p>Long-running tasks are less sensitive to startup delays, because we care a lot about a 4-second task taking an extra 5 seconds and much less about a 1-minute task taking 1:05. It amortizes out even under Little's Law.
Niiiiiice, I've been working on something like this, but reducing linux boot time instead of snapshot restore time; obviously my solution doesn't work for heavy runtimes
It's so frustrating seeing all this sandbox tooling pop up for Linux while Windows is soooooo far behind. I mean, Windows Sandbox ( <a href="https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/" rel="nofollow">https://learn.microsoft.com/en-us/windows/security/applicati...</a> ) doesn't even have customizable networking allowlists. You can turn networking on or off, but that's about as fine-grained as it gets. So those of us still writing desktop Windows software are left without a good way to put our agents in a blast-proof box.
Nice to see this work! I experimented with this for exe.dev before we launched. The VM itself worked really well, but there was a lot of setup to get the networking functioning. And in the end, we target use cases that don't mind a ~1-second startup time, which meant doing a clean systemd start each time was easier.<p>That said, I have seen several use cases where people want a VM for something minimal, like a Python interpreter, and this is absolutely the sort of approach they should be using. Lots of promise here, excited to see how far you can push it!
Does it only work with that specific version of Firecracker, and only with VMs with 1 vCPU?<p>More than the sub-ms startup time, the 258KB of RAM per VM is huge.
1 vCPU per fork currently. Multi-vCPU is doable (per-vCPU state restore in a loop) but would multiply fork time.<p>On Firecracker version: tested with v1.12, but the vmstate parser auto-detects offsets rather than hardcoding them, so it should work across versions.
Does this need passthrough or might we be able to leverage PVM with it on a passthrough-less cloud VM/VPS?
Your write-up made me think of:<p><a href="https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-seconds" rel="nofollow">https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...</a><p>Are there parallels?
The tricky part of doing this in production is cloning sandboxes across nodes. You would have to snapshot the resident memory, file system (or a CoW layer on top of the rootfs), move the data across nodes, etc.
Agreed, cross-node is the hard next step. For now, single-node density gets you surprisingly far: 1000 concurrent sandboxes on one $50 box. When we need multi-node, userfaultfd with remote page fetch is the likely path.
Is this relevant?<p><a href="https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-seconds" rel="nofollow">https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...</a>
Similar to sprites.dev?
I noticed that you implemented a high-performance VM fork. However, to me, it seems like a general-purpose KVM project. Is there a reason why you say it is specialized for running AI agents?
Fair question. The fork engine itself is general purpose -- you could use it for anything that needs fast isolated execution. We say 'AI agents' because that's where the demand is right now. Every agent framework (LangChain, CrewAI, OpenAI Assistants) needs sandboxed code execution as a tool call, and the existing options (E2B, Daytona, Modal) all boot or restore a VM/container per execution. At sub-millisecond fork times, you can do things that aren't practical with 100-200ms startup: speculative parallel execution (fork 10 VMs, try 10 approaches, keep the best), treating code execution like a function call instead of an infrastructure decision, etc.
Cool approach. Are you guys planning on creating a managed version?
The API in the readme is live right now -- you can curl it. Plan is multi-region, custom templates with your own dependencies, and usage-based pricing. Email in my profile if you want early access.
Thanks! Yes, there's going to be a managed version.
This is how Android processes work (zygote forking), but it's a security problem, since it breaks some ASLR-type protections.
I keep so so so many opencode windows going. I wish I had bought a better SSD, because I have so much swap space to support it all.<p>I keep thinking I need to see if CRIU (Checkpoint/Restore In Userspace) will work here, so I can put work down for longer stretches and close &amp; restore instances sort of on demand.<p>I don't really love the idea of using VMs more, but I super love this project. Heck yes, forking our processes/VMs.
You could throw this on a VPS or server and it could help in that regard: (disclaimer, my thing)<p><a href="https://GitHub.com/jgbrwn/vibebin" rel="nofollow">https://GitHub.com/jgbrwn/vibebin</a>
CRIU is great for save/restore. The nice thing about CoW forking is it's cheap branching, not just checkpointing. You can clone a running state thousands of times at a few hundred KB each.
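The same branch-don't-checkpoint idea shows up at the process level with plain fork(): the child gets a CoW view of the parent's memory, mutates its own private pages, and the parent's copy stays untouched. A toy sketch (hypothetical helper, Unix only, not how the VM-level fork is implemented):

```python
import os

def fork_branch(state, mutate):
    """Fork, run `mutate` on a CoW copy of `state`, return the child's result."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        mutate(state)  # writes land on the child's private CoW pages
        os.write(w, repr(state).encode())
        os._exit(0)
    os.close(w)
    out = b""
    while chunk := os.read(r, 4096):
        out += chunk
    os.waitpid(pid, 0)
    return eval(out)

base = [0, 1, 2]
branched = fork_branch(base, lambda s: s.append(99))
# base is unchanged in the parent; branched carries the child's mutation
```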
Can you run this in another sandbox? Not sure why you'd want to... but can you?
Nested page tables / nested virtualization made it to consumer CPUs about a decade ago, so yes :)
It's pretty common to run VMs within containers so an attacker has to escape twice. You can probably disable 99% of system calls.
Mods: can we merge with <a href="https://news.ycombinator.com/item?id=47412812">https://news.ycombinator.com/item?id=47412812</a>?
On tail latency: KVM VM creation is 99.5% of the fork cost - create_vm, create_irq_chip, create_vcpu, and restoring CPU state. The CoW mmap is ~4 microseconds regardless of load. P99 at 1000 concurrent is 1.3ms. The mmap CoW page faults during execution are handled transparently by the host kernel and don't contribute to fork latency.<p>On snapshot staleness: yes, forks inherit all internal state including RNG seeds. For dependency updates you rebuild the template (~15s). No incremental update - full re-snapshot, similar to rebuilding a Docker image.<p>On the memory number: 265KB is the fork overhead before any code runs. Under real workloads we measured 3.5MB for a trivial print(), ~27MB for numpy operations. But 93% of pages stay shared across forks via CoW. We measured 100 VMs each running numpy sharing 2.4GB of read-only pages with only 1.75MB private per VM. So the real comparison to E2B's ~128MB is more like 3-27MB depending on workload, with most of the runtime memory shared.
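One way to sanity-check that kind of shared-vs-private split from the host side is /proc/&lt;pid&gt;/smaps_rollup (Linux 4.14+), which sums the per-mapping counters. A rough sketch (my own helper, not the project's tooling):

```python
def rss_breakdown(pid="self"):
    """Return (private_kb, shared_kb) RSS for a process from smaps_rollup."""
    wanted = ("Private_Clean", "Private_Dirty", "Shared_Clean", "Shared_Dirty")
    stats = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in wanted:
                stats[key] = int(rest.split()[0])  # value is in kB
    private = stats["Private_Clean"] + stats["Private_Dirty"]
    shared = stats["Shared_Clean"] + stats["Shared_Dirty"]
    return private, shared

priv_kb, shared_kb = rss_breakdown()
```

Pointing it at each Firecracker process across a fleet of forks would show the private-per-VM number growing only as pages get dirtied.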