Native ZFS VDEV for Object Storage (OpenZFS Summit)

(zettalane.com)

127 points by suprasam 24 days ago

10 comments

kev009 24 days ago
I am curious about the inverse: using the dataset layer to implement higher-level things, such as objects for an S3-compatible store or pages directly for an RDBMS. I seem to remember hearing rumblings about that, but it is hard to dredge up.
- p_l 24 days ago
 ZFS-Lustre operates this way. The main issue with opening it up further is the lack of a DMU-level userland API, especially given how syscall-heavy it could get (and io_uring might be locked out due to politics).
- magicalhippo 24 days ago
 There was some work on this presented at one of the OpenZFS summits. However, it never got submitted. Not sure if it remains a private feature or if they hit some roadblocks. In theory it should be a pretty good match, considering that internally ZFS is an object store.
- suprasam 24 days ago
 For RDBMS pages on object storage - you might be thinking of Neon.tech. They built a custom page server for PostgreSQL that stores pages directly on S3.
infogulch 24 days ago
How suitable would this be as a zfs send target to back up your local zfs datasets to object storage?
- suprasam 24 days ago
 Yes, this is a core use case that ZFS fits nicely. See slide 31, "Multi-Cloud Data Orchestration", in the talk. Not only backup but also DR site recovery.
 <pre><code>The workflow:
 1. Server A (production): zpool on local NVMe/SSD/HD
 2. Server B (same data center): another zpool backed by objbacker.io → remote object storage (Wasabi, S3, GCS)
 3. zfs send from A to B - data lands in object storage

 Key advantage: no continuously running cloud VM. You're just paying for object storage (cheap) not compute (expensive).
 Server B is in your own data center - it can be a VM too.
 </code></pre>
 For DR, when you need the data in cloud:
 <pre><code>- Spin up a MayaNAS VM only when needed
 - Import the objbacker-backed pool - data is already there
 - Use it, then shut down the VM
 </code></pre>
 - infogulch 23 days ago
 Could you do this with two separate zpools on the same server?
 <pre><code>zfs send -R localpool@[snapshot] | zfs recv -F objbackerpool
 </code></pre>
 Is there a particular reason why you'd want the objbacker pool to be a separate server?
- p_l 24 days ago
 Quite probably should work just fine.
 The secret is that ZFS actually implements an object storage layer on top of block devices and only then implements ZVOL and ZPL (ZFS POSIX filesystem) on top of that.
 A "zfs send" is essentially a serialized stream of objects sorted by dependency (objects later in the stream will refer to objects earlier in the stream, but not the other way around).
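The "sorted by dependency" property described above can be sketched with a toy model. This is only an illustration: the object names and the dictionary layout are hypothetical and have nothing to do with the real `zfs send` stream format. It uses Python's standard `graphlib` module to order objects so that each one follows everything it refers to.

```python
from graphlib import TopologicalSorter


def serialize(objects):
    """Order object ids so each object comes after everything it refers to.

    `objects` maps an object id to the ids it references. This is a toy
    model, not the real ZFS send stream format.
    """
    return list(TopologicalSorter(objects).static_order())


def refers_only_backwards(stream, objects):
    """Return True if no object in the stream refers to a later object."""
    seen = set()
    for obj in stream:
        if any(ref not in seen for ref in objects[obj]):
            return False
        seen.add(obj)
    return True


# Hypothetical objects: files refer to a directory, which refers to the root.
objects = {"root": [], "dir": ["root"], "file_a": ["dir"], "file_b": ["dir"]}
stream = serialize(objects)
```

A receiver reading such a stream front to back never meets a reference it cannot resolve, which is what makes piping the stream into any other pool plausible.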
PunchyHamster 24 days ago
FS metrics without a random I/O benchmark are nearly meaningless. Sequential read is the best case for basically every file system, and in this case it's essentially "how fast can you get things from S3".
- suprasam 24 days ago
 It is all part of the ZFS architecture, with two tiers:
 - Special vdev (SSD): all metadata + small blocks (configurable threshold, typically <128KB)
 - Object storage: bulk data only
 If the workload is randomized 4K small data blocks, that's SSD latency, not S3 latency.
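The two-tier split described above can be sketched as a small placement function. This is a toy model under stated assumptions: the function name, the tier labels, and the strict `<` comparison are hypothetical and simply follow the wording of the comment, so the actual ZFS allocation rules may differ in detail.

```python
SPECIAL_VDEV = "special vdev (SSD)"
OBJECT_STORAGE = "object storage"


def place_block(size_bytes, is_metadata, small_block_threshold=128 * 1024):
    """Pick the tier for one block in this toy model.

    Metadata and blocks below the threshold go to the SSD tier;
    everything else is treated as bulk data for object storage.
    The 128 KiB default mirrors the "typically <128KB" in the comment.
    """
    if is_metadata or size_bytes < small_block_threshold:
        return SPECIAL_VDEV
    return OBJECT_STORAGE


# A random 4K data block lands on SSD; a 1 MiB data block goes to object storage.
small = place_block(4 * 1024, is_metadata=False)
bulk = place_block(1024 * 1024, is_metadata=False)
```

Under this model a random 4K workload never touches the object store, which is the point the comment makes about latency.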
- gigatexal 24 days ago
 Yup. IIRC low-queue-depth random reads are king for desktop usage.
yjftsjthsd-h 24 days ago
Could someone possibly compare this to <a href="https://www.zerofs.net/nbd-devices" rel="nofollow">https://www.zerofs.net/nbd-devices</a> ("zpool create mypool /dev/nbd0 /dev/nbd1 /dev/nbd2")?
- suprasam 24 days ago
 ZeroFS doesn't exploit ZFS's strengths: it has no native ZFS support, just an afterthought with NBD + SlateDB LSM. It is good for small burst workloads where everything is kept in memory for LSM batch writes. Once compaction hits, all bets are off with performance, and I'm not sure about crash consistency, since it is playing with fire. ZFS special vdev + ZIL on SSD is much safer. No need for an LSM. With MayaNAS, ZFS metadata runs at SSD speed, and large blocks get throughput from high-latency S3 at network speed.
 - Eikon 24 days ago
 ZeroFS author here.
 LSMs are “for small burst workloads kept in memory”? That’s just incorrect. “Once compaction hits all bets are off” suggests a misunderstanding of what compaction is for.
 “Playing with fire,” “not sure about crash consistency,” “all bets are off”
 Based on what exactly? ZeroFS has well-defined durability semantics, with guarantees that are much stronger than those of local block devices. If there’s a specific correctness issue, name it.
 “ZFS special vdev + ZIL is much safer”
 Safer how?
- 0x457 24 days ago
 I know I'm missing something, but I can't figure it out: why not just one device?
 - yjftsjthsd-h 24 days ago
 IIRC the point is that each NBD device is backed by a different S3 endpoint, probably in different zones/regions/whatever for resiliency.
 Edit: Oops, "zpool create global-pool mirror /dev/nbd0 /dev/nbd1" is a better example for that. If it's not that, I'm not sure what that first example is doing.
 - 0x457 24 days ago
 In the context of real AWS S3, I can see RAID 0 being useful in this scenario, but a mirror seems like too much duplication, and cross-region replication like this is going to introduce significant latency[citation needed]. AWS provides that for S3 already.
 I can see it on not-real S3, though.
 - mgerdts 24 days ago
 Mirroring between S3 providers would seemingly give protection against your account being locked at one of them.
 I expect this becomes most interesting with L2ARC and cache (ZIL) devices to hold the working set and hide write latency. It might require tuning or changes to allow 1M writes to use the cache device.
digiown 24 days ago
Exciting stuff, but will this be merged? I remember another similar effort that went nowhere because the company decided not to proceed with it.
curt15 24 days ago
How does this relate to the work presented a few years ago by the ZFS devs using S3 as object storage? <a href="https://youtu.be/opW9KhjOQ3Q?si=CgrYi0P4q9gz-2Mq" rel="nofollow">https://youtu.be/opW9KhjOQ3Q?si=CgrYi0P4q9gz-2Mq</a>
- magicalhippo 24 days ago
 Just going by the submitted article, it seems very similar in what it achieves, but seems to be implemented slightly differently. As I recall, the Delphix solution did not use a character device to communicate with the user-space S3 service, and it relied on a local NVMe-backed write cache to make 16 kB blocks performant by coalescing them into large objects (10 MB IIRC).
 This solution instead seems to rely on using 1 MB blocks and storing those directly as objects, eliminating the intermediate caching and indirection layer. That means a larger number of objects but less local overhead.
 Delphix's rationale for 16 kB blocks was that their primary use case was PostgreSQL database storage. I presume this is geared toward other workloads.
 And, importantly since we're on HN, Delphix's user-space service was written in Rust as I recall, while this uses Go.
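The trade-off between coalescing small blocks and storing large blocks directly can be made concrete with some arithmetic. This is a minimal sketch under assumptions not in the thread: it uses binary units, a hypothetical 1 TiB of data, and perfectly packed objects, and the 16 kB and 10 MB figures are the commenter's recollection rather than confirmed values.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB


def object_count(data_bytes, object_bytes):
    """Number of objects needed to hold the data (integer ceiling division)."""
    return (data_bytes + object_bytes - 1) // object_bytes


# Coalescing approach: 16 KiB blocks packed into 10 MiB objects.
blocks_per_object = (10 * MIB) // (16 * KIB)

# Hypothetical 1 TiB of data under each layout.
coalesced_objects = object_count(TIB, 10 * MIB)
direct_objects = object_count(TIB, 1 * MIB)

print(blocks_per_object)   # 640
print(coalesced_objects)   # 104858
print(direct_objects)      # 1048576
```

In this model the direct layout needs roughly ten times as many objects, but it avoids the local cache and the block-to-object indirection that coalescing requires.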
- tw04 24 days ago
 AFAIK it was never released, and it used FUSE, it wasn’t native.
terinjokes 23 days ago
It doesn't look like the source has been released, nor any documentation outside this blog post and presentation. Is there a plan to open this up past what is used by MayaNAS and Zettalane's cloud offerings?
doktor2u 24 days ago
That’s brilliant! Always amazed at how zfs keeps morphing and stays relevant!
glemion43 24 days ago
I do not get it. Why would I use zfs for this? Isn't the power of zfs that it's a filesystem with checksums and stuff like encryption? Why would I use it for s3?
- mustache_kimono 24 days ago
 > Why would I use it for s3?
 You have it the wrong way around. Here, ZFS uses many small S3 objects as the storage substrate, rather than physical disks. The value proposition is that this should definitely be cheaper and perhaps more durable than EBS.
 See s3backer, a FUSE implementation of something similar: <a href="https://github.com/archiecobbs/s3backer" rel="nofollow">https://github.com/archiecobbs/s3backer</a>
 See prior in-kernel ZFS work by Delphix, which AFAIK was closed by Delphix management: <a href="https://www.youtube.com/watch?v=opW9KhjOQ3Q" rel="nofollow">https://www.youtube.com/watch?v=opW9KhjOQ3Q</a>
 BTW this appears to be closed too!
 - suprasam 24 days ago
 Yes, that is the value prop. Cheap S3 instead of expensive EBS.
 <pre><code>EBS limitations:
 - Per-instance throughput caps
 - Pay for full provisioned capacity whether filled or not

 S3:
 - Pay only for what you store
 - No per-instance bandwidth limits as long as you have network optimized instance
 </code></pre>
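The "pay for provisioned capacity" versus "pay for what you store" distinction above can be sketched as a tiny cost model. Everything numeric here is an arbitrary placeholder chosen for illustration, not a real price from any provider, and the model deliberately ignores request, egress, and compute charges.

```python
def provisioned_cost(provisioned_gb, price_per_gb_month):
    """Block storage model: billed on provisioned capacity, used or not."""
    return provisioned_gb * price_per_gb_month


def stored_cost(stored_gb, price_per_gb_month):
    """Object storage model: billed only on the data actually stored."""
    return stored_gb * price_per_gb_month


# Hypothetical scenario: 10,000 GB provisioned, only 3,000 GB actually used.
# Prices are placeholder units per GB-month, NOT real provider pricing.
block_monthly = provisioned_cost(10_000, price_per_gb_month=8)
object_monthly = stored_cost(3_000, price_per_gb_month=2)
```

The gap in this model comes from two separate factors: the unit price and the unused provisioned capacity. Whether it holds for a real workload depends on the charges the sketch leaves out.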
- bakies 24 days ago
 I've built a massive storage server that I want to run the S3 protocol on. It's already running ZFS. This is exactly what I want.
 zfs-share already implements SMB and NFS.
 - 0x457 24 days ago
 This is not what it is. This is building a zpool on top of an S3 backend (vdev).
 I'm not sure what the use case is, out of my own ignorance, but I guess one can use it to `zfs send` backups to S3 in a very neat manner.
 - lkjdsklf 24 days ago
 One use case that comes to mind is backups. I can create a zpool backed by an S3 vdev and then use zfs send | zfs recv to back up datasets to S3 (or the billion other S3-like providers).
 That saves me the step of creating an instance with EBS volumes and snapshotting those to S3 or whatever.
 I haven't done the math at all on whether that's cost effective, but that's the use case that comes to mind immediately.
 - suprasam 24 days ago
 I hope you are not keeping that massive storage on a public cloud; if you were, you would need MayaNAS to reduce storage costs. For S3 as a frontend, use the MinIO gateway, which serves the S3 API from your ZFS filesystem.